Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Flink is and how the project got started?
  • What are some of the primary ways that Flink is used?
  • How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
    • What are some use cases that Flink is uniquely qualified to handle?
  • Where does Flink fit into the current data landscape?
  • How is Flink architected?
    • How has that architecture evolved?
    • Are there any aspects of the current design that you would do differently if you started over today?
  • How does scaling work in a Flink deployment?
    • What are the scaling limits?
    • What are some of the failure modes that users should be aware of?
  • How is the statefulness of a cluster managed?
    • What are the mechanisms for managing conflicts?
    • What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
    • Can state be shared across processes or tasks within a Flink cluster?
  • What are the comparative challenges of working with bounded vs unbounded streams of data?
  • How do you handle out-of-order events in Flink, especially as the delay for a given event increases? (a toy sketch of the watermark idea follows this list)
  • For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
  • What are some of the most challenging or complicated aspects of building and maintaining Flink?
  • What are some of the most interesting or unexpected ways that you have seen Flink used?
  • What are some of the improvements or new features that are planned for the future of Flink?
  • What are some features or use cases that you are explicitly not planning to support?
  • For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
    • What do they find most interesting or exciting?
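
One recurring theme in the interview is the out-of-order question above. As a toy illustration of the watermark idea that Flink’s event-time processing is built on (this is plain Python rather than Flink’s actual API, and the window size and allowed lateness values are arbitrary), a processor buffers events into windows and only emits a window once the watermark, the highest event timestamp seen so far minus the allowed lateness, has passed the window’s end:

    from collections import defaultdict

    ALLOWED_LATENESS = 5   # seconds of out-of-orderness tolerated before closing a window
    WINDOW_SIZE = 10       # tumbling window length in seconds

    windows = defaultdict(list)   # window start time -> buffered event values
    max_event_time = 0            # highest event timestamp observed so far

    def on_event(timestamp, value):
        """Buffer an event into its window, then emit any windows the watermark has passed."""
        global max_event_time
        window_start = (timestamp // WINDOW_SIZE) * WINDOW_SIZE
        windows[window_start].append(value)
        max_event_time = max(max_event_time, timestamp)

        # The watermark is a claim that no event older than this should still arrive.
        watermark = max_event_time - ALLOWED_LATENESS
        for start in sorted(w for w in windows if w + WINDOW_SIZE <= watermark):
            print(f"window [{start}, {start + WINDOW_SIZE}): {windows.pop(start)}")

    # Events arrive out of order; the window [0, 10) only closes once the
    # watermark (15 - 5 = 10) has reached its end.
    for ts, val in [(1, "a"), (12, "b"), (4, "c"), (15, "d")]:
        on_event(ts, val)

Flink itself handles this with timestamp assignment, watermark generation, and window operators, and events that arrive later than the allowed lateness can be dropped or routed to a side output.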

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

Summary

A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Upsolver is and how it got started?
    • What are your goals for the platform?
  • There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
    • What are the shortcomings of a data lake architecture?
  • How is Upsolver architected?
    • How has that architecture changed over time?
    • How do you manage schema validation for incoming data?
    • What would you do differently if you were to start over today?
  • What are the biggest challenges at each of the major stages of the data lake?
  • What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
  • When is Upsolver the wrong choice for an organization considering implementation of a data platform?
  • Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
  • What features or improvements do you have planned for the future of Upsolver?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Looker is and the problem that it is aiming to solve?
    • How do you define business intelligence?
  • How is Looker unique from other approaches to business intelligence in the enterprise?
    • How does it compare to open source platforms for BI?
  • Can you describe the technical infrastructure that supports Looker?
  • Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
  • For someone who is using Looker, what does their workflow look like?
    • How does that change for different user roles (e.g. data engineer vs. sales management)?
  • What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
  • What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
    • What are the portions of the Looker architecture that you would do differently if you were to start over today?
  • What are some of the most interesting or unusual uses of Looker that you have seen?
  • What is in store for the future of Looker?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

Summary

Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.
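
A minimal sketch of what “running the notebook directly” can look like, using the open source papermill library associated with this work at Netflix (the notebook path and parameter names below are hypothetical):

    import papermill as pm

    # Execute a parameterized notebook headlessly, writing the executed copy
    # (including all cell outputs) to a new file that can be archived, reviewed,
    # or linked from a failed job for debugging.
    pm.execute_notebook(
        "etl_report.ipynb",                        # hypothetical source notebook
        "output/etl_report_2018-11-05.ipynb",      # executed copy for this run
        parameters={"run_date": "2018-11-05", "region": "us-east-1"},
    )

Because the output is itself a notebook, the artifact of a scheduled run carries its own logs, charts, and intermediate results, which is much of the appeal of using notebooks as the execution format.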

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
    • Where are you using notebooks and where are you not?
  • What is the technical infrastructure that you have built to support that design choice?
  • Which team was driving the effort?
    • Was it difficult to get buy in across teams?
  • How much shared code have you been able to consolidate or reuse across teams/roles?
  • Have you investigated the use of any of the other notebook platforms for similar workflows?
  • What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
  • What are some of the limitations of the notebook environment for the work that you are doing?
  • What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
  • What are some of the projects that are ongoing or planned for the future that you are most excited by?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
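
A minimal sketch of adopting it in a project (assuming the deon package is installed from PyPI; the output filename is just a convention):

    import subprocess

    # Write the default ethics checklist to a markdown file in the repository
    # so it can be reviewed and checked off as the project progresses.
    subprocess.run(["deon", "-o", "ETHICS.md"], check=True)

    # Preview the first few items of the generated checklist.
    print(open("ETHICS.md").read()[:400])

The default checklist can also be customized for a specific project, which is one of the topics covered in the interview.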

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Deon is and your motivation for creating it?
  • Why a checklist, specifically? What’s the advantage of this over an oath, for example?
  • What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
  • What is the typical workflow for a team that is using Deon in their projects?
  • Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
    • Have you received pushback on any of the default items?
  • How does Deon simplify communication around ethics across team boundaries?
  • What are some of the most often overlooked items?
  • What are some of the most difficult ethical concerns to comply with for a typical data science project?
  • How has Deon helped you at Driven Data?
  • What are the customer facing impacts of embedding a discussion of ethics in the product development process?
  • Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
  • What are your hopes for the future of the Deon project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumption of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
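
As a rough sketch of what this looks like from the query layer (assuming a Spark session launched with the Iceberg runtime and SQL extensions and a catalog named demo already configured; the table and column names are hypothetical), partitioning is declared as a transform over a data column, and the planner prunes files using Iceberg’s metadata instead of listing directories in S3:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

    # Partitioning is a hidden transform of the event timestamp; readers and
    # writers only ever deal with the logical columns.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS demo.db.events (
            event_id BIGINT,
            event_ts TIMESTAMP,
            payload  STRING
        )
        USING iceberg
        PARTITIONED BY (days(event_ts))
    """)

    # The filter on event_ts is matched against the partition transform in the
    # table metadata, so only the relevant data files are ever read.
    spark.sql("""
        SELECT count(*)
        FROM demo.db.events
        WHERE event_ts >= TIMESTAMP '2018-11-01 00:00:00'
    """).show()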

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Iceberg is and the motivation for creating it?
    • Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
  • How has the use of Iceberg simplified your work at Netflix?
  • How is the reference implementation architected and how has it evolved since you first began work on it?
    • What is involved in deploying it to a user’s environment?
  • For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
    • Is there a migration path for pre-existing tables into the Iceberg format?
  • How is schema evolution managed at the file level?
    • How do you handle files on disk that don’t contain all of the fields specified in a table definition?
  • One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
  • What are the unique challenges posed by using S3 as the basis for a data lake?
    • What are the benefits that outweigh the difficulties?
  • What have been some of the most challenging or contentious details of the specification to define?
    • What are some things that you have explicitly left out of the specification?
  • What are your long-term goals for the Iceberg specification?
    • Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51

Summary

One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application-oriented workloads and analytical, high-volume workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise-grade data management.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom-built by your data science team. If you think that sounds awesome (and it is), then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a NewSQL database built for simultaneous transactional and analytic workloads

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what MemSQL is and how the product and business first got started?
  • What are the typical use cases for customers running MemSQL?
  • What are the benefits of integrating the ingestion pipeline with the database engine? (see the pipeline sketch after this list)
    • What are some typical ways that the ingest capability is leveraged by customers?
  • How is MemSQL architected and how has the internal design evolved from when you first started working on it?
    • Where does it fall on the axes of the CAP theorem?
    • How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
    • Can you describe the lifecycle of a write transaction?
  • Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
    • How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
  • How much of the implementation of MemSQL is using custom built code vs. open source projects?
  • What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
  • What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
  • When is MemSQL the wrong choice for a data platform?
  • What do you have planned for the future of MemSQL?
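
As mentioned above, MemSQL’s ingestion feature ties a source such as Kafka directly to a table inside the database engine. A rough sketch of creating a pipeline over the MySQL wire protocol (the hosts, topic, credentials, and table below are hypothetical, and the available pipeline options vary by version):

    import pymysql

    # MemSQL speaks the MySQL wire protocol, so a standard client library works.
    conn = pymysql.connect(host="memsql.example.com", user="admin",
                           password="secret", database="analytics", autocommit=True)

    with conn.cursor() as cur:
        # A pipeline binds a Kafka topic directly to a table; the engine manages
        # offsets and parallel loading internally instead of an external ETL job.
        cur.execute("""
            CREATE PIPELINE events_pipeline AS
            LOAD DATA KAFKA 'kafka.example.com:9092/events'
            INTO TABLE events
            FIELDS TERMINATED BY ','
        """)
        cur.execute("START PIPELINE events_pipeline")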

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
    • How do you define the concept of a knowledge graph?
  • What are the processes involved in constructing a knowledge graph?
  • Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
  • What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
    • How do you manage the software lifecycle for your ETL code?
    • What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
  • What are the current challenges that you are facing in building and scaling your data infrastructure?
    • How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
    • What techniques are you using to manage accuracy and consistency in the data that you ingest?
  • Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
  • What are the weak spots in your platform that you are planning to address in upcoming projects?
    • If you were to start from scratch today, what would you have done differently?
  • What are some of the most interesting or unexpected uses of your product that you have seen?
  • What is in store for the future of Enigma?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA