Open Source

The Alluxio Distributed Storage System - Episode 70

Summary

Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Bin Fan about Alluxio, a distributed virtual filesystem for unified access to disparate data sources

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Alluxio is and the history of the project?
    • What are some of the use cases that Alluxio enables?
  • How is Alluxio implemented and how has its architecture evolved over time?
    • What are some of the techniques that you use to mitigate the impact of latency, particularly when interfacing with storage systems across cloud providers and private data centers?
  • When dealing with large volumes of data over time it is often necessary to age out older records to cheaper storage. What capabilities does Alluxio provide for that lifecycle management?
  • What are some of the most complex or challenging aspects of providing a unified abstraction across disparate storage platforms?
    • What are the tradeoffs that are made to provide a single API across systems with varying capabilities?
  • Testing and verification of distributed systems is a complex undertaking. Can you describe the approach that you use to ensure proper functionality of Alluxio as part of the development and release process?
    • In order to allow for this large scale testing with any regularity it must be straightforward to deploy and configure Alluxio. What are some of the mechanisms that you have built into the platform to simplify the operational aspects?
  • Can you describe a typical system topology that incorporates Alluxio?
  • For someone planning a deployment of Alluxio, what should they be considering in terms of system requirements and deployment topologies?
    • What are some edge cases or operational complexities that they should be aware of?
  • What are some cases where Alluxio is the wrong choice?
    • What are some projects or products that provide a similar capability to Alluxio?
  • What do you have planned for the future of the Alluxio project and company?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Cleaning And Curating Open Data For Archaeology - Episode 68

Summary

Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data

Interview

  • Introduction

  • How did you get involved in the area of data management?

    I did some database and GIS work for my dissertation in archaeology, back in the late 1990’s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies in research data management.

  • Can you start by describing what Open Context is and how it started?

    Open Context is an open access data publishing service for archaeology. It started because we need better ways of dissminating structured data and digital media than is possible with conventional articles, books and reports.

  • What are your protocols for determining which data sets you will work with?

    Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.

  • What are some of the challenges unique to research data?

    • What are some of the unique requirements for processing, publishing, and archiving research data?

      You have to work on a shoe-string budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.

      Another issues is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many paricularistic interests.

  • How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?

    We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.

  • Can you describe the system architecture that you use for Open Context?

    Open Context is a Django Python application, with a Postgres database and an Apache Solr index. It’s running on Google cloud services on a Debian linux.

  • What is the process for cleaning and formatting the data that you host?

    • How much domain expertise is necessary to ensure proper conversion of the source data?

      That’s one of the bottle necks. We have to do an ETL (extract transform load) on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back and forth conversations with data creators.

    • Can you discuss the challenges that you face in maintaining a consistent ontology?

    • What pieces of metadata do you track for a given data set?

  • Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?

    • Can you walk through the lifecycle of a given data set?
  • Data archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?

  • Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?

  • What are some of the most interesting uses you have seen of the data that is hosted on Open Context?

  • What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?

  • What are your goals for the future of Open Context?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you refresh our memory about what TimescaleDB is?
  • How has the market for timeseries databases changed since we last spoke?
  • What has changed in the focus and features of the TimescaleDB project and company?
  • Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
    • What were the most challenging aspects of reaching that goal?
  • In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
    • How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
  • What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
  • How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
    • Have you been able to leverage some of the native improvements to simplify your implementation?
    • Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?
  • What is in store for the future of the Timescale product and organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Performing Fast Data Analytics Using Apache Kudu - Episode 64

Summary

The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into you analytics pipeline.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Kudu is and the motivation for building it?
    • How does it fit into the Hadoop ecosystem?
    • How does it compare to the work being done on the Iceberg table format?
  • What are some of the common application and system design patterns that Kudu supports?
  • How is Kudu architected and how has it evolved over the life of the project?
  • There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu?
  • How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?
    • What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?
  • A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?
  • What was the motivation for using C++ as the language target for Kudu?
    • If you were to start the project over today what would you do differently?
  • What are some situations where you would advise against using Kudu?
  • What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?
  • What are you most excited about for the future of Kudu?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

Summary

As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fullfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different than that of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuk explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the hosts mind is blown by the possibilities of treating everything, including schema information, as a stream.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pravega is and the story behind it?
  • What are the use cases for Pravega and how does it fit into the data ecosystem?
    • How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
  • How do you represent a stream on-disk?
    • What are the benefits of using this format for persisted streams?
  • One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
  • I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
  • For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
  • What are some of the unique system design patterns that are made possible by Pravega?
  • How is Pravega architected internally?
  • What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
  • A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?
    • Does it have any special capabilities for simplifying processing of out-of-order events?
  • For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
    • What are some of the operational edge cases that users should be aware of?
  • What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
  • What are some cases where you would recommend against using Pravega?
  • What is in store for the future of Pravega?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Summary

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what PipelineDB is and the motivation for creating it?
    • What are the major use cases that it enables?
    • What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
  • What are the major concepts and components that users of PipelineDB should be familiar with?
  • Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
  • What are some of the common patterns for populating data streams?
  • What are the options for scaling PipelineDB systems, both vertically and horizontally?
    • How much elasticity does the system support in terms of changing volumes of inbound data?
    • What are some of the limitations or edge cases that users should be aware of?
  • Given that inbound data is not persisted to disk, how do you guard against data loss?
    • Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
    • Can a separate table be used as an input stream?
  • Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
  • What are some of the features that you have found to be the most useful which users might initially overlook?
  • What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
  • What are some of the most challenging aspects of building continuous aggregates on unbounded data?
  • What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
  • What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
  • When is PipelineDB the wrong choice?
  • What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Summary

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Spark is?
    • What are some of the main use cases for Spark?
    • What are some of the problems that Spark is uniquely suited to address?
    • Who uses Spark?
  • What are the tools offered to Spark users?
  • How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
  • For someone building on top of Spark what are the main software design paradigms?
    • How does the design of an application change as you go from a local development environment to a production cluster?
  • Once your application is written, what is involved in deploying it to a production environment?
  • What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
  • What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
  • What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
  • What are the limitations of the Spark programming model?
    • What are the cases where Spark is the wrong choice?
  • What was your motivation for writing a book about Spark?
    • Who is the target audience?
  • What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
  • What advice do you have for anyone who is considering or currently using Spark?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Book Discount

  • Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather then re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Zookeeper is and how the project got started?
    • What are the main motivations for using a centralized coordination service for distributed systems?
  • What are the distributed systems primitives that are built into Zookeeper?
    • What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
    • What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
  • Can you discuss how Zookeeper is architected and how that design has evolved over time?
    • What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
  • What are the scaling factors for Zookeeper?
    • What are the edge cases that users should be aware of?
    • Where does it fall on the axes of the CAP theorem?
  • What are the main failure modes for Zookeeper?
    • How much of the recovery logic is left up to the end user of the Zookeeper cluster?
  • Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
  • In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
    • What are some of the cases where Zookeeper is the wrong choice?
  • How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
  • If you were to start the project over today, what would you do differently?
    • Would you still use Java?
  • What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
  • What do you have planned for the future of Zookeeper?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the lanscape of stream processing tools, and how you can start using it today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Flink is and how the project got started?
  • What are some of the primary ways that Flink is used?
  • How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
    • What are some use cases that Flink is uniquely qualified to handle?
  • Where does Flink fit into the current data landscape?
  • How is Flink architected?
    • How has that architecture evolved?
    • Are there any aspects of the current design that you would do differently if you started over today?
  • How does scaling work in a Flink deployment?
    • What are the scaling limits?
    • What are some of the failure modes that users should be aware of?
  • How is the statefulness of a cluster managed?
    • What are the mechanisms for managing conflicts?
    • What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
    • Can state be shared across processes or tasks within a Flink cluster?
  • What are the comparative challenges of working with bounded vs unbounded streams of data?
  • How do you handle out of order events in Flink, especially as the delay for a given event increases?
  • For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
  • What are some of the most challenging or complicated aspects of building and maintaining Flink?
  • What are some of the most interesting or unexpected ways that you have seen Flink used?
  • What are some of the improvements or new features that are planned for the future of Flink?
  • What are some features or use cases that you are explicitly not planning to support?
  • For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
    • What do they find most interesting or exciting?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA