Linode

Cleaning And Curating Open Data For Archaeology - Episode 68

Summary

Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data

Interview

  • Introduction

  • How did you get involved in the area of data management?

    I did some database and GIS work for my dissertation in archaeology, back in the late 1990’s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies in research data management.

  • Can you start by describing what Open Context is and how it started?

    Open Context is an open access data publishing service for archaeology. It started because we need better ways of dissminating structured data and digital media than is possible with conventional articles, books and reports.

  • What are your protocols for determining which data sets you will work with?

    Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.

  • What are some of the challenges unique to research data?

    • What are some of the unique requirements for processing, publishing, and archiving research data?

      You have to work on a shoe-string budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.

      Another issues is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many paricularistic interests.

  • How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?

    We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.

  • Can you describe the system architecture that you use for Open Context?

    Open Context is a Django Python application, with a Postgres database and an Apache Solr index. It’s running on Google cloud services on a Debian linux.

  • What is the process for cleaning and formatting the data that you host?

    • How much domain expertise is necessary to ensure proper conversion of the source data?

      That’s one of the bottle necks. We have to do an ETL (extract transform load) on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back and forth conversations with data creators.

    • Can you discuss the challenges that you face in maintaining a consistent ontology?

    • What pieces of metadata do you track for a given data set?

  • Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?

    • Can you walk through the lifecycle of a given data set?
  • Data archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?

  • Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?

  • What are some of the most interesting uses you have seen of the data that is hosted on Open Context?

  • What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?

  • What are your goals for the future of Open Context?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Managing Database Access Control For Teams With strongDM - Episode 67

Summary

Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After years of running across the same issues in numerous companies and even more projects Justin McCarthy built strongDM to solve database access management for everyone. In this episode he explains how the strongDM proxy works to grant and audit access to storage systems and the benefits that it provides to engineers and team leads.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Justin McCarthy about StrongDM, a hosted service that simplifies access controls for your data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining the problem that StrongDM is solving and how the company got started?
    • What are some of the most common challenges around managing access and authentication for data storage systems?
    • What are some of the most interesting workarounds that you have seen?
    • Which areas of authentication, authorization, and auditing are most commonly overlooked or misunderstood?
  • Can you describe the architecture of your system?
    • What strategies have you used to enable interfacing with such a wide variety of storage systems?
  • What additional capabilities do you provide beyond what is natively available in the underlying systems?
  • What are some of the most difficult aspects of managing varying levels of permission for different roles across the diversity of platforms that you support, given that they each have different capabilities natively?
  • For a customer who is onboarding, what is involved in setting up your platform to integrate with their systems?
  • What are some of the assumptions that you made about your problem domain and market when you first started which have been disproven?
  • How do organizations in different industries react to your product and how do their policies around granting access to data differ?
  • What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of building and growing StrongDM?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building Enterprise Big Data Systems At LEGO - Episode 66

Summary

Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper S√łgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • My understanding is that the big data group at LEGO is a fairly recent development. Can you share the story of how it got started?
    • What kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data?
    • What was the transition process like, migrating data silos into a uniformly managed platform?
  • What are the biggest data challenges that you face at LEGO?
  • What are some of the most critical sources and types of data that you are managing?
  • What are the main components of the data infrastructure that you have built to support the organizations analytical needs?
    • What are some of the technologies that you have found to be most useful?
    • Which have been the most problematic?
  • What does the team structure look like for the data services at LEGO?
    • Does that reflect in the types/numbers of systems that you support?
  • What types of testing, monitoring, and metrics do you use to ensure the health of the systems you support?
  • What have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO?
  • How have the data systems at Lego evolved over recent years as new technologies and techniques have been developed?
  • How does the global nature of the LEGO business influence the design strategies and technology choices for your platform?
  • What are you most excited for in the coming year?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you refresh our memory about what TimescaleDB is?
  • How has the market for timeseries databases changed since we last spoke?
  • What has changed in the focus and features of the TimescaleDB project and company?
  • Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
    • What were the most challenging aspects of reaching that goal?
  • In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
    • How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
  • What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
  • How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
    • Have you been able to leverage some of the native improvements to simplify your implementation?
    • Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?
  • What is in store for the future of the Timescale product and organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Performing Fast Data Analytics Using Apache Kudu - Episode 64

Summary

The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into you analytics pipeline.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Kudu is and the motivation for building it?
    • How does it fit into the Hadoop ecosystem?
    • How does it compare to the work being done on the Iceberg table format?
  • What are some of the common application and system design patterns that Kudu supports?
  • How is Kudu architected and how has it evolved over the life of the project?
  • There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu?
  • How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?
    • What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?
  • A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?
  • What was the motivation for using C++ as the language target for Kudu?
    • If you were to start the project over today what would you do differently?
  • What are some situations where you would advise against using Kudu?
  • What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?
  • What are you most excited about for the future of Kudu?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

Summary

As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fullfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different than that of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuk explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the hosts mind is blown by the possibilities of treating everything, including schema information, as a stream.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pravega is and the story behind it?
  • What are the use cases for Pravega and how does it fit into the data ecosystem?
    • How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
  • How do you represent a stream on-disk?
    • What are the benefits of using this format for persisted streams?
  • One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
  • I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
  • For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
  • What are some of the unique system design patterns that are made possible by Pravega?
  • How is Pravega architected internally?
  • What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
  • A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?
    • Does it have any special capabilities for simplifying processing of out-of-order events?
  • For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
    • What are some of the operational edge cases that users should be aware of?
  • What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
  • What are some cases where you would recommend against using Pravega?
  • What is in store for the future of Pravega?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Summary

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what PipelineDB is and the motivation for creating it?
    • What are the major use cases that it enables?
    • What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
  • What are the major concepts and components that users of PipelineDB should be familiar with?
  • Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
  • What are some of the common patterns for populating data streams?
  • What are the options for scaling PipelineDB systems, both vertically and horizontally?
    • How much elasticity does the system support in terms of changing volumes of inbound data?
    • What are some of the limitations or edge cases that users should be aware of?
  • Given that inbound data is not persisted to disk, how do you guard against data loss?
    • Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
    • Can a separate table be used as an input stream?
  • Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
  • What are some of the features that you have found to be the most useful which users might initially overlook?
  • What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
  • What are some of the most challenging aspects of building continuous aggregates on unbounded data?
  • What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
  • What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
  • When is PipelineDB the wrong choice?
  • What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61

Summary

Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines are used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your definition of a data pipeline?
    • At what point in the life of a project or organization should you start thinking about building a pipeline?
  • In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
    • What metrics/use cases should you be optimizing for at this point?
  • What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
    • How do the design requirements for a data pipeline change as you reach this stage?
    • What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
  • What are some of the changes that are necessary as you move to a large scale data pipeline?
  • At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
  • In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
  • When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
  • Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
  • What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
    • How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
  • What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
  • What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
  • What are your plans for improving your current pipeline at Grubhub?
  • What are some references that you recommend for anyone who is designing a new data platform?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Summary

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Spark is?
    • What are some of the main use cases for Spark?
    • What are some of the problems that Spark is uniquely suited to address?
    • Who uses Spark?
  • What are the tools offered to Spark users?
  • How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
  • For someone building on top of Spark what are the main software design paradigms?
    • How does the design of an application change as you go from a local development environment to a production cluster?
  • Once your application is written, what is involved in deploying it to a production environment?
  • What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
  • What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
  • What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
  • What are the limitations of the Spark programming model?
    • What are the cases where Spark is the wrong choice?
  • What was your motivation for writing a book about Spark?
    • Who is the target audience?
  • What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
  • What advice do you have for anyone who is considering or currently using Spark?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Book Discount

  • Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather then re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Zookeeper is and how the project got started?
    • What are the main motivations for using a centralized coordination service for distributed systems?
  • What are the distributed systems primitives that are built into Zookeeper?
    • What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
    • What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
  • Can you discuss how Zookeeper is architected and how that design has evolved over time?
    • What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
  • What are the scaling factors for Zookeeper?
    • What are the edge cases that users should be aware of?
    • Where does it fall on the axes of the CAP theorem?
  • What are the main failure modes for Zookeeper?
    • How much of the recovery logic is left up to the end user of the Zookeeper cluster?
  • Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
  • In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
    • What are some of the cases where Zookeeper is the wrong choice?
  • How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
  • If you were to start the project over today, what would you do differently?
    • Would you still use Java?
  • What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
  • What do you have planned for the future of Zookeeper?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA