Open Source

Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

Summary

Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
    • Where are you using notebooks and where are you not?
  • What is the technical infrastructure that you have built to suppport that design choice?
  • Which team was driving the effort?
    • Was it difficult to get buy in across teams?
  • How much shared code have you been able to consolidate or reuse across teams/roles?
  • Have you investigated the use of any of the other notebook platforms for similar workflows?
  • What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
  • What are some of the limitations of the notebook environment for the work that you are doing?
  • What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
  • What are some of the projects that are ongoing or planned for the future that you are most excited by?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly different which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Iceberg is and the motivation for creating it?
    • Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
  • How has the use of Iceberg simplified your work at Netflix?
  • How is the reference implementation architected and how has it evolved since you first began work on it?
    • What is involved in deploying it to a user’s environment?
  • For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
    • Is there a migration path for pre-existing tables into the Iceberg format?
  • How is schema evolution managed at the file level?
    • How do you handle files on disk that don’t contain all of the fields specified in a table definition?
  • One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
  • What are the unique challenges posed by using S3 as the basis for a data lake?
    • What are the benefits that outweigh the difficulties?
  • What have been some of the most challenging or contentious details of the specification to define?
    • What are some things that you have explicitly left out of the specification?
  • What are your long-term goals for the Iceberg specification?
    • Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

  • Introductions
  • How did you get involved in the area of data engineering and data management?
  • What is Snowplow Analytics and what problem were you trying to solve when you started the company?
  • What is unique about customer event data from an ingestion and processing perspective?
  • Challenges with properly matching up data between sources
  • Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
    • Cleanliness/accuracy
  • What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
  • Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
    • How has that architecture evolved from when you first started?
    • What would you do differently if you were to start over today?
  • Ensuring appropriate use of enrichment sources
  • What have been some of the biggest challenges encountered while building and evolving Snowplow?
  • What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

Summary

The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metatdata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.
  • Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience
    as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40!
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is DGraph and what motivated you to build it?
  • Graph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?
    • The graph space is becoming crowded in recent years. How does DGraph compare to the current set of offerings?
  • What are some of the common uses of graph storage systems?
    • What are some potential uses that are often overlooked?
  • There are a few ways that graph structures and properties can be implemented, including the ability to store data in the vertices connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?
  • How does the query interface and data storage in DGraph differ from other options?
    • What are your opinions on the graph query languages that have been adopted by other storages systems, such as Gremlin, Cypher, and GSQL?
  • How is DGraph architected and how has that architecture evolved from when it first started?
  • How do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?
  • In your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?
  • What are the limitations of DGraph in terms of scalability or usability?
  • Where does it fall along the axes of the CAP theorem?
  • For someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?
  • What have been the most challenging aspects of building and growing the DGraph project and community?
  • What are some of the most interesting or unexpected uses of DGraph that you are aware of?
  • When is DGraph the wrong choice?
  • What are your plans for the future of DGraph?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

Summary

One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • How did you get involved in the Postgres project?
  • For anyone who hasn’t used it, can you describe what PostgreSQL is?
    • Where did Postgres get started and how has it evolved over the intervening years?
  • What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?
    • What are some cases where Postgres is the wrong choice?
  • What are some of the common points of confusion for new users of PostGreSQL? (particularly if they have prior database experience)
  • The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities?
  • What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer?
  • Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB?
  • What is in store for the future of Postgres?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is Ona and how did the company get started?
    • What are some examples of the types of customers that you work with?
  • What types of data do you support in your collection platform?
  • What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
  • Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
  • What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
  • Can you describe the flow of the data from collection through to analysis?
  • To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
    • What are the architectural considerations that you factored in when designing it?
    • What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
  • What are your plans for the future of Ona and Canopy?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40

Summary

When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start with an overview of what Ceph is?
    • What was the motivation for starting the project?
    • What are some of the most common use cases for Ceph?
  • There are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. HDFS, GlusterFS, LionFS, SeaweedFS, etc.)?
  • Given that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions?
    • What mechanisms are available to ensure data integrity across the cluster?
  • How is Ceph implemented and how has the design evolved over time?
  • What is required to deploy and manage a Ceph cluster?
    • What are the scaling factors for a cluster?
    • What are the limitations?
  • How does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files?
  • In services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster?
  • In what situations would you advise someone against using Ceph?
  • What are some of the most interested, unexpected, or challenging aspects of working with Ceph and the community?
  • What are some of the plans that you have for the future of Ceph?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

Summary

Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explained how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what NiFi is?
  • What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?
  • How did you get involved with the project?
    • Where does it sit in the broader landscape of data tools?
  • Does the data that is processed by NiFi flow through the servers that it is running on (á la Spark/Flink/Kafka), or does it orchestrate actions on other systems (á la Airflow/Oozie)?
    • How do you manage versioning and backup of data flows, as well as promoting them between environments?
  • One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?
    • What types of reporting are available across this information?
  • What are some of the use cases or requirements that lend themselves well to being solved by NiFi?
    • When is NiFi the wrong choice?
  • What is involved in deploying and scaling a NiFi installation?
    • What are some of the system/network parameters that should be considered?
    • What are the scaling limitations?
  • What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?
  • What do you have planned for the future of NiFi?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

Summary

Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is the intended use case for Quilt and how did the project get started?
  • Can you step through a typical workflow of someone using Quilt?
    • How does that change as you go from a single user to a team of data engineers and data scientists?
  • Can you describe the elements of what a data package consists of?
    • What was your criteria for the file formats that you chose?
  • How is Quilt architected and what have been the most significant changes or evolutions since you first started?
  • How is the data registry implemented?
    • What are the limitations or edge cases that you have run into?
    • What optimizations have you made to accelerate synchronization of the data to and from the repository?
  • What are the limitations in terms of data volume, format, or usage?
  • What is your goal with the business that you have built around the project?
  • What are your plans for the future of Quilt?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

CockroachDB In Depth with Peter Mattis - Episode 35

Summary

With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in Cockroach DB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was the motivation for creating CockroachDB and building a business around it?
  • Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions?
    • What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions?
    • What are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism?
  • Go is an unconventional language for building a database. What are the pros and cons of that choice?
  • What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?
    • What are the edge cases and failure modes that users should be aware of?
  • I know that your SQL syntax is PostGreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB?
    • What are some examples of extensions that are specific to CockroachDB?
  • What are some of the most interesting uses of CockroachDB that you have seen?
  • When is CockroachDB the wrong choice?
  • What do you have planned for the future of CockroachDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA