Databases

Using FoundationDB As The Bedrock For Your Distributed Systems - Episode 80

Summary

The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Worl about FoundationDB, a distributed key-value store that gives you the power of ACID transactions in a NoSQL database

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you explain what FoundationDB is and how you got involved with the project?
  • What are some of the unique use cases that FoundationDB enables?
  • Can you describe how FoundationDB is architected?
    • How is the ACID compliance implemented at the cluster level?
  • What are some of the mechanisms built into FoundationDB that contribute to its fault tolerance?
    • How are conflicts managed?
  • FoundationDB has an interesting feature in the form of Layers that provide different semantics on top of the underlying storage. Can you describe how that is implemented and some of the interesting layers that are available? (a code sketch follows this list)
    • Is it possible to apply different layers, such as relational and document, to the same underlying objects in storage?
  • One of the aspects of FoundationDB that is called out in the documentation, and which I have heard about elsewhere, is the performance that it provides. Can you describe some of the implementation mechanics of FoundationDB that allow it to provide such high throughput?
  • For someone who wants to run FoundationDB can you describe a typical deployment topology?
    • What are the scaling factors for the underlying storage and for the Layers that are operating on the cluster?
  • Once you have a cluster deployed, what are some of the edge cases that users should watch out for?
    • How are version upgrades managed in a cluster?
  • What are some of the ways that FoundationDB impacts the way that an application developer or data engineer would architect their software as compared to working with something like Postgres or MongoDB?
  • What are some of the more interesting/unusual/unexpected ways that you have seen FoundationDB used?
  • When is FoundationDB the wrong choice?
  • What is in store for the future of FoundationDB?
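
To give a flavor of what building a layer on FoundationDB looks like, here is a minimal sketch of a toy indexing layer written against the official Python bindings. It assumes a running cluster with a cluster file in the default location, and the subspace layout is invented for illustration:

    import fdb

    fdb.api_version(610)
    db = fdb.open()  # assumes a cluster file in the default location

    # Two keyspaces carved out with the tuple layer: primary records plus a
    # secondary index, both maintained inside one ACID transaction.
    users = fdb.Subspace(('users',))
    by_email = fdb.Subspace(('idx', 'email'))

    @fdb.transactional
    def create_user(tr, user_id, email):
        tr[users.pack((user_id,))] = email.encode('utf-8')
        tr[by_email.pack((email,))] = fdb.tuple.pack((user_id,))

    @fdb.transactional
    def find_by_email(tr, email):
        value = tr[by_email.pack((email,))]
        return fdb.tuple.unpack(value)[0] if value.present() else None

    create_user(db, 1, 'ada@example.com')
    print(find_by_email(db, 'ada@example.com'))  # -> 1

Because both writes commit in the same transaction, the index can never drift out of sync with the primary record, which is exactly the property that real layers build on.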

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Running Your Database On Kubernetes With KubeDB - Episode 79

Summary

Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing a simple mechanism for running your storage system in the same platform as your application. In this episode Tamal Saha explains how the KubeDB project got started, why you might want to run your database with Kubernetes, and how to get started. He also covers some of the challenges of managing stateful services in Kubernetes and how the fast pace of the community has contributed to the evolution of KubeDB. If you are at any stage of a Kubernetes implementation, or just thinking about it, this is definitely worth a listen to get some perspective on how to leverage it for your entire application stack.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tamal Saha about KubeDB, a project focused on making running production-grade databases easy on Kubernetes

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what KubeDB is and how the project got started?
  • What are the main challenges associated with running a stateful system on top of Kubernetes?
    • Why would someone want to run their database on a container platform rather than on a dedicated instance or with a hosted service?
  • Can you describe how KubeDB is implemented and how that has evolved since you first started working on it?
  • Can you talk through how KubeDB simplifies the process of deploying and maintaining databases? (a code sketch follows this list)
  • What is involved in adding support for a new database?
    • How do the requirements change for systems that are natively clustered?
  • How does KubeDB help with maintenance processes around upgrading existing databases to newer versions?
  • How does the work that you are doing on KubeDB compare to what is available in StorageOS?
    • Are there any other projects that are targeting similar goals?
  • What have you found to be the most interesting/challenging/unexpected aspects of building KubeDB?
  • What do you have planned for the future of the project?
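
For a sense of the declarative workflow, here is a hedged sketch that asks KubeDB to provision a Postgres instance from Python using the official Kubernetes client. The kubedb.com/v1alpha1 group and the spec fields approximate the CRD schema of this era, so treat the resource details as assumptions rather than a reference:

    from kubernetes import client, config

    config.load_kube_config()  # assumes a local kubeconfig
    api = client.CustomObjectsApi()

    # KubeDB watches for custom resources like this one and runs the
    # operator logic to provision the database and its storage.
    postgres = {
        "apiVersion": "kubedb.com/v1alpha1",  # assumed group/version
        "kind": "Postgres",
        "metadata": {"name": "demo-pg", "namespace": "default"},
        "spec": {
            "version": "10.2",  # hypothetical catalog version name
            "storage": {
                "storageClassName": "standard",
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "1Gi"}},
            },
        },
    }

    api.create_namespaced_custom_object(
        group="kubedb.com",
        version="v1alpha1",
        namespace="default",
        plural="postgreses",
        body=postgres,
    )

From there the operator handles the pods, services, and volumes, which is the maintenance burden the project aims to absorb.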

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Unpacking Fauna: A Global Scale Cloud Native Database - Episode 78

Summary

One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna, and in this episode he explains the unique capabilities of Fauna, compares its consensus and transaction algorithm to those used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first-class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple-to-manage data layer that will scale with your business.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what FaunaDB is and how it got started?
  • What are some of the main use cases that FaunaDB is targeting?
    • How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?
  • Can you describe the architecture of FaunaDB and how it has evolved?
  • The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?
    • What are some of the edge cases that users should be aware of?
    • How are conflicts managed in Fauna?
  • What is the underlying storage layer?
    • How is the query layer designed to allow for different query patterns and model representations?
  • How does data modeling in Fauna compare to that of relational or document databases?
    • Can you describe the query format? (a code sketch follows this list)
    • What are some of the common difficulties or points of confusion around interacting with data in Fauna?
  • What are some application design patterns that are enabled by using Fauna as the storage layer?
  • Given the ability to replicate globally, how do you mitigate latency when interacting with the database?
  • What are some of the most interesting or unexpected ways that you have seen Fauna used?
  • When is it the wrong choice?
  • What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?
  • What do you have in store for the future of Fauna?
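
As an illustration of the query format and the temporality called out above, here is a sketch using Fauna’s Python driver. The function names reflect one driver version and may differ in others, so read it as an approximation rather than a verified example:

    from faunadb import query as q
    from faunadb.client import FaunaClient

    client = FaunaClient(secret="your-server-secret")  # hypothetical key

    # FQL is composed as function calls rather than parsed from a string.
    # Assumes an "orders" collection already exists.
    result = client.query(
        q.create(q.collection("orders"), {"data": {"status": "shipped"}})
    )

    # First-class temporality: read the same document as of a past instant.
    snapshot = client.query(
        q.at(q.time("2019-01-01T00:00:00Z"), q.get(result["ref"]))
    )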

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Index Your Big Data With Pilosa For Faster Analytics - Episode 77

Summary

Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Seebs about Pilosa, an open source, distributed bitmap index

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Pilosa is and how the project got started?
  • Where does Pilosa fit into the overall data ecosystem and how does it integrate into an existing stack?
  • What types of use cases is Pilosa uniquely well suited for?
  • The Pilosa data model is unusual. Can you talk through how it is represented and implemented? (a code sketch follows this list)
  • What are some approaches to modeling data that might be coming from a relational database or some structured flat files?
    • How do you handle highly dimensional data?
  • What are some of the decisions that need to be made early in the modeling process which could have ramifications later on in the lifecycle of the project?
  • What are the scaling factors of Pilosa?
  • What are some of the most interesting/challenging/unexpected lessons that you have learned in the process of building Pilosa?
  • What is in store for the future of Pilosa?
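
To make the data model concrete, here is a small sketch against the python-pilosa client, assuming a server on its default port and an invented schema. Rows are attribute values, columns are records, and queries are set algebra over bitmaps:

    import pilosa

    client = pilosa.Client("localhost:10101")
    schema = client.schema()

    # One bitmap per stargazer; one column per repository.
    repos = schema.index("repository")
    stargazer = repos.field("stargazer")
    client.sync_schema(schema)

    # Set a bit: user 14 starred repository 10482.
    client.query(stargazer.set(14, 10482))

    # Which repositories did user 14 star? Intersections, unions, and
    # counts across rows like this one are the high-speed aggregates.
    response = client.query(stargazer.row(14))
    print(response.result.row.columns)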

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time-oriented events.
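
For anyone who needs the refresher before diving in, the core workflow is plain SQL. Here is a minimal sketch from Python (connection details invented) that turns a table into a hypertable and runs a bucketed aggregate:

    import psycopg2

    conn = psycopg2.connect("dbname=metrics user=postgres")  # invented DSN
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")

    # A hypertable looks and behaves like a normal table, but is
    # transparently partitioned by time under the hood.
    cur.execute("""
        CREATE TABLE conditions (
            time        TIMESTAMPTZ NOT NULL,
            device_id   TEXT,
            temperature DOUBLE PRECISION
        );
    """)
    cur.execute("SELECT create_hypertable('conditions', 'time');")

    # Plain SQL still works; time_bucket() yields per-interval aggregates.
    cur.execute("""
        SELECT time_bucket('15 minutes', time) AS bucket,
               device_id,
               avg(temperature)
        FROM conditions
        GROUP BY bucket, device_id
        ORDER BY bucket;
    """)
    conn.commit()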

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you refresh our memory about what TimescaleDB is?
  • How has the market for timeseries databases changed since we last spoke?
  • What has changed in the focus and features of the TimescaleDB project and company?
  • Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
    • What were the most challenging aspects of reaching that goal?
  • In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
    • How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
  • What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
  • How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
    • Have you been able to leverage some of the native improvements to simplify your implementation?
    • Are there any use cases that were previously impractical in vanilla Postgres and would now be reasonable without the help of Timescale?
  • What is in store for the future of the Timescale product and organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Summary

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
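
Here is a minimal sketch of that continuous query model, using the pre-1.0 standalone syntax (the 1.0 extension release reworked this DDL slightly), with invented connection details:

    import psycopg2

    conn = psycopg2.connect("dbname=pipeline user=postgres")  # invented DSN
    cur = conn.cursor()

    # Streams are append-only and never persisted; the continuous view
    # incrementally maintains its aggregates as events arrive.
    cur.execute("CREATE STREAM page_views (url TEXT, latency INT);")
    cur.execute("""
        CREATE CONTINUOUS VIEW view_counts AS
        SELECT url, count(*) AS hits, avg(latency) AS avg_latency
        FROM page_views
        GROUP BY url;
    """)

    # Writing to a stream is just an INSERT...
    cur.execute("INSERT INTO page_views (url, latency) VALUES (%s, %s)",
                ("/home", 42))

    # ...and reading the view returns the up-to-date aggregate.
    cur.execute("SELECT * FROM view_counts;")
    conn.commit()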

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what PipelineDB is and the motivation for creating it?
    • What are the major use cases that it enables?
    • What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
  • What are the major concepts and components that users of PipelineDB should be familiar with?
  • Given that it is an extension for PostgreSQL, what level of compatibility exists between PipelineDB and other extensions such as Timescale and Citus?
  • What are some of the common patterns for populating data streams?
  • What are the options for scaling PipelineDB systems, both vertically and horizontally?
    • How much elasticity does the system support in terms of changing volumes of inbound data?
    • What are some of the limitations or edge cases that users should be aware of?
  • Given that inbound data is not persisted to disk, how do you guard against data loss?
    • Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
    • Can a separate table be used as an input stream?
  • Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
  • What are some of the features that you have found to be the most useful which users might initially overlook?
  • What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
  • What are some of the most challenging aspects of building continuous aggregates on unbounded data?
  • What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
  • What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
  • When is PipelineDB the wrong choice?
  • What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each implementation behaves slightly differently, which increases the difficulty of integration across systems. The Hive format is also built on the assumption of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
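
To see why that metadata layer matters, here is a toy sketch of the idea rather than Iceberg’s actual API. A snapshot lists manifests, and manifests carry per-file column stats, so a reader can plan a scan from a few small reads instead of listing objects in S3:

    # Not Iceberg's API: a toy model of the metadata chain, with stats
    # invented for illustration.
    snapshot = {
        "manifests": [
            {"entries": [
                {"path": "s3://bucket/t/data-00.parquet",
                 "stats": {"ts_min": 100, "ts_max": 199}},
                {"path": "s3://bucket/t/data-01.parquet",
                 "stats": {"ts_min": 200, "ts_max": 299}},
            ]},
        ]
    }

    def plan_scan(snapshot, ts_lo, ts_hi):
        """Pick only the data files whose stats overlap the query range."""
        files = []
        for manifest in snapshot["manifests"]:
            for entry in manifest["entries"]:
                stats = entry["stats"]
                if stats["ts_min"] <= ts_hi and stats["ts_max"] >= ts_lo:
                    files.append(entry["path"])
        return files

    print(plan_scan(snapshot, 150, 180))  # -> only data-00.parquet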

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Iceberg is and the motivation for creating it?
    • Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
  • How has the use of Iceberg simplified your work at Netflix?
  • How is the reference implementation architected and how has it evolved since you first began work on it?
    • What is involved in deploying it to a user’s environment?
  • For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
    • Is there a migration path for pre-existing tables into the Iceberg format?
  • How is schema evolution managed at the file level?
    • How do you handle files on disk that don’t contain all of the fields specified in a table definition?
  • One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
  • What are the unique challenges posed by using S3 as the basis for a data lake?
    • What are the benefits that outweigh the difficulties?
  • What have been some of the most challenging or contentious details of the specification to define?
    • What are some things that you have explicitly left out of the specification?
  • What are your long-term goals for the Iceberg specification?
    • Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

Summary

The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software, then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metadata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.
  • Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40!
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low-latency, high-throughput, native and distributed graph database

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is DGraph and what motivated you to build it?
  • Graph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph-oriented storage systems?
    • The graph space has become crowded in recent years. How does DGraph compare to the current set of offerings?
  • What are some of the common uses of graph storage systems?
    • What are some potential uses that are often overlooked?
  • There are a few ways that graph structures and properties can be implemented, including the ability to store data on the edges connecting nodes and in the structures contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?
  • How do the query interface and data storage in DGraph differ from other options? (a code sketch follows this list)
    • What are your opinions on the graph query languages that have been adopted by other storage systems, such as Gremlin, Cypher, and GSQL?
  • How is DGraph architected and how has that architecture evolved from when it first started?
  • How do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?
  • In your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?
  • What are the limitations of DGraph in terms of scalability or usability?
  • Where does it fall along the axes of the CAP theorem?
  • For someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?
  • What have been the most challenging aspects of building and growing the DGraph project and community?
  • What are some of the most interesting or unexpected uses of DGraph that you are aware of?
  • When is DGraph the wrong choice?
  • What are your plans for the future of DGraph?
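
For a feel of the interface discussed above, here is a small sketch using the pydgraph client against a local server; the data and query are invented for illustration:

    import json
    import pydgraph

    stub = pydgraph.DgraphClientStub("localhost:9080")
    client = pydgraph.DgraphClient(stub)

    # Mutations are JSON objects; an edge is just a nested field.
    txn = client.txn()
    try:
        txn.mutate(set_obj={"name": "Alice", "follows": {"name": "Bob"}})
        txn.commit()
    finally:
        txn.discard()

    # Queries use DGraph's GraphQL-inspired language.
    query = """
    {
      people(func: has(follows)) {
        name
        follows { name }
      }
    }
    """
    resp = client.txn(read_only=True).query(query)
    print(json.loads(resp.json))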

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

Summary

One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.
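
Two of the characteristics that conversation keeps returning to, extensibility and the blend of relational and document workloads, fit in a short sketch (connection details invented):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=postgres")  # invented DSN
    cur = conn.cursor()

    # Extensions add capabilities without patching the core.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")

    # JSONB puts document-style data next to relational columns, and a
    # GIN index keeps containment lookups on it fast.
    cur.execute("""
        CREATE TABLE events (
            id      BIGSERIAL PRIMARY KEY,
            kind    TEXT NOT NULL,
            payload JSONB
        );
    """)
    cur.execute("CREATE INDEX ON events USING GIN (payload);")
    cur.execute("SELECT id FROM events WHERE payload @> %s::jsonb;",
                ('{"country": "US"}',))
    conn.commit()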

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • How did you get involved in the Postgres project?
  • For anyone who hasn’t used it, can you describe what PostgreSQL is?
    • Where did Postgres get started and how has it evolved over the intervening years?
  • What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?
    • What are some cases where Postgres is the wrong choice?
  • What are some of the common points of confusion for new users of PostgreSQL? (particularly if they have prior database experience)
  • The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities?
  • What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer?
  • Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB?
  • What is in store for the future of Postgres?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

User Analytics In Depth At Heap with Dan Robinson - Episode 36

Summary

Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is realizing that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.
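
The virtualization idea is easy to miss, so here is a toy sketch of it, not Heap’s implementation: capture every raw interaction up front, then define named events afterward and apply the definitions retroactively at query time:

    # Invented raw events and definitions, purely for illustration.
    raw_events = [
        {"type": "click", "selector": "button#signup", "ts": 1},
        {"type": "click", "selector": "a.nav-home", "ts": 2},
        {"type": "click", "selector": "button#signup", "ts": 3},
    ]

    # Defined long after collection began, yet it matches historical data.
    definitions = {"Signed Up": {"type": "click", "selector": "button#signup"}}

    def label(events, definitions):
        for event in events:
            for name, rule in definitions.items():
                if all(event.get(k) == v for k, v in rule.items()):
                    yield name, event

    print(list(label(raw_events, definitions)))  # two "Signed Up" events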

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving a brief overview of Heap?
  • One of your differentiating features is that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?
  • Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?
  • Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?
    • What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?
  • What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?
    • What challenges does that pose in your processing architecture?
  • What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?
    • How has that architecture changed or evolved over the life of the company?
    • What are some changes that you are anticipating in the near future?
  • Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?
  • What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?
  • What changes have been necessary as a result of GDPR?
  • What are your plans for the future of Heap?

Contact Info

  • @danlovesproofs on twitter
  • [email protected]
  • @drob on github
  • heapanalytics.com / @heap on twitter
  • https://heapanalytics.com/blog/category/engineering

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA