Data Engineering Podcast


This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

459 Episodes

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18 - E18

Summary

As communications between machines become more commonplace, the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay…

11 February 2018 | 01:02:40


Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17 - E17

Summary

One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on…

04 February 2018 | 00:53:47


Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16 - E16

Summary

Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how…

29 January 2018 | 01:02:58


Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15 - E15

Summary

The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode…

22 January 2018 | 00:37:13


CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14 - E14

Summary

As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources, the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to…

15 January 2018 | 00:45:43


Citus Data: Distributed PostgreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13 - E13

Summary

PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with…

08 January 2018 | 00:46:44


Wallaroo with Sean T. Allen - Episode 12 - E12

Summary

Data-oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this…

25 December 2017 | 00:59:13


SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11 - E11

Summary

Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the…

18 December 2017 | 00:33:52


Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10 - E10

Summary

To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode…

10 December 2017 | 00:49:22


data.world with Bryon Jacob - Episode 9 - E9

Summary

We have tools and platforms for collaborating on software projects and linking them together; wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of…

03 December 2017 | 00:46:24