Data Engineering Podcast


This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.



444 Episodes

Putting Airflow Into Production With James Meickle - Episode 43 - E43

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent…


13 August 2018 | 00:48:06


Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42 - E42

Summary

One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements has changed and evolved over its lifetime. It is difficult to capture any single facet…


06 August 2018 | 00:56:22


Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41 - E41

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this…


30 July 2018 | 00:29:14


Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40 - E40

Summary

When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and…


16 July 2018 | 00:48:31


Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39 - E39

Summary

Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide…


08 July 2018 | 01:04:16


Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38 - E38

Summary

Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled…


02 July 2018 | 00:46:14


Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37 - E37

Summary

Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and…


25 June 2018 | 00:41:43


User Analytics In Depth At Heap with Dan Robinson - Episode 36 - E36

Summary

Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, have to write custom logic to add that event, and then wait to collect data. Heap is a platform that automatically tracks every…


17 June 2018 | 00:45:27


CockroachDB In Depth with Peter Mattis - Episode 35 - E35

Summary

With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address…


11 June 2018 | 00:43:41


ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34 - E34

Summary

Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the…


04 June 2018 | 00:40:05