Data Engineering Podcast


This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

485 Episodes

A High Performance Platform For The Full Big Data Lifecycle - E94

Summary

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully…

19 August 2019 | 01:13:46


Digging Into Data Replication At Fivetran - E93

Summary

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work…

12 August 2019 | 00:44:41


Solving Data Discovery At Lyft - E92

Summary

Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate, it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection…

05 August 2019 | 00:51:48


Simplifying Data Integration Through Eventual Connectivity - E91

Summary

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand, it may be time to identify new…

29 July 2019 | 00:53:47


Straining Your Data Lake Through A Data Mesh - E90

Summary

The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and…

22 July 2019 | 01:04:28


Data Labeling That You Can Feel Good About With CloudFactory - E89

Summary

Successful machine learning and artificial intelligence projects require large volumes of properly labeled data. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems…

15 July 2019 | 00:57:50


Scale Your Analytics On The ClickHouse Data Warehouse - E88

Summary

The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is…

08 July 2019 | 01:11:19


Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection - E87

Summary

Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he…

02 July 2019 | 00:38:03


The Workflow Engine For Data Engineers And Data Scientists - E86

Summary

Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data…

25 June 2019 | 01:08:26


Maintaining Your Data Lake At Scale With Spark - E85

Summary

Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and inconsistent quality in your analytics.…

17 June 2019 | 00:50:50