Data Engineering Podcast


This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

469 Episodes

Cleaning And Curating Open Data For Archaeology - E68

Summary

Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that…

04 February 2019 | 01:00:56


Managing Database Access Control For Teams With strongDM - E67

Summary

Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After years of running across the…

29 January 2019 | 00:42:18


Building Enterprise Big Data Systems At LEGO - E66

Summary

Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and…

21 January 2019 | 00:48:04


TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65 - E65

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases…

14 January 2019 | 00:41:26


Performing Fast Data Analytics Using Apache Kudu - Episode 64 - E64

Summary

The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes…

07 January 2019 | 00:50:47


Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63 - E63

Summary

As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch oriented workloads. To address…

31 December 2018 | 00:44:42


Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62 - E62

Summary

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your…

24 December 2018 | 01:03:52


Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61 - E61

Summary

Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data…

17 December 2018 | 00:39:22


Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60 - E60

Summary

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book…

10 December 2018 | 00:50:31


Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59 - E59

Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it…

03 December 2018 | 00:54:25