Data Engineering Podcast


This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

471 Episodes

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60 - E60

Summary

Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book…

10 December 2018 | 00:50:31


Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59 - E59

Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it…

03 December 2018 | 00:54:25


Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58 - E58

Summary

When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this…

26 November 2018 | 00:39:18


Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57 - E57

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is…

19 November 2018 | 00:48:02


How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56 - E56

Summary

A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are…

11 November 2018 | 00:51:51


Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55 - E55

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern…

05 November 2018 | 00:58:04


Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54 - E54

Summary

Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix…

29 October 2018 | 00:40:55


Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53 - E53

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative…

22 October 2018 | 00:45:32


Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52 - E52

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which…

15 October 2018 | 00:53:46


Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - E51

Summary

One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same…

09 October 2018 | 00:56:55