Data Engineering Podcast

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

About the show

This show goes behind the scenes on the tools, techniques, and difficulties of the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics you will find here.


  • Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI

    January 22nd, 2023  |  45 mins 40 secs

The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time-consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.

  • Building Applications With Data As Code On The DataOS

    January 15th, 2023  |  48 mins 36 secs

The modern data stack has made it more economical to use enterprise-grade technologies to power analytics at organizations of every scale. Unfortunately, it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.

  • Automate Your Pipeline Creation For Streaming Data Transformations With SQLake

    January 8th, 2023  |  44 mins 5 secs

Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.

  • Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

    December 29th, 2022  |  59 mins 21 secs

Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increase, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.

  • Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams

    December 28th, 2022  |  58 mins 45 secs

With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he spends all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented, and the long-term improvements in your productivity that it provides.

  • Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems

    December 25th, 2022  |  1 hr 8 mins
    Tags: data analytics, encryption, machine learning, security

    Encryption and security are critical elements in data analytics and machine learning applications. We have well-developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming, and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.

  • An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch

    December 25th, 2022  |  1 hr 11 mins

    Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.

  • Making Sense Of The Technical And Organizational Considerations Of Data Contracts

    December 18th, 2022  |  47 mins

    One of the reasons that data work is so challenging is that no single person or team owns the entire process. This introduces friction into the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in applying these constraints to your data workflows.

  • Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle

    December 18th, 2022  |  1 hr 5 mins

    The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.

  • Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee

    December 11th, 2022  |  53 mins 45 secs

    An interview with Frank Liu about how the open source Towhee library simplifies the work of building pipelines to generate vector embeddings of your data for building machine learning projects.

  • Run Your Applications Worldwide Without Worrying About The Database With Planetscale

    December 11th, 2022  |  49 mins 40 secs

    An interview with Nick van Wiggeren about the Planetscale serverless MySQL service built on top of the open source Vitess project and the impact on developer productivity that it offers when you don't have to worry about database operations.

  • Business Intelligence In The Palm Of Your Hand With Zing Data

    December 4th, 2022  |  46 mins 46 secs

    An interview with Sabin Thomas about how Zing Data lets you bring business intelligence with you when you're on the go, with first-class support for mobile devices.

  • Adopting Real-Time Data At Organizations Of Every Size

    December 4th, 2022  |  50 mins 24 secs

    An interview with Arjun Narayan about how to enable organizations of all sizes to take advantage of real-time data, including the technical and organizational investments required to make it happen.

  • Supporting And Expanding The Arrow Ecosystem For Fast And Efficient Data Processing At Voltron Data

    November 27th, 2022  |  50 mins 25 secs

    An interview with Wes McKinney about his work at Voltron Data to support and grow the Arrow project and its integration with the broader data ecosystem.

  • Analyze Massive Data At Interactive Speeds With The Power Of Bitmaps Using FeatureBase

    November 27th, 2022  |  59 mins 24 secs

    An interview with Matt Jaffee about FeatureBase, an open source bitmap database that allows you to query and analyze massive data sets at interactive speeds, and the work they have done to simplify integration with the rest of your data platform.

  • Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet

    November 20th, 2022  |  46 mins 46 secs

    An interview with Salma Bakouk about how to use data entropy as a model for identifying and resolving problems in your data platform before they occur, and Sifflet's approach to full-stack data observability.