Data Engineering Podcast

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

About the show

This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

  • Let The Whole Team Participate In Data With The Quilt Versioned Data Hub

    February 11th, 2023  |  52 mins 2 secs

    Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.

  • Reflecting On The Past 6 Years Of Data Engineering

    February 5th, 2023  |  32 mins 21 secs

    This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.

  • Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics

    January 29th, 2023  |  50 mins 43 secs

Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that business stakeholders use to understand and direct the organization, but the process is very labor and time intensive. The team at Omni has taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.

  • Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI

    January 22nd, 2023  |  45 mins 40 secs

    The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.

  • Building Applications With Data As Code On The DataOS

    January 15th, 2023  |  48 mins 36 secs

The modern data stack has made it more economical to use enterprise-grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.

  • Automate Your Pipeline Creation For Streaming Data Transformations With SQLake

    January 8th, 2023  |  44 mins 5 secs

Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grow. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.

  • Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

    December 29th, 2022  |  59 mins 21 secs

    Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.

  • Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams

    December 28th, 2022  |  58 mins 45 secs

With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long-term improvements in your productivity that it provides.

  • An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch

    December 25th, 2022  |  1 hr 11 mins

    Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.

  • Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems

    December 25th, 2022  |  1 hr 8 mins

Encryption and security are critical elements in data analytics and machine learning applications. We have well-developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.

  • Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle

    December 18th, 2022  |  1 hr 5 mins

    The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.

  • Making Sense Of The Technical And Organizational Considerations Of Data Contracts

    December 18th, 2022  |  47 mins

One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in applying these constraints to your data workflows.

  • Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee

    December 11th, 2022  |  53 mins 45 secs

    An interview with Frank Liu about how the open source Towhee library simplifies the work of building pipelines to generate vector embeddings of your data for building machine learning projects.

  • Run Your Applications Worldwide Without Worrying About The Database With Planetscale

    December 11th, 2022  |  49 mins 40 secs

    An interview with Nick van Wiggeren about the Planetscale serverless MySQL service built on top of the open source Vitess project and the impact on developer productivity that it offers when you don't have to worry about database operations.

  • Business Intelligence In The Palm Of Your Hand With Zing Data

    December 4th, 2022  |  46 mins 46 secs

    An interview with Sabin Thomas about how Zing Data lets you bring business intelligence with you when you're on the go, with first-class support for mobile devices.

  • Adopting Real-Time Data At Organizations Of Every Size

    December 4th, 2022  |  50 mins 24 secs

    An interview with Arjun Narayan about how to enable organizations of all sizes to take advantage of real-time data, including the technical and organizational investments required to make it happen.