This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
January 22nd, 2023 | 45 mins 40 secs
The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time-consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
January 15th, 2023 | 48 mins 36 secs
The modern data stack has made it more economical to use enterprise-grade technologies to power analytics at organizations of every scale. Unfortunately, it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
January 8th, 2023 | 44 mins 5 secs
Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grow. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.
Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI
December 29th, 2022 | 59 mins 21 secs
Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increase, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.
December 28th, 2022 | 58 mins 45 secs
With all of the messaging about treating data as a product, it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long-term improvements in your productivity that it provides.
Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems
December 25th, 2022 | 1 hr 8 mins
data analytics, encryption, machine learning, security
Encryption and security are critical elements in data analytics and machine learning applications. We have well-developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.
December 25th, 2022 | 1 hr 11 mins
Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone, Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.
December 18th, 2022 | 47 mins
One of the reasons that data work is so challenging is that no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines, some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in applying these constraints to your data workflows.
December 18th, 2022 | 1 hr 5 mins
The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
December 11th, 2022 | 53 mins 45 secs
An interview with Frank Liu about how the open source Towhee library simplifies the work of building pipelines to generate vector embeddings of your data for building machine learning projects.
December 11th, 2022 | 49 mins 40 secs
An interview with Nick van Wiggeren about the PlanetScale serverless MySQL service built on top of the open source Vitess project and the impact on developer productivity that it offers when you don't have to worry about database operations.
December 4th, 2022 | 46 mins 46 secs
An interview with Sabin Thomas about how Zing Data lets you bring business intelligence with you when you're on the go with first-class support for mobile devices.
December 4th, 2022 | 50 mins 24 secs
An interview with Arjun Narayan about how to enable organizations of all sizes to take advantage of real-time data, including the technical and organizational investments required to make it happen.
November 27th, 2022 | 50 mins 25 secs
An interview with Wes McKinney about his work at Voltron Data to support and grow the Arrow project and its integration with the broader data ecosystem.
November 27th, 2022 | 59 mins 24 secs
An interview with Matt Jaffee about FeatureBase, an open source bitmap database that allows you to query and analyze massive data sets at interactive speeds and the work they have done to simplify integration with the rest of your data platform.
November 20th, 2022 | 46 mins 46 secs
An interview with Salma Bakouk about how to use data entropy as a model for identifying and resolving problems in your data platform before they occur and Sifflet's approach to full stack data observability.