Latest Episodes

Bringing Automation To Data Labeling For Machine Learning With Watchful - Episode 316

Data engineers have typically left the process of data labeling to data scientists or other roles because it is a manual, process-heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform that makes labeling a repeatable and scalable process that relies on codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation and how it allows data engineers to be involved...

Useful Lessons And Repeatable Patterns Learned From Data Mesh Implementations At AgileLab - Episode 314

Data mesh is a frequent topic of conversation in the data community, with many debates about how and when to employ this architectural pattern. The team at AgileLab have first-hand experience helping large enterprise organizations evaluate and implement their own data mesh strategies. In this episode Paolo Platter shares the lessons they have learned in that process, the Data Mesh Boost platform that they have built to reduce some of the boilerplate required to make it successful, and some of the considerations to make when deciding if a data...

Optimize Your Machine Learning Development And Serving With The Open Source Vector Database Milvus - Episode 313

The optimal format for storage and retrieval of data is dependent on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques. For machine learning applications relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. These platforms store direct representations of the vector embeddings that machine learning models rely on for computing relevant predictions so that there is no additional processing required to...

Interactive Exploratory Data Analysis On Petabyte Scale Data Sets With Arkouda - Episode 311

Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the...

Writing The Book That Offers A Single Reference For The Fundamentals Of Data Engineering - Episode 310

Data engineering is a difficult job, requiring a large number of skills that often don't overlap. Any effort to understand how to start a career in the role has required stitching together information from a multitude of resources that might not all agree with each other. To provide a single reference for anyone tasked with data engineering responsibilities, Joe Reis and Matt Housley took it upon themselves to write the book "Fundamentals of Data Engineering". In this episode they share their experiences researching and distilling the lessons...

Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster - Episode 309

The current stage of evolution in the data management ecosystem has resulted in domain- and use-case-specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools' dependency graphs into one place through its "software-defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster's 1.0 release, and the new features coming with Dagster Cloud's...

Making The Total Cost Of Ownership For External Data Manageable With Crux - Episode 308

Extensive and valuable data sets are available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to...

Joe Reis Turns The Tables And Interviews Tobias Macey About The Data Engineering Podcast - Episode 307

Data engineering is a large and growing subject, with new technologies, specializations, and "best practices" emerging at an accelerating pace. This podcast does its best to explore this fractal ecosystem, and has been at it for the past 5+ years. In this episode Joe Reis, founder of Ternary Data and co-author of "Fundamentals of Data Engineering", turns the tables and interviews the host, Tobias Macey, about his journey into podcasting, how he runs the show behind the scenes, and the other things that occupy his time.
