Data Engineering Podcast

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

About the show

This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

  • Reconciling The Data In Your Databases With Datafold

    March 17th, 2024  |  58 mins 14 secs

    A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
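
    To make the failure modes concrete, a coarse first-pass reconciliation is to compare row counts and key ranges between the two engines before digging into row-level diffs. Below is a minimal Python sketch using SQLAlchemy, assuming hypothetical connection URLs, an `orders` table, and installed drivers for both dialects; it illustrates the general technique rather than Datafold's own tooling.

    ```python
    import sqlalchemy as sa

    # Hypothetical connection URLs and table; substitute your own.
    SOURCE_URL = "postgresql://user:pass@source-host/shop"
    DEST_URL = "snowflake://user:pass@account/analytics"
    TABLE, KEY = "orders", "id"

    def fingerprint(url: str, table: str, key: str) -> tuple:
        """Collect coarse, dialect-portable aggregates for a table."""
        engine = sa.create_engine(url)
        query = sa.text(f"SELECT COUNT(*), MIN({key}), MAX({key}) FROM {table}")
        with engine.connect() as conn:
            return tuple(conn.execute(query).one())

    src = fingerprint(SOURCE_URL, TABLE, KEY)
    dst = fingerprint(DEST_URL, TABLE, KEY)
    if src != dst:
        print(f"Mismatch: source={src} destination={dst}")
    ```

    A matching fingerprint doesn't prove the tables are identical, but a mismatch is a cheap, dialect-agnostic signal that a row-level comparison is needed.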

  • Version Your Data Lakehouse Like Your Software With Nessie

    March 10th, 2024  |  40 mins 55 secs

    Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
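
    As an illustration of those semantics, here is a minimal sketch using Nessie's Spark SQL extensions, assuming a `spark` session already configured with an Iceberg catalog named `nessie`, the Nessie SQL extensions enabled, and a hypothetical `sales.orders` table.

    ```python
    # Create an isolated branch for this ETL run, like `git checkout -b`.
    spark.sql("CREATE BRANCH IF NOT EXISTS etl_run IN nessie FROM main")
    spark.sql("USE REFERENCE etl_run IN nessie")

    # Writes land on the branch and stay invisible to readers of `main`...
    spark.sql("INSERT INTO nessie.sales.orders VALUES (1001, 'pending')")

    # ...until the run is validated and merged back, like a Git merge.
    spark.sql("MERGE BRANCH etl_run INTO main IN nessie")
    ```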

  • When And How To Conduct An AI Program

    March 3rd, 2024  |  46 mins 25 secs

    Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.

  • Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

    February 25th, 2024  |  56 mins

    Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems, a number of common components have emerged that are shared across implementations. When Paul Dix decided to rewrite the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, DataFusion, and Parquet to lay the foundation of the newest version of his time-series database.
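
    To get a feel for how those building blocks compose, here is a minimal sketch using the DataFusion Python bindings, assuming a hypothetical Parquet file of CPU metrics; it illustrates the ecosystem's division of labor, not InfluxDB's internals.

    ```python
    from datafusion import SessionContext

    # DataFusion plans and executes the SQL; Parquet is the storage format
    # and Arrow is the in-memory representation throughout.
    ctx = SessionContext()
    ctx.register_parquet("cpu", "metrics/cpu.parquet")  # hypothetical file

    df = ctx.sql("""
        SELECT host, date_trunc('hour', ts) AS hour, avg(usage) AS avg_usage
        FROM cpu
        GROUP BY host, date_trunc('hour', ts)
        ORDER BY hour
    """)
    df.show()
    ```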

  • Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

    February 18th, 2024  |  58 mins 46 secs

    A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offers the ease of use and execution speed of data warehouses with the effectively unlimited storage and scalability of data lakes.
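
    For a sense of what that looks like from the user's side, here is a minimal sketch using the Trino Python client, assuming a hypothetical coordinator host and an Iceberg catalog already configured on the cluster.

    ```python
    import trino  # pip install trino

    # Hypothetical coordinator, catalog, and schema names.
    conn = trino.dbapi.connect(
        host="trino.example.com",
        port=8080,
        user="analyst",
        catalog="iceberg",
        schema="lakehouse",
    )
    cur = conn.cursor()

    # Plain warehouse-style SQL over Iceberg tables in object storage.
    cur.execute("""
        SELECT region, count(*) AS order_count
        FROM orders
        GROUP BY region
        ORDER BY order_count DESC
    """)
    for row in cur.fetchall():
        print(row)
    ```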

  • Data Sharing Across Business And Platform Boundaries

    February 11th, 2024  |  59 mins 55 secs

    Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied, as well as technical considerations to address, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.

  • Tackling Real Time Streaming Data With SQL Using RisingWave

    February 4th, 2024  |  56 mins 55 secs

    Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
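
    Because RisingWave speaks the Postgres wire protocol, a standard client is all you need to define a continuously maintained view. Here is a minimal sketch with psycopg2, assuming a hypothetical host and an existing `page_views` source.

    ```python
    import psycopg2  # RisingWave is wire-compatible with Postgres

    conn = psycopg2.connect(host="risingwave.example.com", port=4566,
                            user="root", dbname="dev")
    conn.autocommit = True
    cur = conn.cursor()

    # The view is maintained incrementally as events arrive, so reading it
    # is a cheap lookup rather than a full recomputation over the stream.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS views_per_minute AS
        SELECT window_start, count(*) AS views
        FROM tumble(page_views, event_time, INTERVAL '1 minute')
        GROUP BY window_start
    """)
    cur.execute("SELECT * FROM views_per_minute ORDER BY window_start DESC LIMIT 5")
    print(cur.fetchall())
    ```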

  • Build A Data Lake For Your Security Logs With Scanner

    January 28th, 2024  |  1 hr 2 mins

    Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. Most of the available products either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of log data at high scale for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.

  • Modern Customer Data Platform Principles

    January 21st, 2024  |  1 hr 1 min

    Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).

  • Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel

    January 7th, 2024  |  50 mins 26 secs

    Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data being generated continue to double, requiring further advancements in platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we face and how his research is aimed at helping to solve them.

  • Designing Data Platforms For Fintech Companies

    December 31st, 2023  |  47 mins 56 secs

    Working with financial data requires a high degree of rigor due to the numerous regulations and the risks posed by security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.

  • Troubleshooting Kafka In Production

    December 24th, 2023  |  1 hr 14 mins

    Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
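
    One classic early-warning signal for a struggling Kafka deployment is consumer lag. Here is a minimal sketch of measuring it with the confluent-kafka Python client, assuming hypothetical broker, group, and topic names and a three-partition topic.

    ```python
    from confluent_kafka import Consumer, TopicPartition

    # Hypothetical broker and consumer group; adjust the partition count.
    consumer = Consumer({
        "bootstrap.servers": "kafka.example.com:9092",
        "group.id": "orders-processor",
    })

    # Lag = latest offset in the partition minus the group's committed offset.
    partitions = [TopicPartition("orders", p) for p in range(3)]
    for tp in consumer.committed(partitions):
        low, high = consumer.get_watermark_offsets(tp)
        committed = tp.offset if tp.offset >= 0 else low
        print(f"partition {tp.partition}: lag = {high - committed}")
    consumer.close()
    ```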

  • Adding An Easy Mode For The Modern Data Stack With 5X

    December 17th, 2023  |  56 mins 12 secs

    The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.

  • Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

    December 10th, 2023  |  51 mins 17 secs

    If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business-critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
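
    The core idea is simple, even if productionizing it (scheduling, storage, alert routing) is the hard part. Here is a minimal z-score check in plain Python, a generic illustration of the technique rather than Anomstack's own API.

    ```python
    import statistics

    def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
        """Flag `latest` if it is more than `threshold` standard deviations
        away from the mean of the recent history (a simple z-score test)."""
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return latest != mean
        return abs(latest - mean) / stdev > threshold

    # Daily signups hovering around 200, then a sudden collapse:
    signups = [203.0, 198.0, 210.0, 195.0, 207.0, 201.0, 199.0]
    print(is_anomalous(signups, 24.0))   # True -> fire an alert
    print(is_anomalous(signups, 205.0))  # False -> business as usual
    ```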

  • Designing Data Transfer Systems That Scale

    December 4th, 2023  |  1 hr 3 mins

    The first step of any data pipeline is moving the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.

  • Addressing The Challenges Of Component Integration In Data Platform Architectures

    November 26th, 2023  |  29 mins 42 secs

    Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.