Data Engineering Podcast

Episode Archive

419 episodes of Data Engineering Podcast since the first episode, which aired on January 7th, 2017.

  • Addressing The Challenges Of Component Integration In Data Platform Architectures

    November 26th, 2023  |  29 mins 42 secs

    Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.

  • Unlocking Your dbt Projects With Practical Advice For Practitioners

    November 19th, 2023  |  1 hr 16 mins

    The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.

  • Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine

    November 12th, 2023  |  1 hr 7 mins

    Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product, the ways that it enhances the ability of humans to get their work done, and the cases where humans have to adapt to the tool.

  • Shining Some Light In The Black Box Of PostgreSQL Performance

    November 5th, 2023  |  54 mins 51 secs

    Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.

  • Surveying The Market Of Database Products

    October 29th, 2023  |  47 mins 12 secs

    Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.

  • Defining A Strategy For Your Data Products

    October 22nd, 2023  |  1 hr 3 mins

    The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products as the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.

  • Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

    October 15th, 2023  |  1 hr 8 mins

    Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.

  • Using Data To Illuminate The Intentionally Opaque Insurance Industry

    October 8th, 2023  |  51 mins 58 secs

    The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.

  • Building ETL Pipelines With Generative AI

    October 1st, 2023  |  51 mins 36 secs

    Artificial intelligence applications require substantial high quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.

  • Powering Vector Search With Real Time And Incremental Vector Indexes

    September 24th, 2023  |  59 mins 16 secs

    The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.

  • Building Linked Data Products With JSON-LD

    September 17th, 2023  |  1 hr 1 min

    A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.

  • An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem

    September 10th, 2023  |  1 hr 1 min

    Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.

  • Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

    September 3rd, 2023  |  42 mins 12 secs

    Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.

  • Building An Internal Database As A Service Platform At Cloudflare

    August 27th, 2023  |  1 hr 1 min

    Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.

  • Harnessing Generative AI For Creating Educational Content With Illumidesk

    August 20th, 2023  |  54 mins 52 secs

    Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.

  • Unpacking The Seven Principles Of Modern Data Pipelines

    August 13th, 2023  |  47 mins 2 secs

    Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.