Data Engineering Podcast

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

About the show

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

  • Unpacking The Seven Principles Of Modern Data Pipelines

    August 13th, 2023  |  47 mins 2 secs

    Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.

  • Quantifying The Return On Investment For Your Data Team

    August 6th, 2023  |  1 hr 1 min

    As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.

  • Strategies For A Successful Data Platform Migration

    July 30th, 2023  |  1 hr 9 mins

    All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.

  • Build Real Time Applications With Operational Simplicity Using Dozer

    July 23rd, 2023  |  40 mins 42 secs

    Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data in reach of application engineers Matteo Pelati helped to create Dozer. In this episode he explains how investing in high performance and operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.

  • Datapreneurs - How Todays Business Leaders Are Using Data To Define The Future

    July 16th, 2023  |  54 mins 45 secs

    Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.

  • Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling

    July 9th, 2023  |  1 hr 12 mins

    For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.

  • How Data Engineering Teams Power Machine Learning With Feature Platforms

    July 3rd, 2023  |  1 hr 3 mins
    feature engineering

    Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.

  • Seamless SQL And Python Transformations For Data Engineers And Analysts With SQLMesh

    June 25th, 2023  |  50 mins 19 secs
    dataops

    Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.

  • How Column-Aware Development Tooling Yields Better Data Models

    June 17th, 2023  |  46 mins 19 secs

    Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. By incorporating column-level lineage in the data modeling process it encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.

  • Build Better Tests For Your dbt Projects With Datafold And data-diff

    June 11th, 2023  |  48 mins 21 secs

    Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.

  • Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service

    June 4th, 2023  |  54 mins 5 secs
    dataops, vendor

    A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.

  • A Roadmap To Bootstrapping The Data Team At Your Startup

    May 28th, 2023  |  42 mins 31 secs
    data teams

    Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.

  • Keep Your Data Lake Fresh With Real Time Streams Using Estuary

    May 21st, 2023  |  55 mins 50 secs

    Batch vs. streaming is a long running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.

  • What Happens When The Abstractions Leak On Your Data

    May 14th, 2023  |  26 mins 41 secs

    All of the advancements in our technology is based around the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observances on how to deal with that situation in a data platform architecture.

  • Use Consistent And Up To Date Customer Profiles To Power Your Business With Segment Unify

    May 7th, 2023  |  54 mins 34 secs
    customer data platform

    Every business has customers, and a critical element of success is understanding who they are and how they are using the companies products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions and stitching that together is complex and time consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.

  • Realtime Data Applications Made Easier With Meroxa

    April 23rd, 2023  |  45 mins 26 secs

    Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.