Data Pipelines

Making Data Pipelines Self-Serve For Everyone With Shipyard - Episode 191

Every part of the business relies on data, yet only a small team has the context and expertise to build workflows and pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers you need a way for everyone to get access to the information they care about. To help make that a more tractable problem Blake Burch co-founded Shipyard. In this episode he explains the utility of a low code solution that lets non engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.

Read More

Moving Machine Learning Into The Data Pipeline at Cherre - Episode 181

Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy data for Addresses by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.

Read More

Real World Change Data Capture At Datacoral - Episode 177

The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. In this episode Raghu Murthy, founder and CEO of Datacoral, does a deep dive on how he and his team manage change data capture pipelines in production.

Read More

Self Service Open Source Data Integration With AirByte - Episode 173

Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrasingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open soure data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.

Read More

How Shopify Is Building Their Production Data Warehouse Using DBT - Episode 171

With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how the structured the project to allow for multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.

Read More

Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch - Episode 168

The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.

Read More

Streaming Data Integration Without The Code at Equalum - Episode 161

The first stage of every good pipeline is to perform data integration. With the increasing pace of change and the need for up to date analytics the need to integrate that data in near real time is growing. With the improvements and increased variety of options for streaming data engines and improved tools for change data capture it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.

Read More

Keeping A Bigeye On The Data Quality Market - Episode 160

One of the oldest aphorisms about data is “garbage in, garbage out”, which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid then this episode is definitely worth a listen.

Read More

Add Version Control To Your Data Lake With LakeFS - Episode 157

Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.

Read More

Cutting Through The Noise And Focusing On The Fundamentals Of Data Engineering With The Data Janitor - Episode 151

Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.

Read More