RudderStack

Strategies For Proactive Data Quality Management - Episode 205

Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.

Exploring The Design And Benefits Of The Modern Data Stack - Episode 203

We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and “best practices” to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the “Modern Data Stack”. In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and the benefits of bringing engineers and business users together with data.

Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager - Episode 201

At the core of every data workflow is an orchestration engine (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, and spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.

A Candid Exploration Of Timeseries Data Analysis With InfluxDB - Episode 199

While the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.

Make Database Performance Optimization A Playful Experience With OtterTune - Episode 197

The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.

Accelerating Machine Learning Training And Delivery With In-Database ML - Episode 195

When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML, then this conversation is for you.

Make Sure Your Records Are Reliable With The BookKeeper Distributed Storage Layer - Episode 193

The way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators couldn’t have imagined. One such component that has gone above and beyond its originally envisioned use case is BookKeeper, a distributed storage system that is optimized for durability and speed. In this episode Matteo Merli shares the story behind the creation of BookKeeper, the various ways that it is being used today, and the architectural aspects that make it such a strong building block for projects such as Pulsar. He also shares some of the other interesting systems that have been built on top of it and an amusing war story of running it at scale in its early years.

Making Data Pipelines Self-Serve For Everyone With Shipyard - Episode 191

Every part of the business relies on data, yet only a small team has the context and expertise to build workflows and pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers, you need a way for everyone to get access to the information they care about. To help make that a more tractable problem Blake Burch co-founded Shipyard. In this episode he explains the utility of a low-code solution that lets non-engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.

Easily Build Advanced Similarity Search With The Pinecone Vector Database - Episode 189

Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems, their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.

Unlocking The Power of Data Lineage In Your Platform with OpenLineage - Episode 187

Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed, you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems, Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in cooperating on open standards.
