RudderStack

What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role? - Episode 281

Putting machine learning models into production and keeping them there requires investing in well-managed systems to manage the full lifecycle of data cleaning, training, deployment and monitoring. This requires a repeatable and evolvable set of processes to keep it functional. The term MLOps has been coined to encapsulate all of these principles and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.

Read More

Synthetic Data As A Service For Simplifying Privacy Engineering With Gretel - Episode 279

Any time that you are storing data about people there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.

Read More

Repeatable Patterns For Designing Data Platforms And When To Customize Them - Episode 277

Building a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.

Read More

Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera - Episode 275

Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise grade solution for cloud and hybrid data governance built on top of the robust and battle tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.

Read More

Exploring Incident Management Strategies For Data Teams - Episode 274

Data assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and management of up-time similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.

Read More

Taking A Multidimensional Approach To Data Observability At Acceldata - Episode 272

Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.

Read More

Move Your Database To The Data And Speed Up Your Analytics With DuckDB - Episode 270

When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP applications to speed up your analytical queries that meets you where you are, whether that’s Python, R, Java, even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.

Read More

Reflections On Designing A Data Platform From Scratch - Episode 268

Building a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and what he has learned from running the podcast that is influencing his choices.

Read More

Understanding The Immune System With Data At ImmunAI - Episode 265

The life sciences as an industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.

Read More