Open Source

Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte - Episode 291

Machine learning has become a meaningful target for data applications, bringing with it an increase in the complexity of orchestrating the entire data flow. Flyte is a project that was started at Lyft to address their internal machine learning needs, and it integrates closely with Kubernetes as the execution manager. In this episode, Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that have made Flyte successful.

Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs - Episode 283

There are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it functions to provide detailed context for being able to gain insight into all of your data processes.

Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera - Episode 275

Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds stress to an already complex endeavor. Privacera is an enterprise-grade solution for cloud and hybrid data governance built on top of the robust and battle-tested Apache Ranger project. In this episode, Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.

Accelerate Your Embedded Analytics With Apache Pinot - Episode 273

Data and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimizing for low-latency, highly concurrent queries on rapidly updating datasets. In this episode, Kishore Gopalakrishna and Xiang Fu explain how it achieves those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high-throughput data workloads today.

Move Your Database To The Data And Speed Up Your Analytics With DuckDB - Episode 270

When you think about selecting a database engine for your project, you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single-user workloads. DuckDB is an in-process database engine optimized for OLAP applications that speeds up your analytical queries and meets you where you are, whether that’s Python, R, Java, or even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.

Build Your Python Data Processing Your Way And Run It Anywhere With Fugue - Episode 266

Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge, the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode, core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.

Build Your Own End To End Customer Data Platform With Rudderstack - Episode 263

Collecting, integrating, and activating data are all challenging activities. When that data pertains to your customers, it can become even more complex. To simplify the work of managing the full flow of your customer data and keep you in full control, the team at Rudderstack created their eponymous open source platform, which allows you to work with first and third party data as well as build and manage reverse ETL workflows. In this episode, CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap, how to set it up for your own data needs, and how it is architected to scale to meet demand.

Open Source Reverse ETL For Everyone With Grouparoo - Episode 254

Reverse ETL is a product category that evolved from the landscape of customer data platforms, with a number of companies offering their own implementation of it. While struggling with the work of automating data integration workflows with marketing, sales, and support tools, Brian Leonard accidentally discovered this need himself and turned it into the open source framework Grouparoo. In this episode, he explains why he decided to turn these efforts into an open core business, how the platform is implemented, and the benefits of having an open source contender in the landscape of operational analytics products.

Creating Shared Context For Your Data Warehouse With A Controlled Vocabulary - Episode 252

Communication and shared context are the hardest parts of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record that you must consult to find the necessary information. In this episode, Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.

Fast And Flexible Headless Data Analytics With Cube.JS - Episode 248

One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode, Artyom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.
