Census

Decoupling Data Operations From Data Infrastructure Using Nexla - Episode 215

The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.

Migrate And Modify Your Data Platform Confidently With Compilerworks - Episode 213

A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.

Charting A Path For Streaming Data To Fill Your Data Lake With Hudi - Episode 209

Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures and for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.

Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax - Episode 207

Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data-intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.

Strategies For Proactive Data Quality Management - Episode 205

Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.

Exploring The Design And Benefits Of The Modern Data Stack - Episode 203

We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and “best practices” to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the “Modern Data Stack”. In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and discuss the benefits of bringing engineers and business users together with data.

Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager - Episode 201

At the core of every data workflow is an orchestration engine (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, and spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.

A Candid Exploration Of Timeseries Data Analysis With InfluxDB - Episode 199

While the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.

Make Database Performance Optimization A Playful Experience With OtterTune - Episode 197

The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.