Data Infrastructure

Automating Your Production Dataflows On Spark - Episode 105

As data engineers, the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production-grade, scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation for understanding all of the incidental engineering that is necessary to make your data reliable.

Data Orchestration For Hybrid Cloud Analytics - Episode 103

The scale and complexity of the systems that we build to satisfy business requirements are increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases, it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry, and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.

Fast Analytics On Semi-Structured And Structured Data In The Cloud - Episode 101

The process of exposing your data through a SQL interface has many possible pathways, each with its own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.

Building A Reliable And Performant Router For Observability Data - Episode 97

The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.

A High Performance Platform For The Full Big Data Lifecycle - Episode 94

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise-grade analytics, it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise-grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well-engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Simplifying Data Integration Through Eventual Connectivity - Episode 91

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.

Straining Your Data Lake Through A Data Mesh - Episode 90

The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.

Scale Your Analytics On The Clickhouse Data Warehouse - Episode 88

The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection - Episode 87

Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings on how well these tools worked together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service by benchmarking an application that has real-world utility.
