Data Pipelines

Building A Reliable And Performant Router For Observability Data - Episode 97

The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.

Read More

Building Tools And Platforms For Data Analytics - Episode 95

Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics and in this episode he explains the set of considerations and requirements that data analysts need in their tools and. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.

Read More

A High Performance Platform For The Full Big Data Lifecycle - Episode 94

Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Read More

Digging Into Data Replication At Fivetran - Episode 93

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.

Read More

Simplifying Data Integration Through Eventual Connectivity - Episode 91

The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.

Read More

The Workflow Engine For Data Engineers And Data Scientists - Episode 86

Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.

Read More

Data Lineage For Your Pipelines - Episode 82

Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.

Read More