Streaming

Navigating Boundless Data Streams With The Swim Kernel - Episode 98

The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.

Read More

Building A Reliable And Performant Router For Observability Data - Episode 97

The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.

Read More

Customer Analytics At Scale With Segment - Episode 72

Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.

Read More

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fullfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different than that of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuk explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the hosts mind is blown by the possibilities of treating everything, including schema information, as a stream.

Read More

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.

Read More

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.

Read More

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the lanscape of stream processing tools, and how you can start using it today.

Read More

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Read More

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.

Read More

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Read More