Case Studies

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection - Episode 87

Summary

Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service by benchmarking an application that has real-world utility.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades, and so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
    • What are some example cases where anomaly detection is useful or necessary?
  • Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
  • What were your selection criteria for the various components of your system design?
    • What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
      • If you were to start over today would you do any of it differently?
  • Can you talk through the algorithm that you used for detecting anomalous activity?
    • What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
  • What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
  • What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
    • How did those bottlenecks differ as you moved through different levels of scale?
  • What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
  • What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
  • How have those lessons fed back to your work at Instaclustr?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Evolving An ETL Pipeline For Better Productivity - Episode 83

Summary

Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
  • Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
    • What are your primary sources of data and what are the targets that you are loading them into?
  • What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
    • What were your criteria for your replacement technology and how did you gather and evaluate your options?
  • Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
    • What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
    • What were the big wins?
  • What was your evaluation framework for determining whether your re-engineering was successful?
  • Now that you are using DataCoral how would you characterize the experiences of yourself and your team?
    • If you have freed up time for your engineers, how are you allocating that spare capacity?
  • What do you hope to see from DataCoral in the future?
  • What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links


Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61

Summary

Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your definition of a data pipeline?
    • At what point in the life of a project or organization should you start thinking about building a pipeline?
  • In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
    • What metrics/use cases should you be optimizing for at this point?
  • What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
    • How do the design requirements for a data pipeline change as you reach this stage?
    • What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
  • What are some of the changes that are necessary as you move to a large scale data pipeline?
  • At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
  • In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
  • When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
  • Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
  • What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
    • How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
  • What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
  • What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
  • What are your plans for improving your current pipeline at Grubhub?
  • What are some references that you recommend for anyone who is designing a new data platform?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links


Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

Summary

Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data-oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
    • Where are you using notebooks and where are you not?
  • What is the technical infrastructure that you have built to support that design choice?
  • Which team was driving the effort?
    • Was it difficult to get buy in across teams?
  • How much shared code have you been able to consolidate or reuse across teams/roles?
  • Have you investigated the use of any of the other notebook platforms for similar workflows?
  • What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
  • What are some of the limitations of the notebook environment for the work that you are doing?
  • What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
  • What are some of the projects that are ongoing or planned for the future that you are most excited by?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links


Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time-consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
    • How do you define the concept of a knowledge graph?
  • What are the processes involved in constructing a knowledge graph?
  • Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
  • What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
    • How do you manage the software lifecycle for your ETL code?
    • What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
  • What are the current challenges that you are facing in building and scaling your data infrastructure?
    • How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
    • What techniques are you using to manage accuracy and consistency in the data that you ingest?
  • Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
  • What are the weak spots in your platform that you are planning to address in upcoming projects?
    • If you were to start from scratch today, what would you have done differently?
  • What are some of the most interesting or unexpected uses of your product that you have seen?
  • What is in store for the future of Enigma?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links


Putting Airflow Into Production With James Meickle - Episode 43

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was your initial project requirement?
    • What tooling did you consider in addition to Airflow?
    • What aspects of the Airflow platform led you to choose it as your implementation target?
  • Can you describe your current deployment architecture?
    • How many engineers are involved in writing tasks for your Airflow installation?
  • What resources were the most helpful while learning about Airflow design patterns?
    • How have you architected your DAGs for deployment and extensibility?
  • What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?
  • What are some of the dead-ends or other pitfalls that you encountered during the course of this project?
  • What aspects of Airflow have you found to be lacking that you would like to see improved?
  • What did you wish someone had told you before you started work on your Airflow installation?
    • If you were to start over would you make the same choice?
    • If Airflow wasn’t available what would be your second choice?
  • What are your next steps for improvements and fixes?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links


User Analytics In Depth At Heap with Dan Robinson - Episode 36

Summary

Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is realizing that you haven’t been tracking a key interaction, then having to write custom logic to add that event and wait to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving a brief overview of Heap?
  • One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?
  • Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?
  • Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?
    • What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?
  • What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?
    • What challenges does that pose in your processing architecture?
  • What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?
    • How has that architecture changed or evolved over the life of the company?
    • What are some changes that you are anticipating in the near future?
  • Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?
  • What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?
  • What changes have been necessary as a result of GDPR?
  • What are your plans for the future of Heap?

Contact Info

  • @danlovesproofs on twitter
  • @drob on github
  • heapanalytics.com / @heap on twitter
  • https://heapanalytics.com/blog/category/engineering

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links


ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

Summary

Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate, monitors for unexpected anomalies in behavior that would indicate a breach, and notifies you in near real time. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Why don’t you start by explaining what ThreatStack does?
    • What was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?
  • Can you describe the type(s) of data that you collect and how it is structured?
  • What is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?
    • How do you ensure a consistent format of the information that you receive?
    • How do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended?
    • How much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?
  • I understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?
    • How much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?
  • How do you ensure the security and provenance of the data that you collect as it traverses your infrastructure?
  • What are some of the most common vulnerabilities that you detect in your clients’ infrastructure?
  • For someone who wants to start using ThreatStack, what does the setup process look like?
  • What have you found to be the most challenging aspects of building and managing the data processes in your environment?
  • What are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links


Honeycomb Data Infrastructure with Sam Stokes - Episode 20

Summary

One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in your production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
  • A few announcements:
    • There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
    • The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
    • If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% on your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
  • Your host is Tobias Macey and today I’m interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is Honeycomb and how did you get started at the company?
  • Can you start by giving an overview of your data infrastructure and the path that an event takes from ingest to graph?
  • What are the characteristics of the event data that you are dealing with and what challenges does it pose in terms of processing it at scale?
  • In addition to the complexities of ingesting and storing data with a high degree of cardinality, being able to quickly analyze it for customer reporting poses a number of difficulties. Can you explain how you have built your systems to facilitate highly interactive usage patterns?
  • A high degree of visibility into a running system is desirable for developers and systems administrators, but they are not always willing or able to invest the effort to fully instrument the code or servers that they want to track. What have you found to be the most difficult aspects of data collection, and do you have any tooling to simplify the implementation for users?
  • How does Honeycomb compare to other systems that are available off the shelf or as a service, and when is it not the right tool?
  • What have been some of the most challenging aspects of building, scaling, and marketing Honeycomb?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
