Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.


02 July 2019

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection - E87



Summary

Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.
Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
- What are some example cases where anomaly detection is useful or necessary?
Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
What were your selection criteria for the various components of your system design?
- What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
  - If you were to start over today would you do any of it differently?
Can you talk through the algorithm that you used for detecting anomalous activity?
- What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
- How did those bottlenecks differ as you moved through different levels of scale?
What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
How have those lessons fed back to your work at Instaclustr?