Building A Real Time Event Data Warehouse For Sentry - Episode 108

Summary

The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on ClickHouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating ClickHouse.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Ted Kaemming and James Cunningham about Snuba, the new open source search service at Sentry implemented on top of ClickHouse

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the internal and user-facing issues that you were facing at Sentry with the existing search capabilities?
    • What did the previous system look like?
  • What was your design criteria for building a new platform?
    • What was your initial list of possible system components and what was your evaluation process that resulted in your selection of ClickHouse?
  • Can you describe the system architecture of Snuba and some of the ways that it differs from your initial ideas of how it would work?
    • What have been some of the sharp edges of ClickHouse that you have had to engineer around?
    • How have you found the operational aspects of ClickHouse?
  • How did you manage the introduction of this new piece of infrastructure to a business that was already handling massive amounts of real-time data?
  • What are some of the downstream benefits of using ClickHouse for managing event data at Sentry?
  • For someone who is interested in using Snuba for their own purposes, how flexible is it for different domain contexts?
  • What are some of the other data challenges that you are currently facing at Sentry?
    • What is your next highest priority for evolving or rebuilding to address technical or business challenges?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data and everything else you need to know about modern data management. For even more opportunities to meet, listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today I'm interviewing Ted Kaemming and James Cunningham about the new open source search service at Sentry implemented on top of ClickHouse. So Ted, can you start by introducing yourself?
Ted Kaemming
0:01:42
Hey, my name is Ted Kaemming. I am a software engineer here at Sentry. I work on the search and storage team, which is responsible for, as the name would imply, search and storage. A lot of that happens through ClickHouse these days.
Tobias Macey
0:01:58
And James, how about yourself
James Cunningham
0:02:00
Yeah, my name is still James Cunningham. I manage the search and storage team. I also manage engineering operations. I was the first operations engineer to deploy and maintain our ClickHouse clusters. So, a good person to ask about this.
Tobias Macey
0:02:16
And so, going back to you, Ted, do you remember how you first got involved in the area of data management?
Ted Kaemming
0:02:21
It's something where — so, previous to Sentry, I worked at a company called Disqus. At that company we had a lot of reads, compared to Sentry, which is incredibly write heavy. I worked on an infrastructure team there as well, mostly dealing with shuttling data around, as you do in kind of any web application. So I don't know that I'd necessarily characterize myself as someone who has really focused on data management as a specific career path or anything like that, but it seems like on any web application, the shared state for that application is so critical to that application functioning that everybody kind of trends in that direction the more these things scale. So through my career I've just spent a lot of time interacting with databases, and you get to know them fairly well through the process, I guess.
Tobias Macey
0:03:21
Yeah, I'm familiar with that. It sort of follows you around as much as you try to run away.
Ted Kaemming
0:03:26
Yeah, for sure. Yeah. Everything eventually will talk to the database, it seems like
Tobias Macey
0:03:31
and James, how about you? Do you remember how you first got involved in the area of data management?
James Cunningham
0:03:35
I did. Before I met the wonderful people at Sentry, I used to work at a place called Urban Airship, that's currently known as Airship. We had a non-trivial amount of HBase clusters — the open source implementation of Google Bigtable — and I was privileged enough to kind of work alongside those, and I said, wow, these are kind of pretty cool. We also ran a lot of Postgres — Sentry ran on Postgres, and I actually operated Sentry at UA — then I came over here because I had direct experience. And when we kind of started moving away from being a Postgres shop, I said, aha, it's my time to kind of dig back into my brain about how that kind of storage worked. And ever since then, I've been loving this and kind of hating Postgres.
0:04:25
We still like Postgres, though.
Tobias Macey
0:04:28
Yeah, so I know that when I interviewed David Cramer on my other podcast about Sentry and some of the early days of it, he was mentioning that, at least for a while, the overall data infrastructure was running entirely on Postgres and didn't use Elasticsearch or anything — it just used some of the full text indexing in Postgres for being able to handle searching across the event data. So I'm wondering if you can just start by describing some of the stresses and pain points that you were running into with the existing infrastructure and how that led you to decide that you needed to rearchitect and build this new system.
Ted Kaemming
0:05:05
Yeah. So when I joined, which was pretty early on in Sentry's history as a company, at that point, when events came into Sentry they were basically being written to a collection of different data stores after processing. So we kind of logically split that up into four different service components, I guess you could call them. There's the time series database, which was backed by Redis at the time; primarily that was just counters bucketed by a time granularity. We also had some sketches, HyperLogLog type stuff, in there to get unique counts. So the TSDB was one. Another one was this thing called tag store. Tag store was basically just an abstraction we had around dealing with denormalized data for the tag key value pairs that come along with an event. So that was in Postgres — it started off on the same Postgres database as everything else and eventually got split off into its own database, and then we eventually had to shard those tables across a larger cluster. Then another logical component was the search backend. The search backend was really just Postgres — I don't think we ever actually had Elasticsearch running for anything, at least during my history here, in production. So the search backend was, like you mentioned, just running SQL queries over these tables. That, as you can probably imagine, got more complicated when those tables became physically separated onto different hosts. And then the fourth place that event data ended up in is an abstraction called node store. Node store at the time was written on Riak from Basho, which has since kind of been EOL'd. As Sentry is an open source product as well, all of these things had different pluggable backends. So other people who are running Sentry on their own infrastructure, you know, they could run it just using Postgres again, they could run it using Cassandra; these days we run it using Bigtable. So we had all this data split into these four different storage systems, and it was all basically derivations of the same input data. Those writes generally happened via Celery, so a kind of task abstraction written in Python — most of Sentry is Python; we're big fans of Python, and honestly we've been able to make it work for our load. But since this is all kind of going through Celery, and we were using Rabbit at the time as the message broker, we were just mutating single rows at a time, and that became a performance pain point. Some of that ended up in a situation where we were buffering these writes — so increments to these counters, things like that — through this kind of buffering system that we built in Redis, and we would apply bulk updates to those denormalized tables. So basically we just had a bunch of different data systems lying around managing different lenses of looking at the same actual input data. So it got particularly complicated.
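For context on the write buffering Ted describes, here is a minimal sketch of the pattern — accumulating counter increments in Redis and periodically flushing them as bulk updates to a denormalized Postgres table. It assumes the redis-py and psycopg2 libraries, and the key, table, and column names are illustrative rather than Sentry's actual implementation.

```python
# Hypothetical sketch of the Redis write buffering described above; names are
# illustrative, not Sentry's implementation.
import redis

r = redis.Redis(decode_responses=True)
BUFFER_KEY = "buffer:event_counts"

def buffer_increment(group_id: int, count: int = 1) -> None:
    """Cheap in-memory increment; no Postgres write happens here."""
    r.hincrby(BUFFER_KEY, str(group_id), count)

def flush_buffer(pg_conn) -> None:
    """Periodically (e.g. from a scheduled Celery task) apply buffered counts in bulk."""
    pipe = r.pipeline()
    pipe.hgetall(BUFFER_KEY)
    pipe.delete(BUFFER_KEY)
    counts, _ = pipe.execute()
    if not counts:
        return
    with pg_conn.cursor() as cur:
        for group_id, count in counts.items():
            cur.execute(
                "UPDATE groups SET times_seen = times_seen + %s WHERE id = %s",
                (int(count), int(group_id)),
            )
    pg_conn.commit()
```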
Tobias Macey
0:08:33
And were you starting to run into any user facing problems or was most of the pain just in terms of the internal operations and trying to deal with managing the data and the cognitive load on the engineering side?
Ted Kaemming
0:08:47
Um, there's definitely
0:08:51
there's both kind of that internal pain as well as the user facing pain. I think really a lot of the user facing pain was just manifestations of the internal pain. So internally, like you mentioned, there's all this operational complexity with running all these different systems, there's just tons of points of failure, and scaling with the data volume was an issue. Also the code complexity — all those things eventually become user problems, because the operational complexity of having all these single points of failure, you know, that affects the user experience when those single points of failure fail and someone can't look at their data in Sentry. The code complexity manifests itself in the fact that we're not able to iterate and build new features as quickly as we'd like. Along with that, there was a bunch of other stuff like consistency problems, because all of these systems are ingesting at different rates, or this buffering system would sometimes be backed up, and so you would see these denormalizations that hadn't been updated in potentially several minutes. In the worst cases, those kinds of things would obviously affect the user experience. When those things affect the user experience, they write into support. That means an engineer is having to investigate what's going on. All of a sudden the engineer can't make forward progress on improving these things because we're busy, you know, diagnosing and explaining this sort of idiosyncratic behavior, and overall things just kind of slow down from there. So yeah, it was both internal, I think, as well as external — just not what we would have liked to present in an idealized situation.
Tobias Macey
0:10:39
Yeah, especially given that your product is used in such a way that people want to be notified in a timely fashion and potentially need near real time capabilities because it could be affecting their customers. So definitely a lot of cascading errors that can propagate from what seems like a simple tech debt problem to something that eventually explodes into something that, like you said, grinds everything to a halt. And so that finally catalyzed you into deciding that you needed to redesign and rebuild these capabilities. And I'm curious what your design and operational criteria were for determining what needed to go into building this new platform.
James Cunningham
0:11:20
Yeah, so I'd say, as far as all the decisions that we made in order to go to this new platform, one of the biggest leaders was that we had a big push for having environments be kind of a first class filter. We had to build a new dimensionality of data across all this denormalized data, which essentially doubled the storage that we had. And then we said to ourselves, all this is great, this looks cool, environments are dope — but what happens when we want to add another dimension, and another dimension? We're just going to continue to, I guess, extrapolate across this data set and eventually end up with 100 terabytes of, you know, five different dimensions of data. So we said to ourselves that we kind of needed a flat event model that we'd be able to search across, and, you know, there were a few other pieces that we wanted. On top of that, we wanted to be able to search across these arbitrary fields, whether those are custom tags or something that we kind of promote, like releases or traces, or searching across messages — we didn't want that to take as long as it did. And some of the other parts: we have all this data stored in, you know, this tag store and all these searches that we have to go through, but we had a completely different silo for time series data that, again, had to have that dimensionality in it. If we can search across these arbitrary fields, the next thing that a customer would ask for is, hey, can I please see a pretty graph? So if we could boil down that search and that time series data into the same system, we'd be destroying two systems with one rewrite.
Ted Kaemming
0:12:54
And also, as part of that process, I mean, you kind of always have these standard checkpoints — you know, replication and durability are obviously really important for us, ease of maintenance is huge, low cost as well. So even that just kind of ruled out some of the hosted magic storage solutions, those kinds of pressures.
Tobias Macey
0:13:20
And as you were deciding how to architect this new system, can you talk through some of the initial list of possible components that you were evaluating, and what the process was for determining whether something was going to stay or go in the final selection?
James Cunningham
0:13:36
Yeah, of course. So the first thing that we kind of crossed off was no more row orientation. Postgres had served us well — we hoped that we could engineer a good solution on top of it, but ultimately we decided we probably needed a different shape of database to get these queries across. We had, like, five major options. We had document stores; we had some sort of Google proprietary blend, because we are completely on GCP; we had more generic distributed query stuff, you know, a little bit of Spark, maybe a little bit of Presto; we took a look at other distributed databases — we ran a good amount of Cassandra at my old gig, so I knew how to run that; and we also said, oh hey, we could just put data down on disks ourselves and not have to worry about this. Some of the other seriously considered things we had were columnar stores. Some of the ones that we actually kicked the tires on — we kicked the tires on Kudu, Pinot, and Druid. And ultimately we found ClickHouse as a columnar store, and we kind of just started running it; it was one of the easiest ones to kick the tires on. Some of these other columnar stores built on top of distributed file systems — it really did take a good amount of bricks to put down in order to get to your first query. And some of the things that we wanted were figuring out the operational costs of that. We wanted to be able to iterate across queries, we wanted to be able to pare down all the dependencies that the service had. You know, while we weren't afraid to run a few JVMs, or to run, you know, a little bit of HDFS, that was something that realistically I might not want to have to have an entire engineer dedicated to running. And on the antithesis of that, we could choose some of this Google proprietary blend, but how would it feel to go from having Sentry only require Redis and Postgres to now saying you can only run the new version on Google? Yeah, that's a little bit silly. So we ended up really just getting through an MVP of, I think, both Kudu and ClickHouse, and one of the biggest ones that really did kick us — and for anyone listening, go ahead and correct me if I'm wrong, but one of my memories was that one of our engineers started loading data into Kudu, and you didn't really know when it was there. It was great for being able to crunch down on your numbers, but one of our biggest things, which you did kind of hint at, is that we do need real time data — to be able to write into this data store and then to be able to read it on a consistent basis. One of the things we needed it for: we have a feature called alert rules, where you say, hey, only tell me if an event with, you know, this tag and this value got in, and only if there were maybe, like, 10 events in the last hour. And you want to be able to read that pretty quickly, so that when that 10th event comes in you're not waiting minutes until that alert shows up — and ClickHouse is able to do that. And so that kind of just got its way up to number one.
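As a rough illustration of the alert rule style read James describes — counting matching events in the last hour and firing once a threshold is crossed — here is a hedged sketch using the clickhouse-driver Python client. The events table and its nested tag columns are assumptions, not Sentry's actual schema.

```python
# Hypothetical sketch of an alert-rule style read; the schema is illustrative.
from datetime import datetime, timedelta
from clickhouse_driver import Client

client = Client("localhost")

def should_alert(project_id: int, tag_key: str, tag_value: str, threshold: int = 10) -> bool:
    since = datetime.utcnow() - timedelta(hours=1)
    rows = client.execute(
        """
        SELECT count()
        FROM events
        WHERE project_id = %(project_id)s
          AND timestamp >= %(since)s
          AND tags.value[indexOf(tags.key, %(tag_key)s)] = %(tag_value)s
        """,
        {
            "project_id": project_id,
            "since": since,
            "tag_key": tag_key,
            "tag_value": tag_value,
        },
    )
    # Fire only once the threshold is crossed; freshness of this count is why
    # consistent, low-latency reads mattered in the evaluation.
    return rows[0][0] >= threshold
```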
Ted Kaemming
0:16:39
Yeah, I think also, in general, at Sentry we try and kind of bias a little bit towards relatively simple solutions. And with ClickHouse, at least to us, based on our backgrounds, it seemed more straightforward to get running, and I think that appealed to us quite a bit as well. The documentation is pretty solid. It's also open source — you know, a lot of the others were as well, but ClickHouse has a pretty active repository. They've been very responsive when we've had questions or issues, and they're very public about their development plan. So I think a lot of these things just kind of worked out in its favor.
Tobias Macey
0:17:22
Yeah, from what I've been able to understand, it's definitely a fairly new entrant into the overall database and data storage market, but I've heard a few different stories of people using it in fairly high load environments. So I heard about the work that you're doing with Snuba; as far as I understand, Cloudflare is also using it for some of their use cases, and they definitely operate at some pretty massive scale with high data volume. So it seems like a pretty impressive system that has a lot of different capabilities, and I was pretty impressed when I had some of the folks from Altinity on the podcast a while ago to talk about their experience of working on it and working with some of their clients on getting it deployed. And I'm curious what some of the other types of systems you were able to replace with ClickHouse were — given that, as you said, you had these four different systems that you had to replicate event data to, were you able to collapse them all down into this one storage engine?
Ted Kaemming
0:18:17
Yeah. So in our code base, those four different things — the TSDB, search, tag store, and node store — all have kind of abstract service interfaces that really just sort of evolved from the fact that it's an open source project and people wanted to use these different backends for it. Three of those are now backed by the same data set in ClickHouse. So all the TSDB data comes directly out of ClickHouse — there's no pre-aggregation that happens anymore; we're just ripping over individual rows, computing those aggregates on demand, at least for now. Search — some of the data for search still lives in Postgres, but a lot of it now just comes from the event data in ClickHouse, essentially. Tag store, we've removed — how many servers were we using for tags?
James Cunningham
0:19:06
We had — oh, goodness, like 12, with something like 32 cores and maybe 200-odd gigs each. But, you know, getting into some of these other stats a little bit further down the list: we went from 52 terabytes of SSD to two terabytes, which is a good number to come down from. Yeah,
Ted Kaemming
0:19:29
so we were able to — absolutely, yeah, we were able to decommission an entire Redis cluster, cluster in quotes, and this entire Postgres cluster, with drastically less hardware. And just the fact that it all reads from the same ClickHouse cluster, and there's none of this weird replication lag between all these systems — that's a huge positive.
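A small sketch of what computing time series "on demand" from raw rows looks like, as Ted described a moment ago, instead of maintaining pre-aggregated counters. toStartOfHour is a real ClickHouse function; the table and column names are illustrative.

```python
# Hypothetical sketch of serving time series straight from raw event rows.
from datetime import datetime, timedelta
from clickhouse_driver import Client

client = Client("localhost")

def hourly_event_counts(project_id: int, hours: int = 24):
    since = datetime.utcnow() - timedelta(hours=hours)
    return client.execute(
        """
        SELECT toStartOfHour(timestamp) AS bucket, count() AS events
        FROM events
        WHERE project_id = %(project_id)s AND timestamp >= %(since)s
        GROUP BY bucket
        ORDER BY bucket
        """,
        {"project_id": project_id, "since": since},
    )
```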
Tobias Macey
0:19:52
Can you talk a bit more about the overall architecture of Snuba itself, and some of the operational characteristics and experience that you've had with ClickHouse itself, and maybe some of the early pain points and sharp edges that you ran into as you were getting used to this new system?
Ted Kaemming
0:20:12
Yeah, sure. So I guess just to give you a brief overview of the architecture, because it's something that's really not particularly fancy. Snuba is just a relatively small Flask application — at least small when you compare it with the remainder of Sentry. So it's a Flask application and it just speaks HTTP. It's in Python. It's generally stateless. Writes, as they come in, go through a Kafka topic that's published to directly from the remainder of the Sentry codebase — the Sentry codebase and the Snuba codebase are actually completely independent, at least as far as the projects' repositories go. So Sentry writes into this Kafka topic, the Snuba consumer picks the events up, does some denormalization, some data munging — you know, kind of conventional Kafka consumer stuff — and writes large batches of events to ClickHouse. We don't use the ClickHouse Kafka engine or anything particularly special for that; we just use the Kafka driver from Confluent, which is librdkafka based. And that's all in Python. Reads come in much the same way, also over HTTP — not anything particularly fancy there either. We have various optimizations that we do: kind of just a general query cache and deduplication of queries, so that we don't have large queries with long run times executing concurrently on the cluster. We do some optimizations where we move some stuff from the WHERE clause in ClickHouse SQL to a PREWHERE clause, which is basically the closest thing you get to any sort of query optimization, and we do some other query rewriting stuff based on our domain model. There are other rate limits and quality of service metrics, logging type stuff that happens in there as well. As long as that all goes well, a response is returned to the caller with something that is almost identical to what you would get if you were just interacting with the HTTP interface of ClickHouse itself. If it doesn't go well, that ends up getting logged to Sentry, and we then kind of enter the system again to go look at it. So that's kind of a brief overview — it's nothing particularly fancy.
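A minimal sketch of the write path Ted outlines — a Kafka consumer that batches messages and bulk inserts them into ClickHouse — assuming the confluent-kafka and clickhouse-driver libraries. It mirrors the shape of the pipeline rather than Snuba's actual code; the topic, table, and payload fields are assumptions.

```python
# Hypothetical sketch of a batching consumer: read from Kafka, munge, and
# bulk insert into ClickHouse. Topic, table, and payload fields are assumed.
import json
from datetime import datetime
from confluent_kafka import Consumer
from clickhouse_driver import Client

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "snuba-style-consumer",
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
clickhouse = Client("localhost")

BATCH_SIZE = 1000
batch = []

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    event = json.loads(msg.value())  # assumed payload: event_id, project_id, ISO timestamp
    # Denormalization / munging of the event into a flat row would happen here.
    batch.append((
        event["event_id"],
        event["project_id"],
        datetime.fromisoformat(event["timestamp"]),
    ))
    if len(batch) >= BATCH_SIZE:
        # One large insert per batch: ClickHouse strongly prefers big, infrequent inserts.
        clickhouse.execute(
            "INSERT INTO events (event_id, project_id, timestamp) VALUES",
            batch,
        )
        consumer.commit(asynchronous=False)  # commit offsets only after the write succeeds
        batch = []
```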
Tobias Macey
0:22:31
Yeah, sometimes simple is best, particularly when you're dealing with something that is as critical path as this.
James Cunningham
0:22:36
Yeah, for sure. So, to talk a little bit about the early engineering that you might have alluded to: one of our biggest early difficulties was that we had, you know, put a lot of eggs in the Postgres basket. So we turned this on and the queries that we had set up for a row oriented database just absolutely did not map to a columnar store, which is a crazy thing to say,
Ted Kaemming
0:23:03
it's so easy to type select star.
James Cunningham
0:23:06
So easy — spelling out the columns is harder. But, you know, there are some things that just absolutely did not cut over to this columnar store, and we kind of had to redesign how we did every query. You know, Sentry kind of had a quick application of ORDER BY some arbitrary column and then LIMIT 1000, to be able to explicitly hit a B-tree index in Postgres. And that didn't matter in ClickHouse: any sort of LIMIT just kind of truncated what rows you were returning, and if you applied an ORDER BY, that would have taken your entire data set and ordered it. Among many other things, we had a lot of select stars everywhere, like Ted said, and that is honestly one of the worst ways to operate on a columnar store, because you're just reading from every literal file. So we had to change that a little bit. Some of the other things that we kind of had — you know, we didn't have a query planner, so there was a lot of taking a query and just kind of moving pieces around. One of the things that Ted alluded to was the notion of a PREWHERE: when you have multiple columns that you want to filter on in a WHERE clause, you kind of have the ability to give ClickHouse a little bit of a heuristic and say, this is the column that we believe has the highest selectivity. You put it in a PREWHERE clause, and it will read through that column first and decide which blocks it's going to read from for the rest of them. So if you have something along the lines of an event ID, which for us is globally unique, that might have a little bit higher selectivity than environment, or, you know, release might have a little bit higher selectivity. So we were kind of working around these edges by just swapping variables around and saying, well, did that make it faster? And when we said yes, we kind of threw some high fives around.
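To make the PREWHERE rewrite James describes concrete, here is a hypothetical before-and-after of that kind of query rewriting. PREWHERE itself is real ClickHouse SQL; the schema and literal values are illustrative.

```python
# Hypothetical before/after of moving the most selective predicate into
# PREWHERE; ClickHouse scans that column first and only reads the remaining
# columns for the granules that matched. Schema and values are illustrative.
before = """
SELECT event_id, message, environment
FROM events
WHERE project_id = 42
  AND environment = 'production'
  AND event_id = 'b2f9c4de11a24b6f9d7a3c1e8f5d2a90'
"""

after = """
SELECT event_id, message, environment
FROM events
PREWHERE event_id = 'b2f9c4de11a24b6f9d7a3c1e8f5d2a90'  -- globally unique, highest selectivity
WHERE project_id = 42
  AND environment = 'production'                        -- low-selectivity filters stay in WHERE
"""
```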
Ted Kaemming
0:24:53
Yeah. There was also just the integration into some of the query patterns we have in Sentry, which was a bit of a challenge. ClickHouse is really designed to do particularly well with inserts; it does not do particularly well with updates or deletes, to the point where they aren't even really syntactically valid in the ClickHouse flavored SQL. Sentry as a whole is particularly insert heavy, but it's not insert only, and so we had to work around, basically, the fact that ClickHouse is extremely oriented towards inserts. We kind of ended up with something — actually, James mentioned he worked on Cassandra in a past life, and I did as well — we ended up with an architecture that is fairly similar to Cassandra tombstones for how we delete data, where we kind of implement our own last write wins semantics on top of the ReplacingMergeTree in ClickHouse. There's a long blog post about how we do that as part of this field guide series that we've been working on, where we go into some of these weird things that we do with ClickHouse. Similarly, for things like those alerts that James mentioned earlier, we basically require sequential consistency to be able to execute those queries effectively. That becomes a problem when you're dealing with multi master replication, like ClickHouse does. So we ended up having to do some kind of dodgy load balancing stuff, where we don't have a literal primary for all writes, but we kind of have this ad hoc primary that all writes go to as long as it is up, and some subset of queries are only allowed to evaluate on that primary. It's not guaranteed sequential consistency in a true distributed systems sense, but it's good enough for what we need. It's also particularly complicated because the system doing the querying is not Snuba — it lives in the Sentry codebase. And so we basically need to be able to notify the Sentry codebase that these rows have been written to ClickHouse from Snuba as part of this. So we ended up having to engineer this solution where we have a commit log coming out of the Snuba Kafka consumer; the Sentry application is actually subscribed to that commit log Kafka topic and gates its own progress based on the progress of the Snuba writer. There's also a blog post on the Sentry blog that goes into more depth about how we specifically implemented that, as part of this field guide series. But yeah, things like that — we knew things like the mutations were going to be something that we had to manage, we just didn't particularly have a strategy around it, and the sequential consistency stuff probably caught us a little bit more by surprise than it should have, as we were doing some of our integration testing in production and noticed that some of the queries weren't exactly returning what we thought they would have. So that was something we also had to solve.
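Here is a rough sketch of the last-write-wins and tombstone pattern Ted describes, built on a ReplacingMergeTree table: deletes become inserts of a newer row with a tombstone flag, and readers exclude tombstones at query time. This illustrates the general technique with an assumed schema, not Sentry's actual tables or migration.

```python
# Rough sketch of last-write-wins deletes on ReplacingMergeTree; the schema is
# an assumption for illustration, not Sentry's actual table.
from clickhouse_driver import Client

client = Client("localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   String,
        project_id UInt64,
        timestamp  DateTime,
        message    String,
        deleted    UInt8 DEFAULT 0,  -- tombstone flag instead of a real DELETE
        version    UInt64            -- newest version wins when parts merge
    )
    ENGINE = ReplacingMergeTree(version)
    PARTITION BY toMonday(timestamp)
    ORDER BY (project_id, toStartOfDay(timestamp), event_id)
""")

def tombstone_event(event_id: str, project_id: int, timestamp, version: int) -> None:
    """'Delete' by inserting a newer row for the same sorting key with the tombstone flag set."""
    client.execute(
        "INSERT INTO events (event_id, project_id, timestamp, message, deleted, version) VALUES",
        [(event_id, project_id, timestamp, "", 1, version)],
    )

# Readers collapse duplicates and exclude tombstones at query time, e.g.:
#   SELECT ... FROM events FINAL WHERE deleted = 0
```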
Tobias Macey
0:28:23
And you mentioned that one of the reasons that you ended up going forward with ClickHouse rather than any of the other systems is that it was pretty easy to get up and running with and seemed fairly simple operationally. So I'm curious what you have found to be the case now that you're actually using it in production and putting it under heavier load in a clustered environment, and any sort of useful lessons that you've learned in the process that you think anybody else who is evaluating ClickHouse should know about.
James Cunningham
0:28:51
Absolutely. So this is this is my time to shine.
0:28:55
So one of the things that I kind of had to make a concession on is that I've never worked with a database that could possibly be bound by CPU. It's always been, you know, make sure that your disks are as fast as possible, you know that the data is on the
0:29:11
disks, you got to read from the disk.
0:29:13
And the reason that it very well could be bound by CPU is that, you know, I've seen compression in the past, and I didn't really understand what compression could actually give you until we turned ClickHouse on. Compression realistically brings our entire data set — we kind of alluded to it earlier — from 52 terabytes down to two terabytes, and about 800 gigs of that is surprisingly uncompressible, because it's unique, you know, 32 character strings. If anyone can tell me an algorithm that helps compress that, I think they made a TV series around that or something. But for the rest of the data, it's so well compressed that being able to actually compute across it does so well — we run a small number of servers to supply what is a large data set. If there was any advice to anyone out there: start by sharding. Never shard by two, because two is a cursed number in terms of distributed systems. But we really just started with, you know, three shards, three replicas, and with that blessed number of nine, we haven't gone up yet. We kind of have a high watermark of a terabyte per machine — Google gives you a certain amount of read and write off that disk based on how much storage you have, and we've kind of unlocked a certain level at one terabyte per machine, if anyone else is somehow running ClickHouse on GCP, that is. You know, we're about to apply our fourth shard. But realistically, some of the other things that are operationally sound: as much as we'd all love to, I guess, hammer on or praise XML, it is very explicit about what you have to write in. It's configured via XML; there's no runtime configuration that you're applying, there's no, you know, magic distribution of writing into an options store and watching that cascade into a cluster,
0:31:23
auto scaling.
0:31:24
Yeah, I'm not, you know, crunching in any Kubernetes pods or anything like that. One of the things I'd be remiss not to say is that you did mention Cloudflare is running ClickHouse — and shout out to Cloudflare, they run real hardware, and I'll never do that again in my life. But one of the things that they alluded to in one of their kick-ass blogs about ClickHouse is that it replicates so fast that they found it more performant, when a disk in a RAID 10 dies, to just wipe all the data, rebuild the disk essentially empty, and have ClickHouse refill it itself. It is crazy fast in terms of re-replication; since all of that is compressed, it really just sends that across the wire. Some of the other stuff that we found completely great operationally is that since it is CPU bound, it's mostly bound by reads. When you are a write heavy company and you're now bound by reads in terms of cost of goods sold, I can throw around a million high fives about that. It's great to just watch people log in and actually look at their data and watch our graphs tick up, instead of just saying, well, you know, we spent a lot of money on this, and people are only reading, you know, 1% of their data. One other piece that I'd be remiss not to mention: some niceties about ClickHouse that kind of separate it from a few of the databases I've worked with are the ability to set some very quick — either throttling or kind of turbo-ing — settings on the client side. So some of the things that we might do: if we know that a query is going to be expensive, we could, you know, sacrifice a little bit of resources and kind of get it back fast. There is just a literal setting that is max threads, where I say, you know what, I really want this to run faster, set max threads to eight instead of four. And it does exactly what it says it does — it'll run twice as fast if you give it twice as many threads. So there are pretty easy things that we kind of work with operationally. I think that, as far as a database goes, one of the hardest things to do is just to read all of the settings to figure out what they do. But after you get versed in it, you'll understand what applying this setting might do, or at what threshold you might set something, and it's not very magical — some of these settings realistically are for very explicit types of queries that you'd only supply from the client side if you really needed them. So I wouldn't go so far as to call it simple — the configuration is almost, like, dumb. Straightforward, very straightforward. Yeah.
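The per-query "turbo" knob James mentions is just a ClickHouse setting supplied from the client. A small sketch with clickhouse-driver, where the query and table are illustrative:

```python
# max_threads is a real ClickHouse setting that can be passed per query from
# the client; the query and table here are illustrative.
from clickhouse_driver import Client

client = Client("localhost")

rows = client.execute(
    "SELECT project_id, count() FROM events GROUP BY project_id",
    settings={"max_threads": 8},  # let this known-expensive query use more cores
)
```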
Tobias Macey
0:33:58
And my understanding is that ClickHouse itself is written in C++, so it runs very close to the metal on the instances that it's running on, but for clustering it relies on ZooKeeper, which, as you mentioned, is a JVM process. And since you're already running Kafka, I imagine that you already have a ZooKeeper cluster for that. So I'm curious if you've been able to just use that same ZooKeeper cluster in sort of a multi-tenant fashion to service both Kafka and ClickHouse, or if you built a dedicated ZooKeeper cluster or a ZooKeeper node for the ClickHouse databases on their own.
James Cunningham
0:34:32
So that question has kind of a two part answer. Realistically, for Kafka — I have a long winded story about going from Kafka pre-0.8 to post-0.8, or whenever it was they introduced storing consumer offsets inside of Kafka itself. After that, ZooKeeper is really only a dependency for, like, hello, you are consumer one, you have control over these partitions — and, you know, leader election and all that strong consistency stuff. But realistically, if you had some IoT toasters, you could throw down a copy of ZooKeeper on those. So we bundle those alongside — I guess physically co-located with — the Kafkas that are running; we have ZooKeeper there. And one of the other things I kind of learned from managing ZooKeeper at my old gig is that, if you have the opportunity, you really don't want to mix things in terms of multi-tenancy for ZooKeeper and a strongly consistent system — if for some reason that's breaking down, you don't want it to push that load onto someone else. So ZooKeeper for ClickHouse is dependent on, you know, who has received which block and where to get that from; it's kind of used for the way that replication is designed. It's also kind of helpful for, I guess, a setting that they have, which is select sequential consistency — which actually didn't work out too well for us, but it is one of those ideas that if you want to read this block, you can block on whether or not all the replicas have gotten it. So they kind of use ZooKeeper as that consistent state for which blocks live where — where did this write go to, and if I want to get this write, which server should I reach out to? And, you know, in terms of gossip, in terms of adding more replicas, realistically not that much pressure is applied to ZooKeeper, but it is physically and logically segmented, just because that's kind of one of those tin foil hats that I wear around here.
Tobias Macey
0:36:29
Yeah, working in operations, it's definitely always the better safe than sorry approach, because as soon as you're not safe, you're definitely going to be sorry.
James Cunningham
0:36:38
Absolutely. There's a question that I don't really like answering, which is, oh hey, why is Kafka backing up? Oh, well, that's because ClickHouse is doing something wrong. And someone turns around and asks me, why in God's name did these two systems, which don't know about each other, negatively impact each
Tobias Macey
0:36:55
other? Absolutely shared nothing.
Ted Kaemming
0:36:59
Yes.
0:37:00
Sort of our disaster scenario is, like, ClickHouse goes completely offline and the ZooKeeper cluster gets trashed. In that case, at least our Kafka cluster is still running and just queuing up writes that, hopefully, when we recover, we can write through. God forbid that happens, but at least if it does, there's some runway.
0:37:23
There's some amount of insulation from that.
Tobias Macey
0:37:26
One of the other interesting pieces to discuss is, because of the fact that you already had a running system, you already had this large volume of data — as you mentioned a couple of times, the 52 terabytes worth of events. What was your process for introducing ClickHouse and Snuba into that architecture and replicating the existing data into it, while still maintaining uptime and still servicing end user requests, so that they didn't notice that cutover and the introduction of this new system except in a positive sense?
Ted Kaemming
0:37:55
Yeah, that's a good question. So I think this entire process, from start to being at 100% live, probably took roughly a year, give or take.
0:38:13
And during that process, I don't recall that we had any outages related to this, or any downtime — not even, like, planned maintenance related to this — which is kind of wild, because, like we talked about earlier, people use Sentry to monitor their applications. If the monitor is down, you know, that doesn't reflect particularly well on our quality of service, and it also leaves all of our customers in the dark. So what we ended up doing is we were both dual writing as well as dual reading to those legacy systems that we discussed earlier — Redis, Postgres, all these other things. We were running with those, reading from those, while we were spinning up the new Snuba slash ClickHouse cluster. So we were doing that for probably months towards the tail end of this process. With that, what we were doing — and I don't think this is a particularly novel approach, but it does seem like it's something that not many people talk about, at least from what I've seen; maybe that's just people considering this aspect less interesting than talking about the shiny new tech — but earlier we talked about having these sort of abstract service interfaces, which came from just the flexibility of it being an open source project. We built this thing called the service delegator, which basically delegated requests to these services in the background. So we could run a request against the TSDB — we would run against the legacy system as the primary, and in another thread we'd also execute that same request against Snuba. It would wait for both of those requests to complete, then it would shovel a bunch of data off that we would then do some analysis on and write back to a different ClickHouse cluster, where we do analysis on it to figure out essentially how similar the results from both of these systems were. So this allowed us to test different schemas on the ClickHouse side, and we tuned a bunch of settings. There's basically no risk with that, because worst case scenario, the ClickHouse cluster would back up on load; we would just turn the switch down, let it drain, change that configuration setting again, and spin the reads up in the background and figure out if that solved the issue. So we went through a bunch of different iterations on the schema, on all these configuration parameters, to figure out basically what was the best fit for our needs. This also allowed us to get a lot of operational experience in production without the fear of any sort of user facing incident or outage. There were a few hiccups — we definitely learned some stuff operationally. Had we done this the, like, light switch way, where we just go, well, we think we tested this really well, the test suite passes, let's flip it and see what happens — I think we probably would have had some fairly embarrassing results coming out of that.
So yeah, there was a lot of just kind of plotting of these similarity scores between these different backends and identifying these inconsistencies. Like, the sequential reads issue that I mentioned earlier showed up as part of this process, where we realized there was this sort of interesting distribution when we were looking at the results from our time series database. Basically, the output that we were seeing when we plotted these similarity scores, which were normalized from basically zero to one — we'd see this bimodal distribution where roughly half of the results were exactly the same, and roughly half were not at all the same. And it took some digging to figure that out and realize that it was because this automated system was making a request as soon as it thought that the request had been persisted — but it had been persisted to the legacy systems, and it hadn't actually made it all the way through the Kafka pipeline, so to speak, into ClickHouse. And so it was issuing requests against ClickHouse that either weren't on the server it was requesting them from, or that just hadn't even made it all the way through the processing pipeline yet. So that's also something that we wrote a particularly long winded blog post about as part of this field guide series, and it kind of walks through that process in a little bit more detail. But basically, by the end of it, we had been running for a couple of weeks, I think, at 100%, both the old system as well as the new system. We had all this consistency monitoring in place, we hadn't seen anything particularly aberrant, and we
0:43:00
just basically, you know, cut over from the secondary to the primary, and some queries got a little bit faster. But otherwise, it was essentially a non event at that point, which is probably the most relaxing launch that I've had of, like, anything in my career, I think.
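A hypothetical sketch of the service delegator pattern Ted describes: serve the result from the legacy backend, run the same request against the new backend in a background thread, and record a similarity score for later analysis. The backend objects and the scoring are stand-ins for the real comparison logic.

```python
# Hypothetical sketch of a dual-read delegator; backends and scoring are
# stand-ins for the real comparison logic.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=8)

def _compare_and_record(primary_result, secondary_future, record) -> None:
    try:
        secondary_result = secondary_future.result(timeout=10)
        # The real similarity scoring was fuzzier than strict equality.
        score = 1.0 if primary_result == secondary_result else 0.0
    except Exception:
        score = 0.0
    record(score)  # e.g. persist the score somewhere for offline analysis

def delegated_query(legacy_backend, new_backend, request, record):
    """Serve from the legacy backend; shadow the same request to the new backend."""
    secondary_future = executor.submit(new_backend.query, request)
    primary_result = legacy_backend.query(request)  # this is what the caller sees
    executor.submit(_compare_and_record, primary_result, secondary_future, record)
    return primary_result
```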
Tobias Macey
0:43:18
Yeah, that's definitely always a good experience.
Ted Kaemming
0:43:21
Yeah, it was very low stress, which doesn't always happen in, kind of, infrastructure engineering land.
Tobias Macey
0:43:30
And now that you have cut over to using Snuba and ClickHouse, and you're storing all of your event data in this new storage layer, what have you found to be some of the downstream effects of being able to have this greater reliability and consistency? And what are some of the other consuming systems that have started to leverage ClickHouse and the data that you've stored there for different purposes that were either previously impractical or impossible?
James Cunningham
0:43:59
Yeah, so basically we set out to replace these systems, but, you know, that's the baseline, right? Now that we've moved into the new house, can we start building some shiny things on top, right? So all of the products that we launch — again, shout out to our blog — pretty easy now. Any graph that you might see, any search, any filter, any sort: that's all powered by Snuba and ClickHouse. So all the features that we're developing for greater data visibility, all of those are powered by this new search system, and they could not have existed previously. Some of the other, more infrastructure benefits that we've kind of seen: deleting expired event data from Postgres was a pretty easy query on paper, where you just say, delete where the timestamp is outside of retention. And that's great when you don't have to do any bulk deletions, you might be able to keep it simple; but if we're deleting individual events one by one, we are literally doubling the amount of commits, the amount of transactions, the amount of WAL that we're writing to this Postgres machine, just to be able to not run out of space, and to make sure that we actually delete our customers' data when we said we would. Now, in ClickHouse, since it's column oriented storage, we, I guess, physically segment all these files across weeks, and once one of these segments rolls out of retention, we just delete the folder. Running a DROP PARTITION takes seconds — it's just an unlink — and it's one of the easiest things we've ever done. It is a great thing to say, hey, what happened to the data? It got deleted, and nobody noticed — that's fine. One of the pieces of resilience is that whenever we wanted to upgrade some of these other machines, whether that be, you know, Redis or Postgres or even Riak, the upgrade processes took a very long amount of time for operations. We would all get into the same room, we'd have task plans, we would be checking off these boxes — shuffling these things over here, turning this off, turning that on, crossing your fingers that PgBouncer is good — all right, we're good. But with ClickHouse, when we have these three by three shards, realistically all we're doing is just saying, all right, bump the version, restart it, and if it looks cool on this machine, push it to the other eight. There have been times where we've rolled back because of, you know, maybe various settings that we missed setting, maybe something that wasn't baked into a default that realistically should have been. I think in my time of upgrading, I've probably upgraded about 12 times to whatever releases they've cut, and I think maybe three or four of those have been rolled back. But it's really just, whoops, that didn't work, and nobody noticed — well, hopefully nobody noticed. You know, I kind of save that for maybe a Saturday or Sunday, because our traffic's pretty cyclical. So if you're using Sentry on the weekends, I don't know, go back to watching TV. But realistically, in terms of our downstream benefits, like
0:46:55
it's the compression it's
0:46:57
it's all the ease of operation. Realistically, it's just something that we don't babysit that much anymore.
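A hedged sketch of the retention scheme James describes, where data is partitioned by week and expiring old data is a cheap DROP PARTITION rather than row-by-row deletes. The table name and weekly partition layout are assumptions.

```python
# Hedged sketch of week-based retention via DROP PARTITION; assumes a table
# partitioned by toMonday(timestamp), so partitions show up as Monday dates.
from datetime import datetime, timedelta
from clickhouse_driver import Client

client = Client("localhost")
RETENTION_DAYS = 90

def drop_expired_partitions() -> None:
    cutoff = datetime.utcnow().date() - timedelta(days=RETENTION_DAYS)
    partitions = client.execute(
        "SELECT DISTINCT partition FROM system.parts "
        "WHERE database = currentDatabase() AND table = 'events' AND active"
    )
    for (partition,) in partitions:
        week = datetime.strptime(partition.strip("'"), "%Y-%m-%d").date()
        if week < cutoff:
            # Dropping the whole partition is effectively an unlink, not row-by-row deletes.
            client.execute(f"ALTER TABLE events DROP PARTITION '{week.isoformat()}'")
```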
Ted Kaemming
0:47:03
Yeah, I think, anecdotally, the support load of "why is this system doing something weird" has gone down. The data stores don't get out of sync in the way they could previously, and data comes through in a more timely fashion. So there's no, hey, why does this count not match this other count — all the counts now come from the same spot. It's also something where I think ClickHouse has kind of snuck into some different places of our internal systems as well. All the requests that come through Sentry, both from the load balancer as well as the results of essentially every request that hits our application servers, also go into ClickHouse, which is something we had always talked about. You know, when someone writes in and says, hey, I have this UUID, what happened — we can actually give them an answer without just grepping through tons of log files or just saying, well, it looks like there's a lot of rate limited activity, so it probably happened over there. We can actually give them a conclusive answer. So it's definitely something that we're getting a lot of mileage out of, and we're finding new uses for it routinely.
Tobias Macey
0:48:20
And the Snuba system itself — it is an open source project. Is it something that is potentially useful outside of the context of Sentry, where somebody might be able to adapt it to different search implementations? Or is it something that's fairly closely linked to the way that the Sentry application is using it, where it wouldn't really be worth spending the time on trying to replicate the functionality for a different backend or for a different consumer?
Ted Kaemming
0:48:48
Yeah, that's a really good question. So, as of literally today, it historically has been extremely coupled to our domain model. It is less coupled to our domain model today, but it's still pretty coupled to our domain model. Unfortunately, most of the work that has gone into it has been motivated by our product needs, so a lot of the early internals are pretty closely coupled to those original goals — really just as a function of trying to get it out as quickly as possible so we could stop turning on Postgres servers and dealing with Postgres query planner related outages or transaction wraparound, those kinds of just, like, scary nightmare things. So yeah, it was originally pretty coupled to the original data model. That data model, though, is evolving. We're adding new data sets in ClickHouse — like I mentioned, those event outcomes that get logged into it. So internally there's a lot of that refactoring happening to reduce coupling. I think sort of the logical end state of that is something that is more abstract, but as it is right now, to add a new data set — a data set is sort of our analogous concept to a table — you essentially have to fork the project; all the DDL happens in code. We'd love for that to be more declarative, whether that happens to be configuration or some other means where you're not getting in there writing Python and literally editing the project. That would be, I think, where we'd like to see it go. Also, right now the code is pretty stable, but documentation is definitely a work in progress. So that's something — I think for anyone who wanted to experiment with it, we need to invest more time in making it easier to do that. The code itself, though, is obviously on GitHub; anybody is free to browse around and take a look at it, and we do welcome contributions, though right now the barrier to entry is definitely higher than we would like it to be in the long term.
Tobias Macey
0:51:00
Are there any other upcoming new projects or any other major data challenges that you're facing at Sentry that are either causing enough pain that you need to do a major refactor, or anything that is forward looking that you're able to spend time and focus on now that you've freed yourself from the burden of trying to maintain these multiple different systems and their lagging consistency?
James Cunningham
0:51:24
Yeah, solid question. So, we really did just cut over — the thing that I alluded to is what we call outcomes. All the data that is in this main ClickHouse cluster, those are all accepted events, but we have a certain amount of other outcomes that do happen: we have events that might be filtered, events that might be dropped due to a configured rate limit, or because someone has gone over their project quota. We have invalid events as well. We have a pretty substantial abuse tier from people that just keep sending us events even though we told them to stop. And all of those counts used to go into kind of this TSDB model, with all of the other dimensions that we had. And honestly, I think just this Tuesday, we finally stopped writing about 300,000 increments a second into that data set in Redis, and it's all being written into Snuba at this point. Moving forward — yes, we can't give away the full product roadmap, but we've had a lot of, like, draws from the idea tank, and we say, what happens if Sentry starts storing more than errors and crashes? And all the ClickHouse work is kind of centered around that at this point. So we're doing a good amount of internal testing — if anybody wants to read some source code changes, they could probably learn a little bit more about that, but I guess good luck on that reading. We do okay at documenting our pull requests and what they're about, but there's a lot of chatter on internal lines that someone might not have a lot of context on, but
Ted Kaemming
0:52:58
But you can read between the lines and figure out where things are going. We're not particularly hiding anything, even though we might be a little bit cagey about it. One of the other things that's sort of a challenge moving forward, beyond just new types of data: I mentioned the search backend before and how it was basically querying a bunch of different Postgres tables. Those tables eventually started to live in different databases, and joining across those databases is hard. That's still basically a problem that exists today, except now, rather than joining between a Postgres table on one host and a Postgres table on another host, we're joining between a Postgres table on one host and ClickHouse, a completely different cluster. So one of the things that we've been experimenting with is implementing a change data capture pipeline, essentially replicating our data from Postgres via the logical decoding interfaces and writing it through Kafka into ClickHouse, so that ClickHouse would actually act as a read replica for this data that exists in Postgres. We haven't really stress tested any sort of complex join clauses yet, so we're going to see what happens with that. But ideally we can have these cross-database joins effectively not be cross-database and have that data be co-located. The compression works so effectively, and the subset of data that we have in Postgres that we need to join on is so small, that we can probably have full replicas of the Postgres tables that we use in search on literally every ClickHouse node. So it'd be a local join on all these servers, which would be wildly better than passing around large sets of IDs, like we have to do in some cases now. That's definitely something that's on the roadmap that we're looking to figure out, and, in accordance with our general philosophy, that's going to be available on GitHub for people to experiment with as well.
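To illustrate the change data capture idea Ted describes, here is a minimal sketch, assuming a Kafka topic that carries changes decoded from the Postgres write-ahead log and a ClickHouse table acting as the replica. The topic name, the simplified JSON message shape, and the groups_replica table are invented for this example; real logical decoding output (for example from wal2json or Debezium) is richer than this, and this is not Sentry's actual pipeline.

```python
import json
from confluent_kafka import Consumer   # pip install confluent-kafka
from clickhouse_driver import Client   # pip install clickhouse-driver

clickhouse = Client(host="localhost")
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-clickhouse-writer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["postgres.public.groups"])  # hypothetical CDC topic

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Assumed simplified shape: {"op": "insert", "table": "groups", "row": {"id": 1, "status": 0}}
    change = json.loads(msg.value())
    if change.get("op") == "insert" and change.get("table") == "groups":
        row = change["row"]
        batch.append((row["id"], row["status"]))
    if len(batch) >= 1000:  # write in batches; many tiny inserts are expensive in ClickHouse
        clickhouse.execute("INSERT INTO groups_replica (id, status) VALUES", batch)
        batch.clear()
```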
Tobias Macey
0:55:03
Are there any other aspects of your work on Snuba and using ClickHouse, and just the broader context of search and storage at Sentry, that we didn't discuss yet that you'd like to cover before we close out the show?
James Cunningham
0:55:15
Shameless plug: we're hiring.
0:55:19
If anything sounded interesting, like, find us on the internet. That's about it.
Ted Kaemming
0:55:25
I guess one thing that we didn't really talk about with, like, future work: right now, as a product, Sentry basically gives you access to a lot of data, but there's no opinionated insight that comes along with that. So there's definitely value in opening Sentry, looking at your issue stream, and seeing that a thing is particularly bad relative to history, or whatever, but that doesn't give you any sort of qualitative information about, like, if this number goes up, does that mean this thing is bad? There's basically a whole lot of context that's missing there. So one of the things that we've been talking about is trying to supply more of that context as well. At least as far as the search and storage team goes, beyond just the rote acts of literally searching and storing, I think the ability to get more insight from the data is something that's on our roadmap that we're going to be tackling, because now the data is where it needs to be to be able to do more interesting stuff with it.
James Cunningham
0:56:38
I think an analogy that we kind of have with Sentry right now is: if you were to walk into an American diner and get, like, a ten-page menu, and the waiter walks up and says, "Hey, how are you doing?" and you say, "Oh, what's good here?" and they tell you everything, everything's good, like,
0:56:55
ah, sure.
0:56:57
So hopefully we can stop just storing and maybe start suggesting in the future. All right?
Tobias Macey
0:57:04
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I'll start with you, James.
James Cunningham
0:57:23
It is very difficult to scale up from someone's individual laptop, kicking the tires on a database, to what very well could be a production shape. There are literal teams at a company that might be dedicated just to having the ability to assess the impact of a query that someone is writing. How can we go from a laptop to saying, "that's bad" or "that's blessed"? So right now, I think one of the largest gaps in terms of tooling is being able to assess impact when going from, you know, a single SSD to what might be a data warehouse.
Tobias Macey
0:57:57
All right. Do you have an answer, Ted?
Ted Kaemming
0:57:59
I think, kind of building on James's answer to some extent: the ability for us to have a good understanding of, when this particular query is executed, how many resources it is going to take. That historically was something that we had a lot of issues with in Postgres land, because the query planner would, you know, do some predictive things. Ninety-nine point some number of nines percent of the time it would do the thing that you expected and wanted it to do; the other fraction of a percent it wouldn't. And with Postgres, with EXPLAIN, there's a lot of insight into why it chose the particular plan it did, but there was no general model for being able to say, "Hey, I think this query will cost too much, don't run it." I know that there's been some amount of research around that; I have not directly seen any of it. I think having that sort of cost estimation, and being able to do more effective rate limiting based not just on the quantity of queries but on the cost of queries, would be something that would be interesting to see more of in these, I guess I would say, consumer-grade databases. With ClickHouse, we have extremely predictable costs, because it's basically just scanning: everything is going to be scanning something, and the cost is particularly easy to estimate from there. But as things get more complicated, there's going to be more variance in all of these runtimes, basically, and it'd be nice if that were something that was a little bit easier to account for in a way other than just the volume of queries.
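As a toy illustration of the "rate limit on query cost, not query count" idea, here is a small sketch with an invented cost model: estimated rows scanned is assumed to be the queried time range times an assumed ingest rate, and a sliding-window budget rejects queries once recent estimated cost gets too high. All names and numbers here are made up for illustration.

```python
import time
from collections import deque

ROWS_PER_SECOND = 50_000  # assumed average ingest rate for the table

def estimated_cost(time_range_seconds: int, projects: int) -> float:
    """Crude cost estimate: how many rows a full scan of the window would touch."""
    return time_range_seconds * ROWS_PER_SECOND * max(projects, 1)

class CostBudget:
    """Sliding-window budget: reject queries once recent estimated cost is too high."""
    def __init__(self, max_cost_per_minute: float):
        self.max_cost = max_cost_per_minute
        self.window = deque()  # (timestamp, cost) pairs from the last minute

    def allow(self, cost: float) -> bool:
        now = time.monotonic()
        # Drop entries older than 60 seconds.
        while self.window and now - self.window[0][0] > 60:
            self.window.popleft()
        spent = sum(c for _, c in self.window)
        if spent + cost > self.max_cost:
            return False
        self.window.append((now, cost))
        return True

budget = CostBudget(max_cost_per_minute=5e9)
cost = estimated_cost(time_range_seconds=86_400, projects=1)
print("run query" if budget.allow(cost) else "reject: too expensive")
```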
Tobias Macey
0:59:58
All right, well, thank you both for taking the time to join me and share your experience of building and deploying Snuba and of working with ClickHouse. It's definitely great to get some of that insider experience and discuss some of the pains and successes that you've gone through. So thank you for all of your time and your efforts on that front, and I hope you enjoy the rest of your day.
James Cunningham
1:00:21
Thank you
Tobias Macey
1:00:28
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.