Building A Real Time Event Data Warehouse For Sentry - Episode 108

Summary

The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on ClickHouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating ClickHouse.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Ted Kaemming and James Cunningham about Snuba, the new open source search service at Sentry implemented on top of ClickHouse

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the internal and user-facing issues that you were facing at Sentry with the existing search capabilities?
    • What did the previous system look like?
  • What was your design criteria for building a new platform?
    • What was your initial list of possible system components and what was your evaluation process that resulted in your selection of ClickHouse?
  • Can you describe the system architecture of Snuba and some of the ways that it differs from your initial ideas of how it would work?
    • What have been some of the sharp edges of ClickHouse that you have had to engineer around?
    • How have you found the operational aspects of ClickHouse?
  • How did you manage the introduction of this new piece of infrastructure to a business that was already handling massive amounts of real-time data?
  • What are some of the downstream benefits of using ClickHouse for managing event data at Sentry?
  • For someone who is interested in using Snuba for their own purposes, how flexible is it for different domain contexts?
  • What are some of the other data challenges that you are currently facing at Sentry?
    • What is your next highest priority for evolving or rebuilding to address technical or business challenges?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Click here to read the raw transcript...
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data and everything else you need to know about modern data management. For even more opportunities to meet, listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today I'm interviewing Ted Kaemming and James Cunningham about the new open source search service at Sentry implemented on top of ClickHouse. So Ted, can you start by introducing yourself?
Ted Kaemming
0:01:42
Hey, my name is Ted Kaemming. I am a software engineer here at Sentry. I work on the search and storage team, which is responsible for, as the name would imply, search and storage. A lot of that happens through ClickHouse these days.
Tobias Macey
0:01:58
And James, how about yourself
James Cunningham
0:02:00
Yeah, my name is still James Cunningham. I manage the search and storage team. I also manage engineering operations. I was the first operations engineer to deploy and maintain our ClickHouse clusters. So, a good person to ask about this.
Tobias Macey
0:02:16
And so, going back to you, Ted, do you remember how you first got involved in the area of data management?
Ted Kaemming
0:02:21
It's something where — so, previous to Sentry, I worked at a company called Disqus. At that company we had a lot of reads, compared to Sentry, which is incredibly write heavy. I worked on an infrastructure team there as well, mostly dealing with shuttling data around, as you do in kind of any web application. So I don't know that I'd necessarily characterize myself as someone who has really focused on data management as a specific career path or anything like that, but it seems like on any web application, the shared state for that application is so critical to that application functioning that everybody kind of trends in that direction the more these things scale. So through my career I've just spent a lot of time interacting with databases, and you get to know them fairly well through the process, I guess.
Tobias Macey
0:03:21
Yeah, I'm familiar with that. It sort of follows you around as much as you try to run away.
Ted Kaemming
0:03:26
Yeah, for sure. Yeah. Everything eventually will talk to the database, it seems like
Tobias Macey
0:03:31
and James, how about you? Do you remember how you first got involved in the area of data management?
James Cunningham
0:03:35
I did. Before I met the wonderful people at Sentry, I used to work at a place called Urban Airship, that's currently known as Airship. We had a non-trivial amount of HBase clusters — the open source implementation of Google Bigtable — and I was privileged enough to kind of work alongside those, and I said, wow, these are kind of pretty cool. We also ran a lot of Postgres — Sentry ran on Postgres, and I actually operated Sentry at UA — then I came over here because I had direct experience. And when we kind of started moving away from being a Postgres shop, I said, aha, it's my time to kind of dig back into my brain about how that kind of storage worked. And ever since then, I've been loving this and kind of hating Postgres.
0:04:25
We still like Postgres, though.
Tobias Macey
0:04:28
Yeah, so I know that when I interviewed David Cramer on my other podcast about Sentry and some of the early days of it, he was mentioning that, at least for a while, the overall data infrastructure was running entirely on Postgres and didn't use Elasticsearch or anything — it just used some of the full text indexing in Postgres for being able to handle searching across the event data. So I'm wondering if you can just start by describing some of the stresses and pain points that you were running into with the existing infrastructure and how that led you to decide that you needed to rearchitect and build this new system.
Ted Kaemming
0:05:05
Yeah. So when I joined, which was pretty early on in Sentry's history as a company, at that point, when events came into Sentry they were basically being written to a collection of different data stores after processing. So we kind of logically split that up into four different service components, I guess you could call them. There's the time series database, which was backed by Redis at the time; primarily that was just counters bucketed by a time granularity. We also had some sketches, HyperLogLog type stuff, in there to get unique counts. So the TSDB was one. Another one was this thing called tag store. Tag store was basically just an abstraction we had around dealing with denormalized data for the tag key value pairs that come along with an event. So that was in Postgres — it started off on the same Postgres database as everything else and eventually got split off into its own database, and then we eventually had to shard those tables across a larger cluster. Then another logical component was the search backend. The search backend was really just Postgres — I don't think we ever actually had Elasticsearch running for anything, at least during my history here, in production. So the search backend was, like you mentioned, just running SQL queries over these tables. That, as you can probably imagine, got more complicated when those tables became physically separated onto different hosts. And then the fourth place that event data ended up in is an abstraction called node store. Node store at the time was written on Riak from Basho, which has since kind of been EOL'd. As Sentry is an open source product as well, all of these things had different pluggable backends. So other people who are running Sentry on their own infrastructure, you know, they could run it just using Postgres again, they could run it using Cassandra; these days we run it using Bigtable. So we had all this data split into these four different storage systems, and it was all basically derivations of the same input data. Those writes generally happened via Celery, so a kind of task abstraction written in Python — most of Sentry is Python; we're big fans of Python, and honestly we've been able to make it work for our load. But since this is all kind of going through Celery, and we were using Rabbit at the time as the message broker, we were just mutating single rows at a time, and that became a performance pain point. Some of that ended up in a situation where we were buffering these writes — so increments to these counters, things like that — through this kind of buffering system that we built in Redis, and we would apply bulk updates to those denormalized tables. So basically we just had a bunch of different data systems lying around managing different lenses of looking at the same actual input data. So it got particularly complicated.
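For context on the write buffering Ted describes, here is a minimal sketch of the pattern — accumulating counter increments in Redis and periodically flushing them as bulk updates to a denormalized Postgres table. It assumes the redis-py and psycopg2 libraries, and the key, table, and column names are illustrative rather than Sentry's actual implementation.

```python
# Hypothetical sketch of the Redis write buffering described above; names are
# illustrative, not Sentry's implementation.
import redis

r = redis.Redis(decode_responses=True)
BUFFER_KEY = "buffer:event_counts"

def buffer_increment(group_id: int, count: int = 1) -> None:
    """Cheap in-memory increment; no Postgres write happens here."""
    r.hincrby(BUFFER_KEY, str(group_id), count)

def flush_buffer(pg_conn) -> None:
    """Periodically (e.g. from a scheduled Celery task) apply buffered counts in bulk."""
    pipe = r.pipeline()
    pipe.hgetall(BUFFER_KEY)
    pipe.delete(BUFFER_KEY)
    counts, _ = pipe.execute()
    if not counts:
        return
    with pg_conn.cursor() as cur:
        for group_id, count in counts.items():
            cur.execute(
                "UPDATE groups SET times_seen = times_seen + %s WHERE id = %s",
                (int(count), int(group_id)),
            )
    pg_conn.commit()
```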
Tobias Macey
0:08:33
And were you starting to run into any user facing problems or was most of the pain just in terms of the internal operations and trying to deal with managing the data and the cognitive load on the engineering side?
Ted Kaemming
0:08:47
Um, there's definitely
0:08:51
there's both kind of that internal pain as well as the user facing pain. I think really a lot of the user facing pain was just manifestations of the internal pain. So internally, like you mentioned, there's all this operational complexity with running all these different systems, there's just tons of points of failure, and scaling with the data volume was an issue. Also the code complexity — all those things eventually become user problems, because the operational complexity of having all these single points of failure, you know, that affects the user experience when those single points of failure fail and someone can't look at their data in Sentry. The code complexity manifests itself in the fact that we're not able to iterate and build new features as quickly as we'd like. Along with that, there was a bunch of other stuff like consistency problems, because all of these systems are ingesting at different rates, or this buffering system would sometimes be backed up, and so you would see these denormalizations that hadn't been updated in potentially several minutes. In the worst cases, those kinds of things would obviously affect the user experience. When those things affect the user experience, they write into support. That means an engineer is having to investigate what's going on. All of a sudden the engineer can't make forward progress on improving these things because we're busy, you know, diagnosing and explaining this sort of idiosyncratic behavior, and overall things just kind of slow down from there. So yeah, it was both internal, I think, as well as external — just not what we would have liked to present in an idealized situation.
Tobias Macey
0:10:39
Yeah, especially given that your product is used in such a way that people want to be notified in a timely fashion and potentially need near real time capabilities because it could be affecting their customers. So definitely a lot of cascading errors that can propagate from what seems like a simple tech debt problem to something that eventually explodes into something that, like you said, grinds everything to a halt. And so that finally catalyzed you into deciding that you needed to redesign and rebuild these capabilities. And I'm curious what your design and operational criteria were for determining what needed to go into building this new platform.
James Cunningham
0:11:20
Yeah, so I'd say, as far as all the decisions that we made in order to go to this new platform, one of the biggest leaders was that we had a big push for having environments be kind of a first class filter. We had to build a new dimensionality of data across all this denormalized data, which essentially doubled the storage that we had. And then we said to ourselves, all this is great, this looks cool, environments are dope — but what happens when we want to add another dimension, and another dimension? We're just going to continue to, I guess, extrapolate across this data set and eventually end up with 100 terabytes of, you know, five different dimensions of data. So we said to ourselves that we kind of needed a flat event model that we'd be able to search across, and, you know, there were a few other pieces that we wanted. On top of that, we wanted to be able to search across these arbitrary fields, whether those are custom tags or something that we kind of promote, like releases or traces, or searching across messages — we didn't want that to take as long as it did. And some of the other parts: we have all this data stored in, you know, this tag store and all these searches that we have to go through, but we had a completely different silo for time series data that, again, had to have that dimensionality in it. If we can search across these arbitrary fields, the next thing that a customer would ask for is, hey, can I please see a pretty graph? So if we could boil down that search and that time series data into the same system, we'd be destroying two systems with one rewrite.
Ted Kaemming
0:12:54
And also, as part of that process, I mean, you kind of always have these standard checkpoints — you know, replication and durability are obviously really important for us, ease of maintenance is huge, low cost as well. So even that just kind of ruled out some of the hosted magic storage solutions, those kinds of pressures.
Tobias Macey
0:13:20
And as you were deciding how to architect this new system, can you talk through some of the initial list of possible components that you were evaluating, and what the process was for determining whether something was going to stay or go in the final selection?
James Cunningham
0:13:36
Yeah, of course. So the first thing that we kind of crossed off was no more row orientation. Postgres had served us well — we hoped that we could engineer a good solution on top of it, but ultimately we decided we probably needed a different shape of database to get these queries across. We had, like, five major options. We had document stores; we had some sort of Google proprietary blend, because we are completely on GCP; we had more generic distributed query stuff, you know, a little bit of Spark, maybe a little bit of Presto; we took a look at other distributed databases — we ran a good amount of Cassandra at my old gig, so I knew how to run that; and we also said, oh hey, we could just put data down on disks ourselves and not have to worry about this. Some of the other seriously considered things we had were columnar stores. Some of the ones that we actually kicked the tires on — we kicked the tires on Kudu, Pinot, and Druid. And ultimately we found ClickHouse as a columnar store, and we kind of just started running it; it was one of the easiest ones to kick the tires on. Some of these other columnar stores built on top of distributed file systems — it really did take a good amount of bricks to put down in order to get to your first query. And some of the things that we wanted were figuring out the operational costs of that. We wanted to be able to iterate across queries, we wanted to be able to pare down all the dependencies that the service had. You know, while we weren't afraid to run a few JVMs, or to run, you know, a little bit of HDFS, that was something that realistically I might not want to have to have an entire engineer dedicated to running. And on the antithesis of that, we could choose some of this Google proprietary blend, but how would it feel to go from having Sentry only require Redis and Postgres to now saying you can only run the new version on Google? Yeah, that's a little bit silly. So we ended up really just getting through an MVP of, I think, both Kudu and ClickHouse, and one of the biggest ones that really did kick us — and for anyone listening, go ahead and correct me if I'm wrong, but one of my memories was that one of our engineers started loading data into Kudu, and you didn't really know when it was there. It was great for being able to crunch down on your numbers, but one of our biggest things, which you did kind of hint at, is that we do need real time data — to be able to write into this data store and then to be able to read it on a consistent basis. One of the things we needed it for: we have a feature called alert rules, where you say, hey, only tell me if an event with, you know, this tag and this value got in, and only if there were maybe, like, 10 events in the last hour. And you want to be able to read that pretty quickly, so that when that 10th event comes in you're not waiting minutes until that alert shows up — and ClickHouse is able to do that. And so that kind of just got its way up to number one.
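As a rough illustration of the alert rule style read James describes — counting matching events in the last hour and firing once a threshold is crossed — here is a hedged sketch using the clickhouse-driver Python client. The events table and its nested tag columns are assumptions, not Sentry's actual schema.

```python
# Hypothetical sketch of an alert-rule style read; the schema is illustrative.
from datetime import datetime, timedelta
from clickhouse_driver import Client

client = Client("localhost")

def should_alert(project_id: int, tag_key: str, tag_value: str, threshold: int = 10) -> bool:
    since = datetime.utcnow() - timedelta(hours=1)
    rows = client.execute(
        """
        SELECT count()
        FROM events
        WHERE project_id = %(project_id)s
          AND timestamp >= %(since)s
          AND tags.value[indexOf(tags.key, %(tag_key)s)] = %(tag_value)s
        """,
        {
            "project_id": project_id,
            "since": since,
            "tag_key": tag_key,
            "tag_value": tag_value,
        },
    )
    # Fire only once the threshold is crossed; freshness of this count is why
    # consistent, low-latency reads mattered in the evaluation.
    return rows[0][0] >= threshold
```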
Ted Kaemming
0:16:39
Yeah, I think also, in general, at Sentry we try and kind of bias a little bit towards relatively simple solutions. And with ClickHouse, at least to us, based on our backgrounds, it seemed more straightforward to get running, and I think that appealed to us quite a bit as well. The documentation is pretty solid. It's also open source — you know, a lot of the others were as well, but ClickHouse has a pretty active repository. They've been very responsive when we've had questions or issues, and they're very public about their development plan. So I think a lot of these things just kind of worked out in its favor.
Tobias Macey
0:17:22
Yeah, from what I've been able to understand, it's definitely a fairly new entrant into the overall database and data storage market, but I've heard a few different stories of people using it in fairly high load environments. So I heard about the work that you're doing with Snuba; as far as I understand, Cloudflare is also using it for some of their use cases, and they definitely operate at some pretty massive scale with high data volume. So it seems like a pretty impressive system that has a lot of different capabilities, and I was pretty impressed when I had some of the folks from Altinity on the podcast a while ago to talk about their experience of working on it and working with some of their clients on getting it deployed. And I'm curious what some of the other types of systems you were able to replace with ClickHouse were — given that, as you said, you had these four different systems that you had to replicate event data to, were you able to collapse them all down into this one storage engine?
Ted Kaemming
0:18:17
Yeah. So in our code base, those four different things — the TSDB, search, tag store, and node store — all have kind of abstract service interfaces that really just sort of evolved from the fact that it's an open source project and people wanted to use these different backends for it. Three of those are now backed by the same data set in ClickHouse. So all the TSDB data comes directly out of ClickHouse — there's no pre-aggregation that happens anymore; we're just ripping over individual rows, computing those aggregates on demand, at least for now. Search — some of the data for search still lives in Postgres, but a lot of it now just comes from the event data in ClickHouse, essentially. Tag store, we've removed — how many servers were we using for tags?
James Cunningham
0:19:06
We had — oh, goodness, like 12, with something like 32 cores and maybe 200-odd gigs each. But, you know, getting into some of these other stats a little bit further down the list: we went from 52 terabytes of SSD to two terabytes, which is a good number to come down from. Yeah,
Ted Kaemming
0:19:29
so we were able to — absolutely, yeah, we were able to decommission an entire Redis cluster, cluster in quotes, and this entire Postgres cluster, with drastically less hardware. And just the fact that it all reads from the same ClickHouse cluster, and there's none of this weird replication lag between all these systems — that's a huge positive.
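A small sketch of what computing time series "on demand" from raw rows looks like, as Ted described a moment ago, instead of maintaining pre-aggregated counters. toStartOfHour is a real ClickHouse function; the table and column names are illustrative.

```python
# Hypothetical sketch of serving time series straight from raw event rows.
from datetime import datetime, timedelta
from clickhouse_driver import Client

client = Client("localhost")

def hourly_event_counts(project_id: int, hours: int = 24):
    since = datetime.utcnow() - timedelta(hours=hours)
    return client.execute(
        """
        SELECT toStartOfHour(timestamp) AS bucket, count() AS events
        FROM events
        WHERE project_id = %(project_id)s AND timestamp >= %(since)s
        GROUP BY bucket
        ORDER BY bucket
        """,
        {"project_id": project_id, "since": since},
    )
```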
Tobias Macey
0:19:52
Can you talk a bit more about the overall architecture of Snuba itself, and some of the operational characteristics and experience that you've had with ClickHouse itself, and maybe some of the early pain points and sharp edges that you ran into as you were getting used to this new system?
Ted Kaemming
0:20:12
Yeah, sure. So I guess just to give you a brief overview of the architecture, because it's something that's really not particularly fancy. Snuba is just a relatively small Flask application — at least small when you compare it with the remainder of Sentry. So it's a Flask application and it just speaks HTTP. It's in Python. It's generally stateless. Writes, as they come in, go through a Kafka topic that's published to directly from the remainder of the Sentry codebase — the Sentry codebase and the Snuba codebase are actually completely independent, at least as far as the projects' repositories go. So Sentry writes into this Kafka topic, the Snuba consumer picks the events up, does some denormalization, some data munging — you know, kind of conventional Kafka consumer stuff — and writes large batches of events to ClickHouse. We don't use the ClickHouse Kafka engine or anything particularly special for that; we just use the Kafka driver from Confluent, which is librdkafka based. And that's all in Python. Reads come in much the same way, also over HTTP — not anything particularly fancy there either. We have various optimizations that we do: kind of just a general query cache and deduplication of queries, so that we don't have large queries with long run times executing concurrently on the cluster. We do some optimizations where we move some stuff from the WHERE clause in ClickHouse SQL to a PREWHERE clause, which is basically the closest thing you get to any sort of query optimization, and we do some other query rewriting stuff based on our domain model. There are other rate limits and quality of service metrics, logging type stuff that happens in there as well. As long as that all goes well, a response is returned to the caller with something that is almost identical to what you would get if you were just interacting with the HTTP interface of ClickHouse itself. If it doesn't go well, that ends up getting logged to Sentry, and we then kind of enter the system again to go look at it. So that's kind of a brief overview — it's nothing particularly fancy.
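A minimal sketch of the write path Ted outlines — a Kafka consumer that batches messages and bulk inserts them into ClickHouse — assuming the confluent-kafka and clickhouse-driver libraries. It mirrors the shape of the pipeline rather than Snuba's actual code; the topic, table, and payload fields are assumptions.

```python
# Hypothetical sketch of a batching consumer: read from Kafka, munge, and
# bulk insert into ClickHouse. Topic, table, and payload fields are assumed.
import json
from datetime import datetime
from confluent_kafka import Consumer
from clickhouse_driver import Client

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "snuba-style-consumer",
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
clickhouse = Client("localhost")

BATCH_SIZE = 1000
batch = []

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    event = json.loads(msg.value())  # assumed payload: event_id, project_id, ISO timestamp
    # Denormalization / munging of the event into a flat row would happen here.
    batch.append((
        event["event_id"],
        event["project_id"],
        datetime.fromisoformat(event["timestamp"]),
    ))
    if len(batch) >= BATCH_SIZE:
        # One large insert per batch: ClickHouse strongly prefers big, infrequent inserts.
        clickhouse.execute(
            "INSERT INTO events (event_id, project_id, timestamp) VALUES",
            batch,
        )
        consumer.commit(asynchronous=False)  # commit offsets only after the write succeeds
        batch = []
```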
Tobias Macey
0:22:31
Yeah, sometimes simple is best, particularly when you're dealing with something that is as critical path as this.
James Cunningham
0:22:36
Yeah, for sure. So, to talk a little bit about the early engineering that you might have alluded to: one of our biggest early difficulties was that we had, you know, put a lot of eggs in the Postgres basket. So we turned this on and the queries that we had set up for a row oriented database just absolutely did not map to a columnar store, which is a crazy thing to say,
Ted Kaemming
0:23:03
it's so easy to type select star.
James Cunningham
0:23:06
So easy — spelling out the columns is harder. But, you know, there are some things that just absolutely did not cut over to this columnar store, and we kind of had to redesign how we did every query. You know, Sentry kind of had a quick application of ORDER BY some arbitrary column and then LIMIT 1000, to be able to explicitly hit a B-tree index in Postgres. And that didn't matter in ClickHouse: any sort of LIMIT just kind of truncated what rows you were returning, and if you applied an ORDER BY, that would have taken your entire data set and ordered it. Among many other things, we had a lot of select stars everywhere, like Ted said, and that is honestly one of the worst ways to operate on a columnar store, because you're just reading from every literal file. So we had to change that a little bit. Some of the other things that we kind of had — you know, we didn't have a query planner, so there was a lot of taking a query and just kind of moving pieces around. One of the things that Ted alluded to was the notion of a PREWHERE: when you have multiple columns that you want to filter on in a WHERE clause, you kind of have the ability to give ClickHouse a little bit of a heuristic and say, this is the column that we believe has the highest selectivity. You put it in a PREWHERE clause, and it will read through that column first and decide which blocks it's going to read from for the rest of them. So if you have something along the lines of an event ID, which for us is globally unique, that might have a little bit higher selectivity than environment, or, you know, release might have a little bit higher selectivity. So we were kind of working around these edges by just swapping variables around and saying, well, did that make it faster? And when we said yes, we kind of threw some high fives around.
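To make the PREWHERE rewrite James describes concrete, here is a hypothetical before-and-after of that kind of query rewriting. PREWHERE itself is real ClickHouse SQL; the schema and literal values are illustrative.

```python
# Hypothetical before/after of moving the most selective predicate into
# PREWHERE; ClickHouse scans that column first and only reads the remaining
# columns for the granules that matched. Schema and values are illustrative.
before = """
SELECT event_id, message, environment
FROM events
WHERE project_id = 42
  AND environment = 'production'
  AND event_id = 'b2f9c4de11a24b6f9d7a3c1e8f5d2a90'
"""

after = """
SELECT event_id, message, environment
FROM events
PREWHERE event_id = 'b2f9c4de11a24b6f9d7a3c1e8f5d2a90'  -- globally unique, highest selectivity
WHERE project_id = 42
  AND environment = 'production'                        -- low-selectivity filters stay in WHERE
"""
```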
Ted Kaemming
0:24:53
Yeah. There was also just the integration into some of the query patterns we have in Sentry, which was a bit of a challenge. ClickHouse is really designed to do particularly well with inserts; it does not do particularly well with updates or deletes, to the point where they aren't even really syntactically valid in the ClickHouse flavored SQL. Sentry as a whole is particularly insert heavy, but it's not insert only, and so we had to work around, basically, the fact that ClickHouse is extremely oriented towards inserts. We kind of ended up with something — actually, James mentioned he worked on Cassandra in a past life, and I did as well — we ended up with an architecture that is fairly similar to Cassandra tombstones for how we delete data, where we kind of implement our own last write wins semantics on top of the ReplacingMergeTree in ClickHouse. There's a long blog post about how we do that as part of this field guide series that we've been working on, where we go into some of these weird things that we do with ClickHouse. Similarly, for things like those alerts that James mentioned earlier, we basically require sequential consistency to be able to execute those queries effectively. That becomes a problem when you're dealing with multi master replication, like ClickHouse does. So we ended up having to do some kind of dodgy load balancing stuff, where we don't have a literal primary for all writes, but we kind of have this ad hoc primary that all writes go to as long as it is up, and some subset of queries are only allowed to evaluate on that primary. It's not guaranteed sequential consistency in a true distributed systems sense, but it's good enough for what we need. It's also particularly complicated because the system doing the querying is not Snuba — it lives in the Sentry codebase. And so we basically need to be able to notify the Sentry codebase that these rows have been written to ClickHouse from Snuba as part of this. So we ended up having to engineer this solution where we have a commit log coming out of the Snuba Kafka consumer; the Sentry application is actually subscribed to that commit log Kafka topic and gates its own progress based on the progress of the Snuba writer. There's also a blog post on the Sentry blog that goes into more depth about how we specifically implemented that, as part of this field guide series. But yeah, things like that — we knew things like the mutations were going to be something that we had to manage, we just didn't particularly have a strategy around it, and the sequential consistency stuff probably caught us a little bit more by surprise than it should have, as we were doing some of our integration testing in production and noticed that some of the queries weren't exactly returning what we thought they would have. So that was something we also had to solve.
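Here is a rough sketch of the last-write-wins and tombstone pattern Ted describes, built on a ReplacingMergeTree table: deletes become inserts of a newer row with a tombstone flag, and readers exclude tombstones at query time. This illustrates the general technique with an assumed schema, not Sentry's actual tables or migration.

```python
# Rough sketch of last-write-wins deletes on ReplacingMergeTree; the schema is
# an assumption for illustration, not Sentry's actual table.
from clickhouse_driver import Client

client = Client("localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   String,
        project_id UInt64,
        timestamp  DateTime,
        message    String,
        deleted    UInt8 DEFAULT 0,  -- tombstone flag instead of a real DELETE
        version    UInt64            -- newest version wins when parts merge
    )
    ENGINE = ReplacingMergeTree(version)
    PARTITION BY toMonday(timestamp)
    ORDER BY (project_id, toStartOfDay(timestamp), event_id)
""")

def tombstone_event(event_id: str, project_id: int, timestamp, version: int) -> None:
    """'Delete' by inserting a newer row for the same sorting key with the tombstone flag set."""
    client.execute(
        "INSERT INTO events (event_id, project_id, timestamp, message, deleted, version) VALUES",
        [(event_id, project_id, timestamp, "", 1, version)],
    )

# Readers collapse duplicates and exclude tombstones at query time, e.g.:
#   SELECT ... FROM events FINAL WHERE deleted = 0
```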
Tobias Macey
0:28:23
And you mentioned that one of the reasons that you ended up going forward with ClickHouse rather than any of the other systems is that it was pretty easy to get up and running with and seemed fairly simple operationally. So I'm curious what you have found to be the case now that you're actually using it in production and putting it under heavier load in a clustered environment, and any sort of useful lessons that you've learned in the process that you think anybody else who is evaluating ClickHouse should know about.
James Cunningham
0:28:51
Absolutely. So this is this is my time to shine.
0:28:55
So one of the things that I kind of had to make a concession on is that I've never worked with a database that could possibly be bound by CPU. It's always been, you know, make sure that your disks are as fast as possible, you know that the data is on the
0:29:11
disks, you got to read from the disk.
0:29:13
And the reason that it very well could be bound by CPU is that, you know, I've seen compression in the past, and I didn't really understand what compression could actually give you until we turned ClickHouse on. Compression realistically brings our entire data set — we kind of alluded to it earlier — from 52 terabytes down to two terabytes, and about 800 gigs of that is surprisingly uncompressible, because it's unique, you know, 32 character strings. If anyone can tell me an algorithm that helps compress that, I think they made a TV series around that or something. But for the rest of the data, it's so well compressed that being able to actually compute across it does so well — we run a small number of servers to supply what is a large data set. If there was any advice to anyone out there: start by sharding. Never shard by two, because two is a cursed number in terms of distributed systems. But we really just started with, you know, three shards, three replicas, and with that blessed number of nine, we haven't gone up yet. We kind of have a high watermark of a terabyte per machine — Google gives you a certain amount of read and write off that disk based on how much storage you have, and we've kind of unlocked a certain level at one terabyte per machine, if anyone else is somehow running ClickHouse on GCP, that is. You know, we're about to apply our fourth shard. But realistically, some of the other things that are operationally sound: as much as we'd all love to, I guess, hammer on or praise XML, it is very explicit about what you have to write in. It's configured via XML; there's no runtime configuration that you're applying, there's no, you know, magic distribution of writing into an options store and watching that cascade into a cluster,
0:31:23
auto scaling.
0:31:24
Yeah, I'm not, you know, crunching in any Kubernetes pods or anything like that. One of the things I'd be remiss not to say is that you did mention Cloudflare is running ClickHouse — and shout out to Cloudflare, they run real hardware, and I'll never do that again in my life. But one of the things that they alluded to in one of their kick-ass blogs about ClickHouse is that it replicates so fast that they found it more performant, when a disk in a RAID 10 dies, to just wipe all the data, rebuild the disk essentially empty, and have ClickHouse refill it itself. It is crazy fast in terms of re-replication; since all of that is compressed, it really just sends that across the wire. Some of the other stuff that we found completely great operationally is that since it is CPU bound, it's mostly bound by reads. When you are a write heavy company and you're now bound by reads in terms of cost of goods sold, I can throw around a million high fives about that. It's great to just watch people log in and actually look at their data and watch our graphs tick up, instead of just saying, well, you know, we spent a lot of money on this, and people are only reading, you know, 1% of their data. One other piece that I'd be remiss not to mention: some niceties about ClickHouse that kind of separate it from a few of the databases I've worked with are the ability to set some very quick — either throttling or kind of turbo-ing — settings on the client side. So some of the things that we might do: if we know that a query is going to be expensive, we could, you know, sacrifice a little bit of resources and kind of get it back fast. There is just a literal setting that is max threads, where I say, you know what, I really want this to run faster, set max threads to eight instead of four. And it does exactly what it says it does — it'll run twice as fast if you give it twice as many threads. So there are pretty easy things that we kind of work with operationally. I think that, as far as a database goes, one of the hardest things to do is just to read all of the settings to figure out what they do. But after you get versed in it, you'll understand what applying this setting might do, or at what threshold you might set something, and it's not very magical — some of these settings realistically are for very explicit types of queries that you'd only supply from the client side if you really needed them. So I wouldn't go so far as to call it simple — the configuration is almost, like, dumb. Straightforward, very straightforward. Yeah.
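The per-query "turbo" knob James mentions is just a ClickHouse setting supplied from the client. A small sketch with clickhouse-driver, where the query and table are illustrative:

```python
# max_threads is a real ClickHouse setting that can be passed per query from
# the client; the query and table here are illustrative.
from clickhouse_driver import Client

client = Client("localhost")

rows = client.execute(
    "SELECT project_id, count() FROM events GROUP BY project_id",
    settings={"max_threads": 8},  # let this known-expensive query use more cores
)
```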
Tobias Macey
0:33:58
And my understanding is that ClickHouse itself is written in C++, so it runs very close to the metal on the instances that it's running on, but for clustering it relies on ZooKeeper, which, as you mentioned, is a JVM process. And since you're already running Kafka, I imagine that you already have a ZooKeeper cluster for that. So I'm curious if you've been able to just use that same ZooKeeper cluster in sort of a multi-tenant fashion to service both Kafka and ClickHouse, or if you built a dedicated ZooKeeper cluster or a ZooKeeper node for the ClickHouse databases on their own.
James Cunningham
0:34:32
So that question has kind of a two part answer. Realistically, for Kafka — I have a long winded story about going from Kafka pre-0.8 to post-0.8, or whenever it was they introduced storing consumer offsets inside of Kafka itself. After that, ZooKeeper is really only a dependency for, like, hello, you are consumer one, you have control over these partitions — and, you know, leader election and all that strong consistency stuff. But realistically, if you had some IoT toasters, you could throw down a copy of ZooKeeper on those. So we bundle those alongside — I guess physically co-located with — the Kafkas that are running; we have ZooKeeper there. And one of the other things I kind of learned from managing ZooKeeper at my old gig is that, if you have the opportunity, you really don't want to mix things in terms of multi-tenancy for ZooKeeper and a strongly consistent system — if for some reason that's breaking down, you don't want it to push that load onto someone else. So ZooKeeper for ClickHouse is dependent on, you know, who has received which block and where to get that from; it's kind of used for the way that replication is designed. It's also kind of helpful for, I guess, a setting that they have, which is select sequential consistency — which actually didn't work out too well for us, but it is one of those ideas that if you want to read this block, you can block on whether or not all the replicas have gotten it. So they kind of use ZooKeeper as that consistent state for which blocks live where — where did this write go to, and if I want to get this write, which server should I reach out to? And, you know, in terms of gossip, in terms of adding more replicas, realistically not that much pressure is applied to ZooKeeper, but it is physically and logically segmented, just because that's kind of one of those tin foil hats that I wear around here.
Tobias Macey
0:36:29
Yeah, working in operations, it's definitely always the better safe than sorry approach, because as soon as you're not safe, you're definitely going to be sorry.
James Cunningham
0:36:38
Absolutely. There's a question that I don't really like answering, which is, oh hey, why is Kafka backing up? Oh, well, that's because ClickHouse is doing something wrong. And someone turns around and asks me, why in God's name did these two systems, which don't know about each other, negatively impact each
Tobias Macey
0:36:55
other? Absolutely shared nothing.
Ted Kaemming
0:36:59
Yes.
0:37:00
Sort of our disaster scenario is, like, ClickHouse goes completely offline and the ZooKeeper cluster gets trashed. In that case, at least our Kafka cluster is still running and just queuing up writes that, hopefully, when we recover, we can write through. God forbid that happens, but at least if it does, there's some runway.
0:37:23
There's some amount of insulation from that.
Tobias Macey
0:37:26
One of the other interesting pieces to discuss is, because of the fact that you already had a running system, you already had this large volume of data — as you mentioned a couple of times, the 52 terabytes worth of events. What was your process for introducing ClickHouse and Snuba into that architecture and replicating the existing data into it, while still maintaining uptime and still servicing end user requests, so that they didn't notice that cutover and the introduction of this new system except in a positive sense?
Ted Kaemming
0:37:55
Yeah, that's a good question. So I think this entire process, from start to being at 100% live, probably took roughly a year, give or take.
0:38:13
And during that process, I don't recall that we had any outages related to this, or any downtime — not even, like, planned maintenance related to this — which is kind of wild, because, like we talked about earlier, people use Sentry to monitor their applications. If the monitor is down, you know, that doesn't reflect particularly well on our quality of service, and it also leaves all of our customers in the dark. So what we ended up doing is we were both dual writing as well as dual reading to those legacy systems that we discussed earlier — Redis, Postgres, all these other things. We were running with those, reading from those, while we were spinning up the new Snuba slash ClickHouse cluster. So we were doing that for probably months towards the tail end of this process. With that, what we were doing — and I don't think this is a particularly novel approach, but it does seem like it's something that not many people talk about, at least from what I've seen; maybe that's just people considering this aspect less interesting than talking about the shiny new tech — but earlier we talked about having these sort of abstract service interfaces, which came from just the flexibility of it being an open source project. We built this thing called the service delegator, which basically delegated requests to these services in the background. So we could run a request against the TSDB — we would run against the legacy system as the primary, and in another thread we'd also execute that same request against Snuba. It would wait for both of those requests to complete, then it would shovel a bunch of data off that we would then do some analysis on and write back to a different ClickHouse cluster, where we do analysis on it to figure out essentially how similar the results from both of these systems were. So this allowed us to test different schemas on the ClickHouse side, and we tuned a bunch of settings. There's basically no risk with that, because worst case scenario, the ClickHouse cluster would back up on load; we would just turn the switch down, let it drain, change that configuration setting again, and spin the reads up in the background and figure out if that solved the issue. So we went through a bunch of different iterations on the schema, on all these configuration parameters, to figure out basically what was the best fit for our needs. This also allowed us to get a lot of operational experience in production without the fear of any sort of user facing incident or outage. There were a few hiccups — we definitely learned some stuff operationally. Had we done this the, like, light switch way, where we just go, well, we think we tested this really well, the test suite passes, let's flip it and see what happens — I think we probably would have had some fairly embarrassing results coming out of that.
So yeah, there was a lot of just kind of plotting of these similarity scores between these different backends and identifying these inconsistencies. Like, the sequential reads issue that I mentioned earlier showed up as part of this process, where we realized there was this sort of interesting distribution when we were looking at the results from our time series database. Basically, the output that we were seeing when we plotted these similarity scores, which were normalized from basically zero to one — we'd see this bimodal distribution where roughly half of the results were exactly the same, and roughly half were not at all the same. And it took some digging to figure that out and realize that it was because this automated system was making a request as soon as it thought that the request had been persisted — but it had been persisted to the legacy systems, and it hadn't actually made it all the way through the Kafka pipeline, so to speak, into ClickHouse. And so it was issuing requests against ClickHouse that either weren't on the server it was requesting them from, or that just hadn't even made it all the way through the processing pipeline yet. So that's also something that we wrote a particularly long winded blog post about as part of this field guide series, and it kind of walks through that process in a little bit more detail. But basically, by the end of it, we had been running for a couple of weeks, I think, at 100%, both the old system as well as the new system. We had all this consistency monitoring in place, we hadn't seen anything particularly aberrant, and we
0:43:00
just basically, you know, cut over from the secondary to the primary, and some queries got a little bit faster. But otherwise, it was essentially a non event at that point, which is probably the most relaxing launch that I've had of, like, anything in my career, I think.
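A hypothetical sketch of the service delegator pattern Ted describes: serve the result from the legacy backend, run the same request against the new backend in a background thread, and record a similarity score for later analysis. The backend objects and the scoring are stand-ins for the real comparison logic.

```python
# Hypothetical sketch of a dual-read delegator; backends and scoring are
# stand-ins for the real comparison logic.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=8)

def _compare_and_record(primary_result, secondary_future, record) -> None:
    try:
        secondary_result = secondary_future.result(timeout=10)
        # The real similarity scoring was fuzzier than strict equality.
        score = 1.0 if primary_result == secondary_result else 0.0
    except Exception:
        score = 0.0
    record(score)  # e.g. persist the score somewhere for offline analysis

def delegated_query(legacy_backend, new_backend, request, record):
    """Serve from the legacy backend; shadow the same request to the new backend."""
    secondary_future = executor.submit(new_backend.query, request)
    primary_result = legacy_backend.query(request)  # this is what the caller sees
    executor.submit(_compare_and_record, primary_result, secondary_future, record)
    return primary_result
```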
Tobias Macey
0:43:18
Yeah, that's definitely always a good experience.
Ted Kaemming
0:43:21
Yeah, it was very low stress, which doesn't always happen in, kind of, infrastructure engineering land.
Tobias Macey
0:43:30
And now that you have cut over to using Snuba and ClickHouse, and you're storing all of your event data in this new storage layer, what have you found to be some of the downstream effects of being able to have this greater reliability and consistency? And what are some of the other consuming systems that have started to leverage ClickHouse and the data that you've stored there for different purposes that were either previously impractical or impossible?
James Cunningham
0:43:59
Yeah, so basically we set out to replace these systems, but, you know, that's the baseline, right? Now that we've moved into the new house, can we start building some shiny things on top, right? So all of the products that we launch — again, shout out to our blog — pretty easy now. Any graph that you might see, any search, any filter, any sort: that's all powered by Snuba and ClickHouse. So all the features that we're developing for greater data visibility, all of those are powered by this new search system, and they could not have existed previously. Some of the other, more infrastructure benefits that we've kind of seen: deleting expired event data from Postgres was a pretty easy query on paper, where you just say, delete where the timestamp is outside of retention. And that's great when you don't have to do any bulk deletions, you might be able to keep it simple; but if we're deleting individual events one by one, we are literally doubling the amount of commits, the amount of transactions, the amount of WAL that we're writing to this Postgres machine, just to be able to not run out of space, and to make sure that we actually delete our customers' data when we said we would. Now, in ClickHouse, since it's column oriented storage, we, I guess, physically segment all these files across weeks, and once one of these segments rolls out of retention, we just delete the folder. Running a DROP PARTITION takes seconds — it's just an unlink — and it's one of the easiest things we've ever done. It is a great thing to say, hey, what happened to the data? It got deleted, and nobody noticed — that's fine. One of the pieces of resilience is that whenever we wanted to upgrade some of these other machines, whether that be, you know, Redis or Postgres or even Riak, the upgrade processes took a very long amount of time for operations. We would all get into the same room, we'd have task plans, we would be checking off these boxes — shuffling these things over here, turning this off, turning that on, crossing your fingers that PgBouncer is good — all right, we're good. But with ClickHouse, when we have these three by three shards, realistically all we're doing is just saying, all right, bump the version, restart it, and if it looks cool on this machine, push it to the other eight. There have been times where we've rolled back because of, you know, maybe various settings that we missed setting, maybe something that wasn't baked into a default that realistically should have been. I think in my time of upgrading, I've probably upgraded about 12 times to whatever releases they've cut, and I think maybe three or four of those have been rolled back. But it's really just, whoops, that didn't work, and nobody noticed — well, hopefully nobody noticed. You know, I kind of save that for maybe a Saturday or Sunday, because our traffic's pretty cyclical. So if you're using Sentry on the weekends, I don't know, go back to watching TV. But realistically, in terms of our downstream benefits, like
0:46:55
it's the compression it's
0:46:57
it's all the ease of operation. Realistically, it's just something that we don't babysit that much anymore.
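A hedged sketch of the retention scheme James describes, where data is partitioned by week and expiring old data is a cheap DROP PARTITION rather than row-by-row deletes. The table name and weekly partition layout are assumptions.

```python
# Hedged sketch of week-based retention via DROP PARTITION; assumes a table
# partitioned by toMonday(timestamp), so partitions show up as Monday dates.
from datetime import datetime, timedelta
from clickhouse_driver import Client

client = Client("localhost")
RETENTION_DAYS = 90

def drop_expired_partitions() -> None:
    cutoff = datetime.utcnow().date() - timedelta(days=RETENTION_DAYS)
    partitions = client.execute(
        "SELECT DISTINCT partition FROM system.parts "
        "WHERE database = currentDatabase() AND table = 'events' AND active"
    )
    for (partition,) in partitions:
        week = datetime.strptime(partition.strip("'"), "%Y-%m-%d").date()
        if week < cutoff:
            # Dropping the whole partition is effectively an unlink, not row-by-row deletes.
            client.execute(f"ALTER TABLE events DROP PARTITION '{week.isoformat()}'")
```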
Ted Kaemming
0:47:03
Yeah, I think, anecdotally, the support load of "why is this system doing something weird" has gone down. The data stores don't get out of sync in the way they could previously, and data comes through in a more timely fashion. So there's no, hey, why does this count not match this other count — all the counts now come from the same spot. It's also something where I think ClickHouse has kind of snuck into some different places of our internal systems as well. All the requests that come through Sentry, both from the load balancer as well as the results of essentially every request that hits our application servers, also go into ClickHouse, which is something we had always talked about. You know, when someone writes in and says, hey, I have this UUID, what happened — we can actually give them an answer without just grepping through tons of log files or just saying, well, it looks like there's a lot of rate limited activity, so it probably happened over there. We can actually give them a conclusive answer. So it's definitely something that we're getting a lot of mileage out of, and we're finding new uses for it routinely.
Tobias Macey
0:48:20
And the Snuba system itself — it is an open source project. Is it something that is potentially useful outside of the context of Sentry, where somebody might be able to adapt it to different search implementations? Or is it something that's fairly closely linked to the way that the Sentry application is using it, where it wouldn't really be worth spending the time on trying to replicate the functionality for a different backend or for a different consumer?
Ted Kaemming
0:48:48
Yeah, that's a really good question. So, as of literally today, it historically has been extremely coupled to our domain model. It is less coupled to our domain model today, but it's still pretty coupled to our domain model. Unfortunately, most of the work that has gone into it has been motivated by our product needs, so a lot of the early internals are pretty closely coupled to those original goals — really just as a function of trying to get it out as quickly as possible so we could stop turning on Postgres servers and dealing with Postgres query planner related outages or transaction wraparound, those kinds of just, like, scary nightmare things. So yeah, it was originally pretty coupled to the original data model. That data model, though, is evolving. We're adding new data sets in ClickHouse — like I mentioned, those event outcomes that get logged into it. So internally there's a lot of that refactoring happening to reduce coupling. I think sort of the logical end state of that is something that is more abstract, but as it is right now, to add a new data set — a data set is sort of our analogous concept to a table — you essentially have to fork the project; all the DDL happens in code. We'd love for that to be more declarative, whether that happens to be configuration or some other means where you're not getting in there writing Python and literally editing the project. That would be, I think, where we'd like to see it go. Also, right now the code is pretty stable, but documentation is definitely a work in progress. So that's something — I think for anyone who wanted to experiment with it, we need to invest more time in making it easier to do that. The code itself, though, is obviously on GitHub; anybody is free to browse around and take a look at it, and we do welcome contributions, though right now the barrier to entry is definitely higher than we would like it to be in the long term.
Tobias Macey
0:51:00
Are there any other upcoming new projects or any other major data challenges that you're facing at Sentry that are either causing enough pain that you need to do a major refactor, or anything that is forward looking that you're able to spend time and focus on now that you've freed yourself from the burden of trying to maintain these multiple different systems and their lagging consistency?
James Cunningham
0:51:24
Yeah, solid question. So, we really did just cut over — the thing that I alluded to is what we call outcomes. All the data that is in this main ClickHouse cluster, those are all accepted events, but we have a certain amount of other outcomes that do happen: we have events that might be filtered, events that might be dropped due to a configured rate limit, or because someone has gone over their project quota. We have invalid events as well. We have a pretty substantial abuse tier from people that just keep sending us events even though we told them to stop. And all of those counts used to go into kind of this TSDB model, with all of the other dimensions that we had. And honestly, I think just this Tuesday, we finally stopped writing about 300,000 increments a second into that data set in Redis, and it's all being written into Snuba at this point. Moving forward — yes, we can't give away the full product roadmap, but we've had a lot of, like, draws from the idea tank, and we say, what happens if Sentry starts storing more than errors and crashes? And all the ClickHouse work is kind of centered around that at this point. So we're doing a good amount of internal testing — if anybody wants to read some source code changes, they could probably learn a little bit more about that, but I guess good luck on that reading. We do okay at documenting our pull requests and what they're about, but there's a lot of chatter on internal lines that someone might not have a lot of context on, but
Ted Kaemming
0:52:58
But you can read between the lines and figure out where things are going. We're not particularly hiding anything, even though we might be a little bit cagey about it. One of the other things that's sort of a challenge moving forward, beyond just new types of data: I mentioned the search backend before and how it was basically querying a bunch of different Postgres tables. Those tables eventually started to live in different databases, and joining across those databases is hard. That's still basically a problem that exists today, except now, rather than joining between a Postgres table on one host and a Postgres table on another host, we're joining between a Postgres table on one host and ClickHouse, a completely different cluster. So one of the things that we've been experimenting with is implementing a change data capture pipeline, essentially replicating our data from Postgres via the logical decoding interfaces and writing it through Kafka into ClickHouse, so that ClickHouse would actually act as a read replica for this data that exists in Postgres. We haven't really stress tested any sort of complex join clauses yet, so we're going to see what happens with that. But ideally we can have these cross-database joins effectively not be cross-database and have that data be co-located. The compression works so effectively, and the subset of data that we have in Postgres that we need to join on is so small, that we can probably have full replicas of the Postgres tables that we use in search on literally every ClickHouse node. So it'd be a local join on all these servers, which would be wildly better than passing around large sets of IDs, like we have to do in some cases now. That's definitely something that's on the roadmap that we're looking to figure out, and, in accordance with our general philosophy, that's going to be available on GitHub for people to experiment with as well.
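To illustrate the change data capture idea Ted describes, here is a minimal sketch, assuming a Kafka topic that carries changes decoded from the Postgres write-ahead log and a ClickHouse table acting as the replica. The topic name, the simplified JSON message shape, and the groups_replica table are invented for this example; real logical decoding output (for example from wal2json or Debezium) is richer than this, and this is not Sentry's actual pipeline.

```python
import json
from confluent_kafka import Consumer   # pip install confluent-kafka
from clickhouse_driver import Client   # pip install clickhouse-driver

clickhouse = Client(host="localhost")
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-clickhouse-writer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["postgres.public.groups"])  # hypothetical CDC topic

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Assumed simplified shape: {"op": "insert", "table": "groups", "row": {"id": 1, "status": 0}}
    change = json.loads(msg.value())
    if change.get("op") == "insert" and change.get("table") == "groups":
        row = change["row"]
        batch.append((row["id"], row["status"]))
    if len(batch) >= 1000:  # write in batches; many tiny inserts are expensive in ClickHouse
        clickhouse.execute("INSERT INTO groups_replica (id, status) VALUES", batch)
        batch.clear()
```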
Tobias Macey
0:55:03
Are there any other aspects of your work on Snuba and using ClickHouse, and just the broader context of search and storage at Sentry, that we didn't discuss yet that you'd like to cover before we close out the show?
James Cunningham
0:55:15
Shameless plug: we're hiring.
0:55:19
If anything sounded interesting, like, find us on the internet. That's about it.
Ted Kaemming
0:55:25
I guess one thing that we didn't really talk about with, like, future work: right now, as a product, Sentry basically gives you access to a lot of data, but there's no opinionated insight that comes along with that. So there's definitely value in opening Sentry, looking at your issue stream, and seeing that a thing is particularly bad relative to history, or whatever, but that doesn't give you any sort of qualitative information about, like, if this number goes up, does that mean this thing is bad? There's basically a whole lot of context that's missing there. So one of the things that we've been talking about is trying to supply more of that context as well. At least as far as the search and storage team goes, beyond just the rote acts of literally searching and storing, I think the ability to get more insight from the data is something that's on our roadmap that we're going to be tackling, because now the data is where it needs to be to be able to do more interesting stuff with it.
James Cunningham
0:56:38
I think an analogy that we kind of have with Sentry right now is: if you were to walk into an American diner and get, like, a ten-page menu, and the waiter walks up and says, "Hey, how are you doing?" and you say, "Oh, what's good here?" and they tell you everything, everything's good, like,
0:56:55
ah, sure.
0:56:57
So hopefully we can stop just storing and maybe start suggesting in the future. All right?
Tobias Macey
0:57:04
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I'll start with you, James.
James Cunningham
0:57:23
It is very difficult to scale up from someone's individual laptop, kicking the tires on a database, to what very well could be a production shape. There are literal teams at a company that might be dedicated just to having the ability to assess the impact of a query that someone is writing. How can we go from a laptop to saying, "that's bad" or "that's blessed"? So right now, I think one of the largest gaps in terms of tooling is being able to assess impact when going from, you know, a single SSD to what might be a data warehouse.
Tobias Macey
0:57:57
All right. Do you have an answer, Ted?
Ted Kaemming
0:57:59
I think, kind of building on James's answer to some extent: the ability for us to have a good understanding of, when this particular query is executed, how many resources it is going to take. That historically was something that we had a lot of issues with in Postgres land, because the query planner would, you know, do some predictive things. Ninety-nine point some number of nines percent of the time it would do the thing that you expected and wanted it to do; the other fraction of a percent it wouldn't. And with Postgres, with EXPLAIN, there's a lot of insight into why it chose the particular plan it did, but there was no general model for being able to say, "Hey, I think this query will cost too much, don't run it." I know that there's been some amount of research around that; I have not directly seen any of it. I think having that sort of cost estimation, and being able to do more effective rate limiting based not just on the quantity of queries but on the cost of queries, would be something that would be interesting to see more of in these, I guess I would say, consumer-grade databases. With ClickHouse, we have extremely predictable costs, because it's basically just scanning: everything is going to be scanning something, and the cost is particularly easy to estimate from there. But as things get more complicated, there's going to be more variance in all of these runtimes, basically, and it'd be nice if that were something that was a little bit easier to account for in a way other than just the volume of queries.
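As a toy illustration of the "rate limit on query cost, not query count" idea, here is a small sketch with an invented cost model: estimated rows scanned is assumed to be the queried time range times an assumed ingest rate, and a sliding-window budget rejects queries once recent estimated cost gets too high. All names and numbers here are made up for illustration.

```python
import time
from collections import deque

ROWS_PER_SECOND = 50_000  # assumed average ingest rate for the table

def estimated_cost(time_range_seconds: int, projects: int) -> float:
    """Crude cost estimate: how many rows a full scan of the window would touch."""
    return time_range_seconds * ROWS_PER_SECOND * max(projects, 1)

class CostBudget:
    """Sliding-window budget: reject queries once recent estimated cost is too high."""
    def __init__(self, max_cost_per_minute: float):
        self.max_cost = max_cost_per_minute
        self.window = deque()  # (timestamp, cost) pairs from the last minute

    def allow(self, cost: float) -> bool:
        now = time.monotonic()
        # Drop entries older than 60 seconds.
        while self.window and now - self.window[0][0] > 60:
            self.window.popleft()
        spent = sum(c for _, c in self.window)
        if spent + cost > self.max_cost:
            return False
        self.window.append((now, cost))
        return True

budget = CostBudget(max_cost_per_minute=5e9)
cost = estimated_cost(time_range_seconds=86_400, projects=1)
print("run query" if budget.allow(cost) else "reject: too expensive")
```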
Tobias Macey
0:59:58
All right, well, thank you both for taking the time to join me and share your experience of building and deploying Snuba and of working with ClickHouse. It's definitely great to get some of that insider experience and discuss some of the pains and successes that you've gone through. So thank you for all of your time and your efforts on that front, and I hope you enjoy the rest of your day.
James Cunningham
1:00:21
Thank you
Tobias Macey
1:00:28
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.