Summary
Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second; the information needs to be propagated to a central location, processed, and analyzed within milliseconds or single-digit seconds; and the consumers of the data need to be able to query it quickly and flexibly. As the systems that we build continue to grow in scale and complexity, the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
- Your host is Tobias Macey and today I’m interviewing Rob Skillington about Chronosphere, a scalable, reliable and customizable monitoring-as-a-service purpose built for cloud-native applications.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Chronosphere and your motivation for turning it into a business?
- What are the biggest challenges inherent to monitoring use cases?
- How does the advent of cloud native environments complicate things further?
- While you were at Uber you helped to create the M3 storage engine. There are a wide array of time series databases available, including many purpose built for metrics use cases. What were the missing pieces that made it necessary to create a new system?
- How do you handle schema design/data modeling for metrics storage?
- How do the usage patterns of metrics systems contribute to the complexity of building a storage layer to support them?
- What are the optimizations that need to be made for the read and write paths in M3?
- How do you handle high cardinality of metrics and ad-hoc queries to understand system behaviors?
- What are the scaling factors for M3?
- Can you describe how you have architected the Chronosphere platform?
- What are the convenience features built on top of M3 that you are creating at Chronosphere?
- How do you handle deployment and scaling of your infrastructure given the scale of the businesses that you are working with?
- Beyond just server infrastructure and application behavior, what are some of the other sources of metrics that you and your users are sending into Chronosphere?
- How do those alternative metrics sources complicate the work of generating useful insights from the data?
- In addition to the read and write loads, metrics systems also need to be able to identify patterns, thresholds, and anomalies in the data to alert on it with minimal latency. How do you handle that in the Chronosphere platform?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Chronosphere/M3 used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Chronosphere?
- When is Chronosphere the wrong choice?
- What do you have planned for the future of the platform and business?
Contact Info
- @roskilli on Twitter
- robskillington on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Chronosphere
- Lidar
- Cloud Native
- M3DB
- OpenTracing
- Metrics/Telemetry
- Graphite
- InfluxDB
- Clickhouse
- Prometheus
- Inverted Index
- Druid
- Cardinality
- Apache Flink
- HDFS
- Avro
- Grafana
- Tecton
- Datadog
- Kubernetes
- Sourcegraph
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode of the Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud scale infrastructure and applications. Datadog's machine learning based alerts, customizable dashboards, and 400 plus vendor backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in 1 place, you can easily improve your application performance. Try Datadog free by starting your 14 day free trial and receive a free t-shirt once you install the agent.
Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring. Your host is Tobias Macey. And today, I'm interviewing Rob Skillington about Chronosphere, a scalable, reliable, and customizable monitoring as a service purpose built for cloud native applications. So, Rob, can you start by introducing yourself?
[00:01:47] Unknown:
Hey, Tobias. Yeah. Thanks for talking today. My name is Rob Skillington. I'm the CTO over here at Chronosphere, which is a cloud native and scalable monitoring and observability tool. I'm looking forward to kind of talking about data monitoring and everything else.
[00:02:07] Unknown:
Do you remember how you first got involved in the area of working with data and dealing with metrics and logging and all of that fun stuff?
[00:02:16] Unknown:
As a developer, you know, going back to my very first, like, few jobs, observing and collecting data about, like, how your application or the data that you're storing is doing is obviously always a core competency. But I guess I got more involved when, you know, I was working on a reporting system that had to serve reports for different search terms for searches that were done on a kind of marketing inventory site for architectural products. And so this was a marketing firm, like, a marketing company that hosted several catalogs online for different products, and we built a lot of reports for them to understand how people were performing searches on the sites and what kind of terms led to this and that. And then, you know, just fundamentally, the amount of data that we had to collect and then aggregate to give any meaningful sorts of insights into that data was pretty interesting, and we did a lot of work to aggregate and roll up that data to make that product experience worthwhile.
That was something that I worked on while I was at university still. And then when I went to Microsoft in Seattle after leaving university, I kinda worked on monitoring for the Azure Active Directory team. You know, it was obviously on a whole different level of scale, and I kind of got to know a bit more about the pipelines there and how the different teams and different business units kind of, like, exchanged data in general. And then, obviously, I was actually kind of working on some monitoring and observability problems over there at Microsoft as well. So those were the first 2 times kind of dipping the toes in the water.
[00:04:02] Unknown:
And so I know that you also spent a chunk of time at Uber, which led to your cocreation of the M3 storage engine for metrics there. But before you dig into that, I'm wondering if you can give a bit of an overview about what it is that you're building at Chronosphere and your motivation for building a business around monitoring and particularly the focus on cloud native?
[00:04:23] Unknown:
So Uber was definitely where, I like to say, we were given the opportunity to really dig our fingers into a pretty thorny problem that the business was facing, which was operating at a level of scale with the amount of real time data that was flowing through the business and giving them the foundations to be able to make sense of that and also be able to reliably operate their systems. How that kind of happened was I joined the team after working on some payments infrastructure, and I followed a few friends along to Uber, which had a lot of similar problems.
When a payment system is down, you have a lot of very frustrated merchants because they're literally losing money every second that you're down, unable to run their business and take credit card transactions from their customers. And then at Uber, you know, the level of reliability needed was kinda similar to that. Like, if you were down in a city for more than 10 or 20 minutes, there were a lot of people out of work, and that caused very real brand impacts as well as, obviously, putting people, you know, out of their jobs for meaningful amounts of time in the day, which was just a huge problem. We just couldn't risk that kind of constant level of reliability problems. So after about 6 or 9 months of working on the marketplace matching system that I joined Uber for, you know, when we'd kind of rewritten the Node.js dispatching services into a more robust set of microservices for dispatch, I got the chance to kind of look around and see what else I wanted to help the business with. And metrics, both system level as well as operational level metrics that were used very widely by the rest of the company, was something where both the inaccessibility of really using them at the level that folks wanted to, as well as the amount of scale problems that the business faced to continue to make that a tool that was actually useful to developers at Uber, was something that I wanted to dive into and spend pretty much the next few years kind of solving there with my cofounder, Martin.
[00:06:52] Unknown:
In terms of the challenges that are inherent to monitoring, I know that there are a lot of usage patterns that are unique to metrics and log data that aren't necessarily represented in other types of data use cases. So I'm wondering if you can just talk about some of the complexities that arise because of the ways that monitoring information is used and the patterns around that and how the advent of cloud native environments and workflows complicates things further?
[00:07:22] Unknown:
I think that, basically, when you talk about, like, storing data and the amount of information that you store, a lot of people tend to think, perhaps IoT is massive scale. Perhaps, you know, like, exploring the data collected by lidar on a self driving car is massive. And while those use cases can be large in individual deployments, they're not at the scale of how much information an individual computer, even your laptop, can emit about itself at a very high frequency. Your mobile phone is running hundreds of processes and sending tons of data to the Internet every second.
And so it tends to actually be, like, outside of all these, you know, use cases that we're thinking are on the verge of, like, causing large volumes of data for us to collect and analyze. While that is true, software itself is probably the leading use case for generating information about how it's running. And the volume that that is at tends to be 1 of the highest volumes of information recording that we're seeing in the real world. At least, generally, I think most people are in agreement that just the level of scale and granularity that information can be collected at and looked at is rather massive.
And so, you know, as you kinda mentioned, how to even harness that is tough and difficult. There's so many intangible things about software, and so you kind of have to limit your scope to, you know, what kind of things do I want to actually pull out of this infinite sea of data that could be generated about a piece of software running on a device somewhere. And, yeah, today, we extract 3 main kinds of data. There's log like data, so something that you as an application developer wrote to capture an event that your program did that you wanna look at later. Metrics, which is more like a signal that could be kind of emitted from your program in a more type safe way, I would say, than logs. You know, you've really got to choose a type of metric, and then you've got to choose the different specific dimensions that you wanna be able to pivot on for that metric.
It's kinda less free-form than logs, but then by the very nature of you having to think about what that metric is to expose it, it can be more meaningful to you when kind of, like, looking at metric data versus log data, because you already put in the thought to describe kind of, like, what you're measuring with metrics. And then tracing is a very interesting kind of compound view of both that event data that's kinda happening throughout your system, but then tied back to an actual individual component that's performing it, and then being able to visualize the events for a given kind of request or a given kind of flow of a user using your system as it crosses the different component boundaries. You know, in the cloud native world, we think of most of the time those boundaries as bits of code executing between different microservices or back end services, and that's kind of where, you know, the component is sought to be captured at the granularity of an event being attributed to a component. But, you know, tracing can also be used, obviously, on a mobile device to look at, you know, a complex piece of iOS or Java software running on your Android or iPhone device as well, because you have many code libraries, obviously, that you call into from your application, and then, obviously, the operating system is doing things. So tracing is kind of a way of, like, looking at a call diagram across component boundaries, and those component boundaries are different based on what you're kind of observing.
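To make the metrics part of that concrete, here is a minimal sketch of what "type safe" metric emission with explicit dimensions can look like, using the Prometheus Python client as an example. The metric names, label names, and port here are hypothetical illustration choices, not anything specific to Chronosphere or M3.

```python
# Illustrative sketch: choosing a metric type and the dimensions to pivot on,
# using the Prometheus Python client (pip install prometheus-client).
# Metric and label names below are hypothetical examples.
from prometheus_client import Counter, Histogram, start_http_server

# You pick a metric type (counter, histogram, ...) and the specific
# dimensions (labels) you want to be able to slice on later.
REQUESTS = Counter(
    "http_requests_total",
    "Count of HTTP requests handled by this service",
    ["route", "status_code", "region", "client_version"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Latency of HTTP requests",
    ["route"],
)

def handle_search_request(region: str, client_version: str) -> None:
    # Record one request and its dimensions; a scraper (e.g. Prometheus or an
    # M3 collector) would pick these up from the /metrics endpoint.
    with LATENCY.labels(route="/api/search").time():
        ...  # application logic would go here
    REQUESTS.labels(
        route="/api/search",
        status_code="200",
        region=region,
        client_version=client_version,
    ).inc()

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    handle_search_request(region="us-west", client_version="android-v2")
```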
[00:11:18] Unknown:
And the tracing aspect of things is definitely relevant, particularly when you're working with distributed systems, which is a common problem domain in the data space, where you want to be able to understand what all of the interdependencies between these systems are based on just this 1 request. And also with the case of microservices and the advent of cloud native kind of expanding the availability of that as an architectural pattern, you might have 5 different teams, each supporting 5 different services. And unless you're the top level system architect, you don't necessarily know what all the interconnection points are. So if you're trying to debug a problem in a service that's nested deep in the stack, you need to be able to understand what are all the systems that it traversed and what were the actual function calls that happened along the way to be able to even comprehend what might have gone wrong.
[00:12:14] Unknown:
Right. Yeah. That's where tracing is great at helping you orient around a problem and kinda start to understand the complex relationships and code paths that are executing, in a very black box way. Yeah. I think, like, as we obviously continue to develop more services and more products on top of cloud native infrastructure and systems in general, yeah, it's going to be a fundamental building block that we're all gonna be pretty reliant on in the near future.
[00:12:47] Unknown:
In terms of the actual metric storage, I know that, as I mentioned earlier, you helped to create the M3 storage engine at Uber to be able to handle the scale of metrics that you're trying to deal with. And I know that there are a large number of different time series databases that are on the market. They all have different optimizations that they build in and different target use cases. I'm wondering if you can just describe what were the pieces that were missing in the overall ecosystem of available time series storage engines that made you feel that it was necessary to build a new 1 from scratch to be able to solve your particular problems?
[00:13:25] Unknown:
Great question, and something that was not an easy decision to make. As you mentioned, there is so much out there today accessible for collecting and storing time series like data. I guess my answer really starts with kind of understanding why there are so many different types of time series like or specialist databases out there right now. And I think the main reason we're starting to see this kind of, I'm not gonna call it explosion, but, you know, I think there's these ebbs and flows of, like, a new problem appears and multiple solutions kind of enter the market to try to solve that problem. Then there's a natural consolidation point as well after, like, a few start to have more experience with it. And, you know, I think much like some folks who reached for NoSQL solutions for almost everything that they were building a few years ago have naturally fallen back on more relational like databases, whether they're distributed or not, like Postgres or something more distributed like CockroachDB.
I think, like, similarly with, you know, the high volume of data that we're trying to utilize today, much like what we talked about just earlier, there are some of these specialist use cases appearing where there's a very real need for ways to solve those problems. And, you know, back when we were looking at time series databases at Uber, take the most obvious 1, Graphite, which was the central metric store for us for operational and system metrics as well as some, like, real time business metrics we looked at. It fundamentally didn't have the level of reliability we needed. And so, you know, if we actually wanted to expand the Graphite Whisper back end, we would have to take hard downtime on a subset of the metrics that we were collecting, because there's no, like, Kafka, you know, in that pipeline that could, like, buffer the data off for you and then write it when that system is back again. And there's no replication, so it's, like, fundamentally, a single replica of the data. When you're trying to expand 1 region of that metric space, you take that whole percentage of that metric space offline, both for reading and writing. So I think the problem that we faced was how to reliably store telemetry data in a way that wasn't gonna have hard downtime, that was more reliable, and could also service the growing set of cardinality that we faced during our move from, like, physical on prem processes to containerized workloads, which generates just so many more smaller units of compute that have to be tracked as well, forming much higher levels of relationships between the metrics, because now, you know, your lowest granularity, instead of, like, a large physical server with a host name and 48 cores, it was a container with 2 CPUs. So you would naturally kind of, like, have 24 times the number of logical units of compute. They were just smaller. But you still needed to track the metrics to each 1 to kinda make sense of how things were, you know, operating as well as the health and the units of work they were all doing individually.
So that was the major reason that led us to even just looking around at the problem space and seeing what else was out there. And then at the time, you know, I think, like, there was, much like there is now, a lot of different things trying to solve different problems. And none of them seemed focused on this computer software telemetry horizontal scale out story. So, for instance, like, you look at InfluxDB back in the day, that was much more of a general purpose time series database. It was more focused on, like, offering different things you could do with it rather than a horizontal scale out story. And if you look at, like, all the early versions of it, you know, some of the ways they described scaling out InfluxDB was you put, you know, a Kafka over here, and then you replicate that Kafka data between, like, multiple nodes, and then you partition the data in front of it. So it wasn't really, like, at that stage, you know, trying to solve horizontal scale out. ClickHouse was and still is today more based around processing, like, you know, a bunch of event like data. It's not really focused on metrics.
And so that, while it has some ways to scale out, also just fundamentally, kinda, like, wasn't really built for metrics as natively as other solutions are, like Graphite and Prometheus. And Prometheus was really interesting at the time because it had just kinda entered the scene in 2014 in the open source world. However, you know, it was fundamentally, and still to this day describes itself as, something that is very focused on the single node experience, and they're not focused on the horizontal scale out story of data at the individual Prometheus server level. And so, yeah, I mean, there were a bunch of other ones out there as well that I could explicitly mention, but what it really came down to was, like, this data needs real time access.
You know, if you can't monitor something that broke and get a signal on that within a minute of it breaking, then it doesn't matter, because that's the whole purpose you're using it for in some deployments. And so we really needed something that was fast for multidimensional data. And, you know, a lot of folks were kinda using Elasticsearch for that, but then, really, Elasticsearch is better again for structured events, not for metrics. And so we needed something that had a fast inverted index so we could do these multidimensional queries very quickly, that didn't lag and could do real time alerting, and that also was schema free. So unlike Druid and other things, you know, we needed it to be schema free to be able to let developers continue to instrument the code the way they were today without really thinking immediately about how that, you know, data was gonna be structured and stored in a database. They just need to be able to write a few lines of code and just assume that the backing telemetry store can deal with storing that and enable you to query for it later in some reasonable query pattern. So I would say it comes down to the fact we needed a schema free telemetry store that could support real time alerting and allow developers to get quick insights into how their software or business use case was performing in their code.
[00:20:33] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. And often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. Definitely wanna dig more into the concept of cardinality and the challenges that that poses, and the concepts and sort of best practices around data modeling, or sort of schema management or lack of management as the case may be, for this type of data and the workloads that it supports, where I know that kind of the death knell of a number of different metrics engines is high cardinality data, where you can maybe tag something with the host name and the metric name where it's, you know, foo host CPU 1 frequency.
But then if you need to then add additional information about the host and the application and the function call and the timing information, and you might have, you know, 8 levels deep of cardinality, or you might want to have a top level tag, but then have additional other tags or meta information associated with the metric that just causes all of these systems to buckle because of the fact, as you said, you need to be able to have them indexed and available in near real time. I'm wondering if you could just talk through some of the structures and the approach that you've built into m 3 for being able to support those types of workloads in this near real time need and in this largely write once read never type of workflow?
[00:22:51] Unknown:
That's something that, yeah, a lot of folks have to deal with, and it's not a fun problem to deal with. I think, like, when you first get started, as I kinda did when building out services for the dispatch team early on, you start to see how powerful just metrics in general are, both for, like, measuring, you know, features and, like, seeing how different code paths are performing, how many people are doing different types of call types, divided by which dimension, like, a type of product, or using a different type of, like, mobile operating system to call your back end and stuff like that. Like, with all of that, I think it's very easy to see the power of it, especially when you're using, like, a client library that lets you, like, instrument quickly and kinda see the data quickly.
But as soon as you run into a cardinality problem, you start to have to put in extra work to really understand, like, why have, you know, your queries for this type of metric data suddenly become extremely slow, and understanding the why is something that, you know, kinda detracts from the magical experience you had of just being able to instrument and get answers about things quickly. The schema free thing is, you know, really attractive, but then you run into this cardinality problem quickly when, like, you don't think about how adding an extra dimension to the metrics can explode the space of the different permutations of types of data that you'll actually wanna look at now and query over.
So I think the way that Chronosphere, using M3 as well, you know, behind the scenes, kind of tries to face this problem domain is shifting it away from you having to fully understand why that cardinality problem exists. You know, the first few things that we try to do there is give you the levers at least to correct certain telemetry signals that have too many dimensions on them. And so the example I really like to use here is you have some kind of web app, and you run it in 2 regions in a cloud provider, say, and then you have 2, say, mobile clients, or that could even just be different, you know, JavaScript bundles, so, like, web app versions, that both access that back end. And you could imagine that 1 client version, you know, hits, like, a front end service and then talks to an API server, and that talks to MySQL to get some results for, say, a search result. And then you may have another client app, so let's say Android versus iOS, where maybe the Android version also has, like, a support chat feature to it, which talks to Redis to get, like, the latest messages for you. So this iOS app is calling your web app accessing MySQL, and this Android app is accessing the web app but talking to both MySQL and Redis, because MySQL is serving the search results, and Redis is seeing if there's any new messages to display in the app. And so, typically, you can imagine, what if your Redis fails in this app, but it only fails in 1 of the regions that you're operating in, so say US West instead of US East?
So the best way to monitor your software is to really observe what the user is seeing. Right? So because, like, you could honestly monitor thousands of things that are happening in your system, but the ones that really matter are the ones that, like, literally cause, like, an error bubble to show up, like, on your end user's device. Right? So monitoring from the edge, you know, you probably get an alert that you're seeing some internal server errors or 500 responses being returned from your web app. To actually start debugging this problem, you need to know the HTTP route, so, like, the API slash search HTTP route. And you need to know the client version, the fact that it's an Android client, and you also need to know what region it's in, which is, like, US West.
You start to add up, like, some of these dimensions here. You wanna search on the HTTP route, and maybe you have, say, a 100 HTTP routes that your web app, you know, which has been around for a year or 2, now has. You wanna alert on the status codes. So let's say there's, like, 5 major status codes, you know, 2xx, 4xx, what have you. Maybe you're running in, you know, 6 to 12 regions because you're deployed close to where these devices are running, and maybe you have, like, 40 different client app versions, because maybe it's not just iOS versus Android. It's actually the actual version number as well as the platform that's calling you. So these are the type of dimensions where you want to start to think, like, hey. Oh, it's the search endpoint. Oh, it's Android v2. Oh, and it's in US West. I'm also seeing that Redis is unhappy. It's correlated to this code path. So, anyway, to get that level of dimensionality on, like, how your requests are performing at the edge, say you had some of those numbers I was talking about, a 100 endpoints, 5 status codes, 12 regions, say, and 40 client versions.
That's 240,000 unique time series, and so that's a pretty high cardinality metric that we're talking about, and that's just capturing status codes at the edge of your entire web back end tier, so at the edge of where your environment finishes. But, you know, there's things you can do to make that quick, which is, like, remove dimensions off of it, but then that makes, you know, the alert or the data that you're looking at much less valuable too. And so, also, imagine if, you know, you wanna put in the country where the user is calling from. Perhaps you have some different logic based on the user's country. You know, that's a multiplier of 250. So now you're multiplying 240,000 time series by 250. And so I guess, like, it's very easy to get into these, like, cardinality explosions, and that's kind of, like, 1 of the better concrete examples that I like to talk about.
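As a quick sanity check on those numbers, here is the back-of-the-envelope math from that example as a runnable sketch; the counts are the illustrative figures from the conversation, not measurements:

```python
# Back-of-the-envelope cardinality math from the example above.
# One unique time series exists per distinct combination of label values.
routes = 100           # HTTP routes on the web app
status_codes = 5       # 2xx, 3xx, 4xx, 5xx, ...
regions = 12           # regions the edge tier is deployed in
client_versions = 40   # platform x app-version combinations

series = routes * status_codes * regions * client_versions
print(series)  # 240000 unique time series for a single edge metric

# Adding one more dimension multiplies the whole space again:
countries = 250
print(series * countries)  # 60000000 -- a 60 million series explosion
```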
And a lot of people, I think, have derived a high amount of value from kind of adding these. Like, at some point, it gets a bit ridiculous adding on every nth degree dimension, but some of these are not, you know. Like, we talked about HTTP route, status code, region, a client version, and maybe, like, some data about the market that you're running in. It's not like a crazy ask to have those kind of dimensions. So, yeah, moving a little bit onto, you know, how we think about solving this, it's really about kind of giving users the levers to instrument first, not worry too much about cardinality, and then be able to actually make sense of the data later. So 1 of the big things that Chronosphere and M3 do is, you know, provide, in front of the time series database, a streaming metrics aggregator. And you can think of that similar to platforms like Apache Storm or Flink. And what it's doing is it's transforming the data as it's being emitted, and doing that by, you know, acting on these messages in memory and then computing aggregates and then passing those aggregates on to the time series database. So you can imagine that by default, a lot of this data that I even talked about just then might have, like, a container value on it. Now if you have, like, a 100 containers deployed, now you're taking that 240,000 time series, and you're multiplying it by a 100 again to get into the many millions.
So sometimes people say, I want that level of granularity, but maybe for latency or for other types of data, I don't actually need the container level instrumentation. I want the container level CPU and stuff like that, but I don't need, like, the crazy cardinality on the request metadata. It's really easy in our platform to kinda, like, profile the metric stream as it's coming in. We do weighted sampling, we use reservoir sampling, the weighted algorithm, to kinda show you, like, what the stream of metrics coming in by its dimensions is, and it lets you, like, group by different tags, other dimensions, and then kinda slice and dice by the highly unique ones or the ones that appear on everything with a low frequency of unique values, so you can kind of, like, divvy up the metric stream. And so that lets you kinda, like, see the shape of your data, and then, from a few clicks from there, you can start to create pivots on that. So derived aggregated metrics that at streaming time are pulled in and aggregated and give you much, much faster roll ups on these views, so that you can both alert on them but also, like, you know, look back 30, 60, 90 days or a year, and the graph actually loads instead of having to go process, you know, hundreds of terabytes of raw data across those millions and millions of unique time series to actually give you a response there.
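As a rough illustration of that streaming rollup idea, here is a minimal sketch of dropping a high-cardinality dimension and pre-aggregating before storage. This is not the actual M3 aggregator API, and the label names are hypothetical; it only shows the shape of the technique:

```python
# Minimal sketch of the streaming-aggregation idea described above: strip a
# high-cardinality dimension (here, "container") from incoming samples and
# roll the values up before they ever reach the time series database.
# Purely illustrative; not the M3 aggregator's actual API.
from collections import defaultdict
from typing import Dict, Tuple

DROP_LABELS = {"container"}  # hypothetical rollup rule: drop the container ID

def rollup_key(labels: Dict[str, str]) -> Tuple[Tuple[str, str], ...]:
    """Key a sample by every label except the ones the rule drops."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in DROP_LABELS))

def aggregate(samples):
    """Sum counter increments into one series per rolled-up label set."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[rollup_key(labels)] += value
    return dict(totals)

# 100 containers each reporting the same route/status collapse into 1 series.
incoming = [
    ({"route": "/api/search", "status": "500", "container": f"c{i}"}, 1.0)
    for i in range(100)
]
print(aggregate(incoming))
# {(("route", "/api/search"), ("status", "500")): 100.0}
```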
So a lot of what we think about with, like, M3 and Chronosphere in general is about coming back to that problem of, like, to give meaningful fast results, just like I was talking about on that architectural website I worked on ages ago, this data has to be aggregated to be useful, and you also need to find the slivers, you know, of those time series quickly. It's not fast enough to grep through them like you would with log or structured events. The other thing as well is, like, if there are some industrial use cases you wanna onboard that are high cardinality and you wanna keep the raw data as well, you need to be able to horizontally scale that out quickly. So, you know, if you're kind of, like, pulling data into these silos and different monitoring databases that aren't connected to each other, it becomes really hard to ask a question over high cardinality data where the data lives in all these different systems. So much like the benefit HDFS, you know, brings to big data and, like, a data lake typically brings to businesses by kind of centralizing that data and being able to reason on it in 1 place, we wanted that ability by making, you know, M3 and Chronosphere a scale out story for you, so that, you know, if you did want to quickly double the capacity of time series data you wanted to actually look at, you could. You could also use rules so that you could keep some of them for longer than others, because that was another thing. Like, you really wanna be able to pull levers and say, like, this data, I care about these dimensions.
This other data, I care about keeping raw, but only for x number of days. So all those use cases that I've kind of just talked about there weren't really available on the market. And to this day, you know, I still think of these all as, like, fundamental building blocks to be able to make sense of data at this level of scale. And that's what we're all about, and that's kind of why we're here.
[00:33:58] Unknown:
Digging more into the Chronosphere platform itself, can you talk a bit about how you've architected that and some of the features and capabilities that you've built on in addition to what M3 offers out of the box?
[00:34:11] Unknown:
Chronosphere is kind of solving more of that mission, like I just talked about. M3 really is, you know, a piece of infrastructure that is the building blocks for doing what we're talking about, which is being able to make use of an increasingly higher level of chatty data that can give you much more interesting answers than it could before because it supports arbitrary dimensions. The Chronosphere product in general is about adding smarter rate limiting in front of that data stream. So sometimes you wanna, you know, kind of customize these metric use cases yourself, and then other times, you just wanna kind of use the platform as if it was an unlimited resource. But then when, you know, you do something that is an extreme explosion of data, you kinda want the system to just look after itself. So what Chronosphere Cloud brings to the table is intelligence to kind of describe your organization that's kinda collecting this data. And that can just be as simple as, like, every metric that's coming from team A has team A's tag or label on it, and team B has team B's label on it. But kind of, like, Chronosphere is, like, highly aware of that contextual link that you build and then can essentially say, oh, okay. Well, if 1 of the applications that team A owns is starting to, like, emit way more data than it did on a Friday afternoon when some engineer, you know, deployed it, we're gonna clamp down automatically on just that kind of metric family on that application for team A, and team B is not gonna be interrupted.
Most of team A's applications won't be interrupted either. You know, ideally, we'll just be dropping data from this kind of new cardinality explosion that happened from that team over there. And then, you know, there's a lot of, like, the management side of things that goes on as well. All of the features that Chronosphere gives you are source controllable. So, you know, all of your alerting definitions, all of your aggregations that you're using with our metric streaming aggregator can be defined locally. And we have a Terraform provider, so all your alerts can also be declared in something like Terraform. So a lot of it's about, like, making it a service that fits very neatly into your engineering workflow and developer workflow and just allows everyone to treat it almost like, you know, the Kubernetes of monitoring, essentially.
We have a command line tool and other things out there to kind of, like, help automate a bunch of this stuff that your SREs and other, you know, developers that are helping you set things up do locally. Yeah. A lot of it's around information management. A lot of it's around the ergonomics of, like, running at this level, like, collecting this much data. There's a lot more that we're experimenting with, of course, as well, with kind of showing you a trace view of this data with 1 click from a dashboard and a data point, and, you know, getting you from 0 to somewhere much quicker than is typical.
[00:37:20] Unknown:
Digging more into the actual use case of metrics, coming from somebody with an operations background, my automatic default is thinking about systems and application metrics, which we've been talking about so far. But what are some of the other types of metric sources that people are working with, particularly if they're in a data engineering or data science or machine learning context?
[00:37:42] Unknown:
So some of the more interesting things, you know, that we saw at Uber being used with metrics were things that, I guess, you typically wouldn't have thought would be stored in a system metric store. So for instance, there were folks developing features that kind of measured how long it took to process, like, a user's request. I mean, a given request to, you know, like, call an Uber at an airport, and folks could kind of just decorate their code with, like, timing information and then quickly be able to, like, graph that and give that back to the operations folks, kinda, like, at the company.
You know, typically, like, you could build a feature like that by, you know, capturing how long that dispatch took, saving that to a MySQL database or some other data warehouse, and then kind of, like, doing a more typical MapReduce job. But, a, it doesn't give you the data in a very real time nature, and, also, b, you have to go through the entire process of data modeling for that event and kind of, like, making sure that gets into your local company's data pipeline, whether that's using Kafka. You know, you have to choose, like, a schema if you're using Avro or some other structured format to describe that event. And then, of course, instead of getting a graphing view, a lot of the time you're stuck with, like, getting the SQL output or some kind of MapReduce output of, like, a Hive query for that data. And so the turnaround time on getting some of those answers and then also being able to monitor on that data was just a lot quicker with metrics.
So, for instance, like, the mobile app experiments were a lot easier to kind of monitor in an aggregate form in the metrics and monitoring store than it was in the typical data warehouse for getting, like, responses on, hey, I'm rolling out a new experiment. I just turned it on in these, like, in these geo areas or with these client versions in the beta channel. Like, you know, did things get slower in terms of request latency? Did certain events like help support tickets being created go up or down? So you can kinda, like, measure core KPIs and drill down on a per experiment basis on, like, you know, how is that experiment doing? Did that degrade things? Like, was there likely a bug with that experiment that wasn't showing up in any back end metrics, but you could see from a divergence in the front end, you know, how the product was being used in the mobile app, and do that kind of in real time without waiting for a data pipeline to kind of process all these events and give you anything meaningful a few hours later. So, you know, both of those use cases are fairly interesting. You'd kinda typically think of them more obviously being solved with a more typical, like, pure data stack, but there was a lot of value in replicating that data, or even sometimes starting out in the metrics and monitoring store, before being modeled, you know, to a high level of detail and then stored somewhere else for more, like, deep analytical use cases.
Those are 2 that I think are definitely interesting. The other 1, of course, which you kind of alluded to, was, you know, the metrics and monitoring store being used for kind of measuring the performance of machine learning models. Both the models and the features, you know, tracking things like their availability, capacity, utilization, staleness, checking, like, the online feature serving, how long that took, how much throughput of errors, or other kinds of signals thrown off about how accurately the models themselves thought they were performing. Things like that were really interesting and gave a lot of quick turnaround time for people that were iterating on machine learning models and stuff like that. And then, you know, there's the ability for us to just, like, use a scale out model and just add servers to this kind of consumer grade back end. Right? Like, M3 and Chronosphere are really just about being able to look at everything as something that can just use consumer grade, like, hardware to just scale out quickly and cheaply.
And so that kind of let us run things like monitoring how different surge algorithms would actually perform. And so while surge algorithms were running for different geo hexagons, like, a few different versions of what the surge algorithm could be tweaked to do would run at the same time and then emit their values as a discrete, like, float value into the metrics monitoring system. And, therefore, you could kind of, like, go and see how surge would have been at different points in time using a different algorithm. You know, similar to the machine learning model case, just having that ability to kinda say, okay, I'm gonna, like, purchase a few more, like, compute instances here and collect this data. I only wanna keep it around for, like, 7 days or, like, a few days. But, you know, just being able to scale that out and then scale that back in really easily was super valuable.
So, you know, the kind of, like, surge algorithms we're talking about here ran over any kilometer wide hexagon they could touch. So we're talking about tens of millions of hexagons. And then if you divide, you know, that number by 60, because you're collecting a value every 60 seconds, and there's only 10 or 20% of these hexagons ever computing a value, you get into the hundreds of thousands per second range, which is fairly easy. Like, a single M3 node can do hundreds of thousands of time series per second at the collection interval. So now that you have the ability to, like, quickly scale up and add capacity to collect this kind of data, like, the kind of things you could do and the ways that you could look at it, especially using open source tools like Grafana, become a lot more accessible. Rather than, typically, what you would be doing in other platforms a lot of the time is, instead of being able to graph this stuff, you'd have to write a large job, maintain a pipeline, and do a lot of work to kind of, like, visualize or make sense of those results as well, even once you got it in through that pipeline.
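For a rough sense of that write rate, here is the same back-of-the-envelope arithmetic as a sketch; the exact hexagon count and active fraction are illustrative stand-ins for the "tens of millions" and "10 or 20%" figures mentioned above:

```python
# Rough version of the write-rate arithmetic above (numbers are illustrative).
hexagons = 50_000_000       # "tens of millions" of kilometer-wide hexagons
active_fraction = 0.2       # only 10-20% ever compute a surge value
interval_seconds = 60       # one value emitted per active hexagon per minute

writes_per_second = hexagons * active_fraction / interval_seconds
print(f"{writes_per_second:,.0f} datapoints/sec")  # 166,667 datapoints/sec
```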
[00:44:03] Unknown:
On the side of actually reading the data back out, there's the fact that a lot of the time when you're issuing a read, you're only going to be interested in a small subset of the overall number of metrics that are written. But also, the reading operation isn't necessarily going to have a human behind it, because a lot of the use cases for storing metrics are for monitoring and alerting, where you want to be notified if there are certain patterns in the time series, or if you run over certain thresholds in a given metric, or if there are anomalies in 1 or a grouping of metrics.
And I'm curious how you handle addressing that with low latency and low enough latency that actually getting an alert is meaningful and that you can respond to it in the context and in the timeline that is going to be reasonable for making sure that the system stays healthy and so that an end user either doesn't experience an outage or that their experience of any sort of outage or downtime is minimal?
[00:45:03] Unknown:
Being able to do that, to react to an event quickly or be able to roll back from a problem quickly even when you're deploying extremely complex software, is the major reason why most people, you know, invest in their monitoring and observability infrastructure. You just couldn't really build systems this complex, you know, with this many kind of different accesses and code paths, like, 10 years ago, because, like, you just would have no way of even working out what's going on unless you waded through tons and tons of log data. And even then, like, the tools weren't good enough, so you couldn't even look at any of that data quickly enough or in any meaningful way.
So, yeah, that is the majority of the challenge of kind of, like, finding these subslices of time series you care about and being able to alert on them in very near real time, like seconds after the events happen. And so, you know, a lot of what goes into that for us, it's both, like, solving that at a technological layer, so, like, designing and architecting things so that they can do very high frequency evaluation on this data. And then it's also, of course, about, like, operational excellence. You know, if you have a monitoring system that's going down all the time, then you're really only getting so much coverage.
And, yeah, you're kind of flying blind for significant periods of time. I think of the 2 axes in terms of, like, being able to do it at scale and then also being reliable enough that I can actually be dependent on it, that it will be guaranteed to tell you when there's a problem, even if it's, like, this tiny thing experiencing an issue in a pretty large set of infrastructure. And so, you know, some of the things that make this all possible for us is, obviously, we replicate the data 3 times so that you can lose, you know, multiple instances of all your data and everything keeps humming along. So that doesn't mean that you just have a, you know, a gap in your monitoring. You have to actively lose multiple availability zones in a compute region that's storing your monitoring data to experience any sort of real outage, because the replication takes care of that for you. And then the other 1 is, you know, for actually being able to alert on these quickly, it's about having a fantastic reverse index.
Combined with a streaming aggregator that can really compress those signals that you need to look at. Like, it's not economical to just process, you know, the huge amounts of data that you're ingesting all at, like, query time. For instance, as I just said, like, the example we walked through was you have an internal server error being returned from an edge node to a single mobile phone; you don't really care which container that came from when you just wanna be notified about the problem. Right? Like, you may care about what that container was as soon as you do get notified. But for the notification to get triggered, the internal monitoring system can peel that dimension off of your metrics. And so, you know, what we do is obviously make it possible for, like, looking at the query, working out which dimensions you're actually querying on, and then forming an aggregate view of that, so that at least for determining whether the alert is being tripped or not, we can look at much more aggregated versions of this data, which has far fewer distinct time series that need to be loaded and evaluated very frequently.
And so the very fact that we're doing that in memory and aggregating that on the way into the time series database means that we have a lot of bandwidth on the time series database itself to do a large amount of evaluations. So the other way to aggregate this data is, obviously, like, take all those container ID data points of those containers after they've been stored in the time series database and then kind of squash them together into the aggregate view. But the problem is you actually need to issue a query against the time series database to do that. So it's really about squashing data on the way in to the sets of dimensions you care about when you set up an alert. And so that gives us far fewer distinct series, and then that reduces the load on the time series database, because now all you need to look for is, like, that much smaller set of time series and evaluate against that. And then the fact that we're doing the aggregation on the way in means that you could have, like, hundreds of thousands. And at Uber, we had 150,000 alerts set up that, like, could notify you within seconds of when, you know, an event happened, all set up against this database.
And so those are kind of, like, the major architectural ways that we get to, you know, having a highly reliable system that's always up, that can also collect a lot of these signals at very high cardinality, but then give you an answer about when there's kind of, like, an expression that's tripping. And then from there, you can go and look at the raw time series very easily by just clicking on that alert expression and seeing the raw underlying data and see which, like, container is misbehaving. And then, of course, like, to actually get to those few distinct time series that we're talking about, that reverse index and it being scalable is incredibly important, because it allows you to take a multidimensional set of values, map that into a postings list, and then pull, like, the exact time series for the metric query, rather than having to go and do, like, a scan or a search of, like, a sorted string table or anything like that. So it's kind of like the very opposite of a pure column data store. You're really marrying an inverted index, which is much like what Apache Lucene and Elasticsearch are built on top of, with a column store that has a highly compressed set of the time series, using the actual inverted index to go and quickly find which 1 of those, like, billions of time series to kinda present based on your multidimensional query.
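To illustrate that postings-list lookup in the abstract, here is a toy sketch of an inverted index over series labels; it is purely conceptual (in-memory Python with hypothetical label names) and not M3's actual index implementation:

```python
# Toy illustration of the inverted-index lookup described above: each
# (label, value) pair maps to a postings list of series IDs, and a
# multidimensional query is just an intersection of those lists.
# Conceptual sketch only; not M3's actual index structure.
from collections import defaultdict

class SeriesIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (label, value) -> set of series IDs

    def index(self, series_id: int, labels: dict) -> None:
        for pair in labels.items():
            self.postings[pair].add(series_id)

    def query(self, **labels) -> set:
        """Return series IDs matching every given label exactly."""
        lists = [self.postings[pair] for pair in labels.items()]
        return set.intersection(*lists) if lists else set()

idx = SeriesIndex()
idx.index(1, {"route": "/api/search", "status": "500", "region": "us-west"})
idx.index(2, {"route": "/api/search", "status": "200", "region": "us-west"})
idx.index(3, {"route": "/api/chat", "status": "500", "region": "us-east"})

# "Which series are 500s on /api/search?" -> {1}, without scanning all series.
print(idx.query(route="/api/search", status="500"))
```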
[00:51:02] Unknown:
Yeah. It's definitely an interesting problem domain and a lot of complexity to dig into. It's definitely interesting to see the number of different ways that this problem has been addressed and attempted to be solved and how each time there's been a sort of architectural shift in the ways that people are building and deploying their applications, it leads to another generation of metrics engines and time series engines that are needed to be able to keep up with the growing complexity and the number of different sources of information and consumers of that information.
[00:51:37] Unknown:
Yeah, definitely. How we're solving it is really just trying to give you the magical experience that you get when you first use these tools. But solving some of these fundamental questions we're asking of our software can be achieved in multiple different ways; we're obviously optimizing one experience. Over time, monitoring in general will change, much like civil engineering has certain ways things are done that have been codified over a very long period of time.
Software engineering is codifying how we think about and operate when building systems. Monitoring and observability is fascinating, and it's why I'm happy to be doing this for 10 more years: it is a core pillar of how we build systems, how we will continue to build systems in the future, and one of the most fundamental building blocks for actually building a system that works.
[00:52:41] Unknown:
And in terms of the actual use cases for Chronosphere and M3, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:52:53] Unknown:
We have customers doing things like collecting a whole bunch of telemetry data about physical storage systems, so there are definitely IoT use cases. We also have customers like Tecton, who I believe was on your podcast just a few episodes ago. If you visit our website, you can see their write-up on how they use Chronosphere, which is basically for monitoring the feature store that they run and for providing metrics to the end users of their machine learning platform. That is definitely a fascinating use case.
I've also seen M3 used for things like storing a whole bunch of telemetry data from self-driving vehicles, which I thought was pretty interesting. There are a lot of things you want to inspect about a dataset and make decisions on very quickly that only a system giving you very low latency access to time series data can do. So those are a few of the use cases. But in general, and I'd love to hear your thoughts on this since you're obviously experienced in this space of running software and reasoning about it, I feel there are a lot more higher-level concepts and signals about the very code people are writing today going into things like metrics than there were before.
That's leading to some really interesting things, like automated rollback of systems and telemetry about how things are communicating with each other outside of the data center. Some of those are super exciting; they're more generic software monitoring patterns that are starting to appear. But do you have any thoughts on how that's evolving?
[00:54:53] Unknown:
Yeah, I definitely think that the availability of metrics engines, particularly as a service, has driven a lot of adoption around actually instrumenting applications. With the growth of containers, the corresponding shrinking of service sizes, and the rise of things like OpenTracing, a lot more teams are starting to adopt things like data sampling and request sampling to understand how their systems are communicating with each other and how they are interacting with external dependencies, which used to be more of a black box, and to bring in more information from things like the database to understand what the latencies and query times are and how to optimize the code paths there.
There's definitely a lot of opportunity to bring that information into the development life cycle so that you can improve your product without wasting a bunch of time trying to figure out what is actually happening, why certain code paths are slow, and digging through code. As you said, you can just throw a metric on something, release it, and quickly get feedback on its behavior. There's been a much bigger focus on observability as a first-class concern of building software systems than there was 10 or 15 years ago, and I think that's a positive trend and one that I'm happy to see continuing.
[00:56:24] Unknown:
Yeah, most definitely. One of the major things we're experimenting with now is native trace storage in M3, which I think is really interesting. Typically, back in the day, when a certain user was experiencing a problem, you would start a giant grep on all your logs across your entire distributed system, and finding the logs from each system was painful because you couldn't run a query across all systems for that user's logs. But with things like tracing, you can index certain fields, not everything about a trace, but the things that are important to you, like user ID, and reliably get a trace back from that, because you're using not just plain sampling but some kind of tail-based sampling strategy.
All of this is going to be very meaningful for changing how we actually perform the day-to-day tasks of building systems. I'm definitely excited to see that development.
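As a minimal sketch of the tail-based sampling idea mentioned above, and only as an illustration (the thresholds and data structures here are assumptions, not how M3 or Chronosphere implement it), a collector could buffer a whole trace and decide to keep it only after seeing the tail, for example when any span errored or the end-to-end latency was unusually high:

```go
package main

import (
	"fmt"
	"time"
)

// Span is a minimal stand-in for one span in a distributed trace.
type Span struct {
	Duration time.Duration
	Err      bool
}

// keepTrace makes a tail-based sampling decision: the whole trace is
// buffered until it completes, then retained if any span errored or if
// the total latency exceeded a threshold. Head-based sampling, by
// contrast, has to decide before seeing any of this information.
func keepTrace(spans []Span, slow time.Duration) bool {
	var total time.Duration
	for _, s := range spans {
		if s.Err {
			return true
		}
		total += s.Duration
	}
	return total > slow
}

func main() {
	trace := []Span{
		{Duration: 40 * time.Millisecond},
		{Duration: 900 * time.Millisecond},
	}
	// The slow trace is kept even though no span returned an error.
	fmt.Println(keepTrace(trace, 500*time.Millisecond))
}
```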
[00:57:27] Unknown:
And in terms of your experience building the Chronosphere business, building out M3 and contributing to it, and just running this metrics system, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:44] Unknown:
A lot of it is wanting to do everything under the sun. Obviously, that's impossible, so it's really about deciding what is important, solving those most important things every day, and being religious about solving the things that matter most. It's sometimes easier to work on a problem that's interesting rather than the problem that's most important. At Chronosphere, we're really focused on making sure that everyone is a master of their domain, that all our teams are strong teams that can function independently, and that everyone is empowered to autonomously deliver and work on the system. That has honestly been the most important thing, as much as I still want to get my hands dirty with a little bit of M3.
So it's just been a rigorous prioritization function that we apply every day. Everyone's going through a different world with the pandemic as well, shifting their lives around it, and that's been an interesting thing to navigate while also dealing with a new company and a baby that was born two months into the pandemic, alongside the rest of my family. It's been a very interesting time with tons of challenges, but I couldn't imagine myself doing anything else.
[00:59:21] Unknown:
Congratulations on the new arrival and on the work that you've been doing on the business.
[00:59:26] Unknown:
Thanks, Tobias.
[00:59:27] Unknown:
And for people who are looking at a system to store, analyze, and alert on their metrics, what are the cases where Chronosphere is the wrong choice?
[00:59:38] Unknown:
I would say that right now, most of the folks working with us deal with a minimum volume of tens of thousands of metric samples per second. So anyone running at 10,000 samples per second or less, unless they're planning for very significant growth right around the corner, would probably have a challenge justifying us. The way we view things, a lot of cloud native applications and systems being built today work best with Prometheus-style metrics. While there are plenty of other well-established vendors out there, like Datadog, I think using cloud native technologies such as Prometheus and Kubernetes makes a lot of sense, and we're obviously one of the more compatible options there. So if you're in that stack and you're experiencing growth anywhere from 10,000 metric samples per second and up, it's definitely worth having a chat with us.
[01:00:49] Unknown:
And as you continue to build out the business and the platform, what do you have planned for the future?
[01:00:55] Unknown:
In my mind, we're never going to be finished with solving monitoring for a constantly evolving, complex world. I would also love to get into the space of being able to go from a data point on a graph to a line of code and explore what's going on with that code path across all your applications, and how it relates to other code paths. I think Sourcegraph is a fascinating tool that has a great reverse index on code symbols. And I'd love to see monitoring get to a point where we can be really intelligent about what you're doing and give you Google Now-style suggestions: hey, we noticed your database has a lot of open connections for the request rate you're doing. Do you need that many connections open? Because that could impact performance.
So these more deeply integrated, contextually aware features in the monitoring space, along with being much better integrated into the way you work on your code and your systems in general, both of those areas, I think, will be a large area of investment for us over time.
[01:02:11] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:25] Unknown:
In terms of data management today, I really think about it as the user experience aspects of both producing and then viewing and harnessing that data. One of the largest problems is the very cookie-cutter approach to how people do that task in general. I think schema-free metrics are really empowering on the producer side, but when it comes to actually harnessing that data, there are still leaps and bounds to go. It should feel like a magical experience, and there's no reason it can't be. Much like tools have been progressing every 5 or 10 years, our systems are finally able to move huge amounts of data around; Snowflake can obviously clone an entire table in seconds now. I think there will be more movement on features like that for consumers of the data as well as producers, to really be able to manipulate, transform, categorize, and more natively think about and track that state. There will be huge improvements in this space.
[01:03:39] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Chronosphere and on the M3 database. It's definitely a very interesting problem domain and one that, as I mentioned, I'm very close to in my day-to-day work. So thank you for all of the time and effort you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Likewise. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Rob Skillington: Introduction and Background
Early Career and Experiences at Microsoft
Challenges in Monitoring and Observability at Uber
Building Chronosphere: Motivation and Goals
Handling High Cardinality Metrics
Chronosphere Platform Architecture and Features
Use Cases Beyond System Metrics
Low Latency Alerting and Monitoring
Future of Monitoring and Observability
Interesting Use Cases and Lessons Learned
Future Plans for Chronosphere
Biggest Gaps in Data Management Tooling
Closing Remarks