Summary
Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second; the information needs to be propagated to a central location, processed, and analyzed within milliseconds or single-digit seconds; and the consumers of the data need to be able to query it quickly and flexibly. As the systems that we build continue to grow in scale and complexity, the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
- Your host is Tobias Macey and today I’m interviewing Rob Skillington about Chronosphere, a scalable, reliable and customizable monitoring-as-a-service purpose built for cloud-native applications.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Chronosphere and your motivation for turning it into a business?
- What are the biggest challenges inherent to monitoring use cases?
- How does the advent of cloud native environments complicate things further?
- While you were at Uber you helped to create the M3 storage engine. There are a wide array of time series databases available, including many purpose built for metrics use cases. What were the missing pieces that made it necessary to create a new system?
- How do you handle schema design/data modeling for metrics storage?
- How do the usage patterns of metrics systems contribute to the complexity of building a storage layer to support them?
- What are the optimizations that need to be made for the read and write paths in M3?
- How do you handle high cardinality of metrics and ad-hoc queries to understand system behaviors?
- What are the scaling factors for M3?
- Can you describe how you have architected the Chronosphere platform?
- What are the convenience features built on top of M3 that you are creating at Chronosphere?
- How do you handle deployment and scaling of your infrastructure given the scale of the businesses that you are working with?
- Beyond just server infrastructure and application behavior, what are some of the other sources of metrics that you and your users are sending into Chronosphere?
- How do those alternative metrics sources complicate the work of generating useful insights from the data?
- In addition to the read and write loads, metrics systems also need to be able to identify patterns, thresholds, and anomalies in the data to alert on it with minimal latency. How do you handle that in the Chronosphere platform?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Chronosphere/M3 used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Chronosphere?
- When is Chronosphere the wrong choice?
- What do you have planned for the future of the platform and business?
Contact Info
- @roskilli on Twitter
- robskillington on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Chronosphere
- Lidar
- Cloud Native
- M3DB
- OpenTracing
- Metrics/Telemetry
- Graphite
- InfluxDB
- Clickhouse
- Prometheus
- Inverted Index
- Druid
- Cardinality
- Apache Flink
- HDFS
- Avro
- Grafana
- Tecton
- Datadog
- Kubernetes
- Sourcegraph
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode of the Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud scale infrastructure and applications. Datadog's machine learning based alerts, customizable dashboards, and 400 plus vendor backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in 1 place, you can easily improve your application performance. Try Datadog free by starting your 14 day free trial and receive a free t-shirt once you install the agent.
Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring. Your host is Tobias Macey. And today, I'm interviewing Rob Skillington about Chronosphere, a scalable, reliable, and customizable monitoring as a service purpose built for cloud native applications. So, Rob, can you start by introducing yourself?
[00:01:47] Unknown:
Hey, Tobias. Yeah. Thanks for talking today. My name is Rob Skillington. I'm the CTO over here at Chronosphere, which is a cloud native and scalable monitoring and observability tool. I'm looking forward to kind of talking about data monitoring and everything else.
[00:02:07] Unknown:
Do you remember how you first got involved in the area of working with data and dealing with metrics and logging and all of that fun stuff?
[00:02:16] Unknown:
As a developer, you know, going back to my very first, like, few jobs, observing and collecting data about, like, how your application or the data that you're storing is doing is obviously always a core competency. But I guess I got more involved when, you know, I was working on a reporting system that had to serve reports for different search terms for searches that were done on a kind of marketing inventory site for architectural products. And so this was a marketing firm, like, a marketing company that hosted several catalogs online for different products, and we built a lot of reports for them to understand how people were performing searches on the sites and what kind of terms led to this and that. And then, you know, just fundamentally, the amount of data that we had to collect and then aggregate to give any meaningful sorts of insights into that data was pretty interesting, and we did a lot of work to aggregate and roll up that data to make that product experience worthwhile.
That was something that I worked on while I was at university still. And then when I went to Microsoft in Seattle after leaving university, I kinda worked on monitoring for the Azure Active Directory team. You know, it was obviously on a whole different level of scale, and I kind of got to know a bit more about the pipelines there and how the different teams and different business units kind of, like, exchanged data in general. And then, obviously, I was actually kind of working on some monitoring and observability problems over there at Microsoft as well. So those were the first 2 times kind of dipping the toes in the water.
[00:04:02] Unknown:
And so I know that you also spent a chunk of time at Uber, which led to your cocreation of the M3 storage engine for metrics there. But before you dig into that, I'm wondering if you can give a bit of an overview about what it is that you're building at Chronosphere and your motivation for building a business around monitoring and particularly the focus on cloud native?
[00:04:23] Unknown:
So Uber was definitely where, I like to say, we were given the opportunity to really dig our fingers into a pretty thorny problem that the business was facing, which was operating at a level of scale with the amount of real time data that was flowing through the business and giving them the foundations to be able to make sense of that and also be able to reliably operate their systems. How that kind of happened was I joined the team after working on some payments infrastructure, and I followed a few friends along to Uber, which had a lot of similar problems.
When a payment system is down, you have a lot of very frustrated merchants because they're literally losing money every second that you're down, unable to run their business and take credit card transactions from their customers. And then at Uber, you know, the level of reliability needed was kinda similar to that. Like, if you were down in a city for more than 10 or 20 minutes, there were a lot of people out of work, and that caused very real brand impacts as well as, obviously, putting people, you know, out of their jobs for meaningful amounts of time in the day, which was just a huge problem. We just couldn't risk that kind of constant level of reliability problems. So after about 6 or 9 months of working on the marketplace matching system that I joined Uber for, you know, when we'd kind of rewritten the Node.js dispatching services into a more robust set of microservices for dispatch, I got the chance to kind of look around and see what else I wanted to help the business with. And metrics, both system level as well as operational level metrics that were used very widely by the rest of the company, was something where both the inaccessibility of really using them at the level that folks wanted to, as well as the amount of scale problems that the business faced to continue to make that a tool that was actually useful to developers at Uber, was something that I wanted to dive into and spend pretty much the next few years kind of solving there with my cofounder, Martin.
[00:06:52] Unknown:
In terms of the challenges that are inherent to monitoring, I know that there are a lot of usage patterns that are unique to metrics and log data that aren't necessarily represented in other types of data use cases. So I'm wondering if you can just talk about some of the complexities that arise because of the ways that monitoring information is used and the patterns around that and how the advent of cloud native environments and workflows complicates things further?
[00:07:22] Unknown:
I think that, basically, when you talk about, like, storing data and the amount of information that you store, a lot of people tend to think, perhaps IoT is massive scale. Perhaps, you know, like, exploring the data collected by lidar on a self driving car is massive. And while those use cases can be large in individual deployments, they're not at the scale of how much information an individual computer, even your laptop, can emit about itself at a very high frequency. Your mobile phone is running hundreds of processes and sending tons of data to the Internet every second.
And so it tends to actually be, like, outside of all these, you know, use cases that we're thinking are on the verge of, like, causing large volumes of data for us to collect and analyze. While that is true, software itself is probably the leading use case for generating information about how it's running. And the volume that that is at tends to be 1 of the highest volumes of information recording that we're seeing in the real world. At least, generally, I think most people are in agreement that just the level of scale and granularity that information can be collected at and looked at is rather massive.
And so, you know, as you kinda mentioned, how to even harness that is tough and difficult. There's so many intangible things about software, and so you kind of have to limit your scope to, you know, what kind of things do I want to actually pull out of this infinite sea of data that could be generated about a piece of software running on a device somewhere. And, yeah, today, we extract 3 main kinds of data. There's log like data, so something that you as an application developer wrote to capture an event that your program did that you wanna look at later. Metrics, which is more like a signal that could be kind of emitted from your program in a more type safe way, I would say, than logs. You know, you've really got to choose a type of metric, and then you've got to choose the different specific dimensions that you wanna be able to pivot on for that metric.
It's kinda less free-form than logs, but then by the very nature of you having to think about what that metric is to expose it, it can be more meaningful to you when kind of, like, looking at metric data versus log data, because you already put in the thought to describe kind of, like, what you're measuring with metrics. And then tracing is a very interesting kind of compound view of both that event data that's kinda happening throughout your system, but then tied back to an actual individual component that's performing it, and then being able to visualize the events for a given kind of request or a given kind of flow of a user using your system as it crosses the different component boundaries. You know, in the cloud native world, we think of most of the time those boundaries as bits of code executing between different microservices or back end services, and that's kind of where, you know, the component is sought to be captured at the granularity of an event being attributed to a component. But, you know, tracing can also be used, obviously, on a mobile device to look at, you know, a complex piece of iOS or Java software running on your Android or iPhone device as well, because you have many code libraries, obviously, that you call into from your application, and then, obviously, the operating system is doing things. So tracing is kind of a way of, like, looking at a call diagram across component boundaries, and those component boundaries are different based on what you're kind of observing.
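To make the metrics part of that concrete, here is a minimal sketch of what "type safe" metric emission with explicit dimensions can look like, using the Prometheus Python client as an example. The metric names, label names, and port here are hypothetical illustration choices, not anything specific to Chronosphere or M3.

```python
# Illustrative sketch: choosing a metric type and the dimensions to pivot on,
# using the Prometheus Python client (pip install prometheus-client).
# Metric and label names below are hypothetical examples.
from prometheus_client import Counter, Histogram, start_http_server

# You pick a metric type (counter, histogram, ...) and the specific
# dimensions (labels) you want to be able to slice on later.
REQUESTS = Counter(
    "http_requests_total",
    "Count of HTTP requests handled by this service",
    ["route", "status_code", "region", "client_version"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Latency of HTTP requests",
    ["route"],
)

def handle_search_request(region: str, client_version: str) -> None:
    # Record one request and its dimensions; a scraper (e.g. Prometheus or an
    # M3 collector) would pick these up from the /metrics endpoint.
    with LATENCY.labels(route="/api/search").time():
        ...  # application logic would go here
    REQUESTS.labels(
        route="/api/search",
        status_code="200",
        region=region,
        client_version=client_version,
    ).inc()

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    handle_search_request(region="us-west", client_version="android-v2")
```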
[00:11:18] Unknown:
And the tracing aspect of things is definitely relevant, particularly when you're working with distributed systems, which is a common problem domain in the data space, where you want to be able to understand what all of the interdependencies between these systems are based on just this 1 request. And also with the case of microservices and the advent of cloud native kind of expanding the availability of that as an architectural pattern, you might have 5 different teams, each supporting 5 different services. And unless you're the top level system architect, you don't necessarily know what all the interconnection points are. So if you're trying to debug a problem in a service that's nested deep in the stack, you need to be able to understand what are all the systems that it traversed and what were the actual function calls that happened along the way to be able to even comprehend what might have gone wrong.
[00:12:14] Unknown:
Right. Yeah. That's where tracing is great at helping you orient around a problem and kinda start to understand the complex relationships and code paths that are executing, in a very black box way. Yeah. I think, like, as we obviously continue to develop more services and more products on top of cloud native infrastructure and systems in general, yeah, it's going to be a fundamental building block that we're all gonna be pretty reliant on in the near future.
[00:12:47] Unknown:
In terms of the actual metric storage, I know that, as I mentioned earlier, you helped to create the M3 storage engine at Uber to be able to handle the scale of metrics that you're trying to deal with. And I know that there are a large number of different time series databases that are on the market. They all have different optimizations that they build in and different target use cases. I'm wondering if you can just describe what were the pieces that were missing in the overall ecosystem of available time series storage engines that made you feel that it was necessary to build a new 1 from scratch to be able to solve your particular problems?
[00:13:25] Unknown:
Great question, and something that was not an easy decision to make. As you mentioned, there is so much out there today accessible for collecting and storing time series like data. I guess my answer really starts with kind of understanding why there are so many different types of time series like or specialist databases out there right now. And I think the main reason we're starting to see this kind of, I'm not gonna call it explosion, but, you know, I think there's these ebbs and flows of, like, a new problem appears and multiple solutions kind of enter the market to try to solve that problem. Then there's a natural consolidation point as well after, like, a few start to have more experience with it. And, you know, I think much like some folks who reached for NoSQL solutions for almost everything that they were building a few years ago have naturally fallen back on more relational like databases, whether they're distributed or not, like Postgres or something more distributed like CockroachDB.
I think, like, similarly with, you know, the high volume of data that we're trying to utilize today, much like what we talked about just earlier, there are some of these specialist use cases appearing where there's a very real need for ways to solve those problems. And, you know, back when we were looking at time series databases at Uber, take the most obvious 1, Graphite, which was the central metric store for us for operational and system metrics as well as some, like, real time business metrics we looked at. It fundamentally didn't have the level of reliability we needed. And so, you know, if we actually wanted to expand the Graphite Whisper back end, we would have to take hard downtime on a subset of the metrics that we were collecting, because there's no, like, Kafka, you know, in that pipeline that could, like, buffer the data off for you and then write it when that system is back again. And there's no replication, so it's, like, fundamentally, a single replica of the data. When you're trying to expand 1 region of that metric space, you take that whole percentage of that metric space offline, both for reading and writing. So I think the problem that we faced was how to reliably store telemetry data in a way that wasn't gonna have hard downtime, that was more reliable, and could also service the growing set of cardinality that we faced during our move from, like, physical on prem processes to containerized workloads, which generates just so many more smaller units of compute that have to be tracked as well, forming much higher levels of relationships between the metrics, because now, you know, your lowest granularity, instead of, like, a large physical server with a host name and 48 cores, it was a container with 2 CPUs. So you would naturally kind of, like, have 24 times the number of logical units of compute. They were just smaller. But you still needed to track the metrics to each 1 to kinda make sense of how things were, you know, operating as well as the health and the units of work they were all doing individually.
So that was the major reason that led us to even just looking around at the problem space and seeing what else was out there. And then at the time, you know, I think, like, there was, much like there is now, a lot of different things trying to solve different problems. And none of them seemed focused on this computer software telemetry horizontal scale out story. So, for instance, like, you look at InfluxDB back in the day, that was much more of a general purpose time series database. It was more focused on, like, offering different things you could do with it rather than a horizontal scale out story. And if you look at, like, all the early versions of it, you know, some of the ways they described scaling out InfluxDB was you put, you know, a Kafka over here, and then you replicate that Kafka data between, like, multiple nodes, and then you partition the data in front of it. So it wasn't really, like, at that stage, you know, trying to solve horizontal scale out. ClickHouse was and still is today more based around processing, like, you know, a bunch of event like data. It's not really focused on metrics.
And so that, while it has some ways to scale out, also just fundamentally, kinda, like, wasn't really built for metrics as natively as other solutions are, like Graphite and Prometheus. And Prometheus was really interesting at the time because it had just kinda entered the scene in 2014 in the open source world. However, you know, it was fundamentally, and still to this day describes itself as, something that is very focused on the single node experience, and they're not focused on the horizontal scale out story of data at the individual Prometheus server level. And so, yeah, I mean, there were a bunch of other ones out there as well that I could explicitly mention, but what it really came down to was, like, this data needs real time access.
You know, if you can't monitor something that broke and get a signal on that within a minute of it breaking, then it doesn't matter, because that's the whole purpose you're using it for in some deployments. And so we really needed something that was fast for multidimensional data. And, you know, a lot of folks were kinda using Elasticsearch for that, but then, really, Elasticsearch is better again for structured events, not for metrics. And so we needed something that had a fast inverted index so we could do these multidimensional queries very quickly, that didn't lag and could do real time alerting, and that also was schema free. So unlike Druid and other things, you know, we needed it to be schema free to be able to let developers continue to instrument the code the way they were today without really thinking immediately about how that, you know, data was gonna be structured and stored in a database. They just need to be able to write a few lines of code and just assume that the backing telemetry store can deal with storing that and enable you to query for it later in some reasonable query pattern. So I would say it comes down to the fact we needed a schema free telemetry store that could support real time alerting and allow developers to get quick insights into how their software or business use case was performing in their code.
[00:20:33] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. And often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. Definitely wanna dig more into the concept of cardinality and the challenges that that poses, and the concepts and sort of best practices around data modeling, or sort of schema management or lack of management as the case may be, for this type of data and the workloads that it supports, where I know that kind of the death knell of a number of different metrics engines is high cardinality data, where you can maybe tag something with the host name and the metric name where it's, you know, foo host CPU 1 frequency.
But then if you need to then add additional information about the host and the application and the function call and the timing information, and you might have, you know, 8 levels deep of cardinality, or you might want to have a top level tag, but then have additional other tags or meta information associated with the metric that just causes all of these systems to buckle because of the fact, as you said, you need to be able to have them indexed and available in near real time. I'm wondering if you could just talk through some of the structures and the approach that you've built into m 3 for being able to support those types of workloads in this near real time need and in this largely write once read never type of workflow?
[00:22:51] Unknown:
That's something that, yeah, a lot of folks have to deal with, and it's not a fun problem to deal with. I think, like, when you first get started, as I kinda did when building out services for the dispatch team early on, you start to see how powerful just metrics in general are, both for, like, measuring, you know, features and, like, seeing how different code paths are performing, how many people are doing different types of call types, divided by which dimension, like, a type of product, or using a different type of, like, mobile operating system to call your back end and stuff like that. Like, with all of that, I think it's very easy to see the power of it, especially when you're using, like, a client library that lets you, like, instrument quickly and kinda see the data quickly.
But as soon as you run into a cardinality problem, you start to have to put in extra work to really understand, like, why have, you know, your queries for this type of metric data suddenly become extremely slow, and understanding the why is something that, you know, kinda detracts from the magical experience you had of just being able to instrument and get answers about things quickly. The schema free thing is, you know, really attractive, but then you run into this cardinality problem quickly when, like, you don't think about how adding an extra dimension to the metrics can explode the space of the different permutations of types of data that you'll actually wanna look at now and query over.
So I think the way that Chronosphere, using M3 as well, you know, behind the scenes, kind of tries to face this problem domain is shifting it away from you having to fully understand why that cardinality problem exists. You know, the first few things that we try to do there is give you the levers at least to correct certain telemetry signals that have too many dimensions on them. And so the example I really like to use here is you have some kind of web app, and you run it in 2 regions in a cloud provider, say, and then you have 2, say, mobile clients, or that could even just be different, you know, JavaScript bundles, so, like, web app versions, that both access that back end. And you could imagine that 1 client version, you know, hits, like, a front end service and then talks to an API server, and that talks to MySQL to get some results for, say, a search result. And then you may have another client app, so let's say Android versus iOS, where maybe the Android version also has, like, a support chat feature to it, which talks to Redis to get, like, the latest messages for you. So this iOS app is calling your web app accessing MySQL, and this Android app is accessing the web app but talking to both MySQL and Redis, because MySQL is serving the search results, and Redis is seeing if there's any new messages to display in the app. And so, typically, you can imagine, what if your Redis fails in this app, but it only fails in 1 of the regions that you're operating in, so say US West instead of US East?
So the best way to monitor your software is to really observe what the user is seeing. Right? So because, like, you could honestly monitor thousands of things that are happening in your system, but the ones that really matter are the ones that, like, literally cause, like, an error bubble to show up, like, on your end user's device. Right? So monitoring from the edge, you know, you probably get an alert that you're seeing some internal server errors or 500 responses being returned from your web app. To actually start debugging this problem, you need to know the HTTP route, so, like, the API slash search HTTP route. And you need to know the client version, the fact that it's an Android client, and you also need to know what region it's in, which is, like, US West.
You start to add up, like, some of these dimensions here. You wanna search on the HTTP route, and maybe you have, say, a 100 HTTP routes that your web app, you know, which has been around for a year or 2, now has. You wanna alert on the status codes. So let's say there's, like, 5 major status codes, you know, 2xx, 4xx, what have you. Maybe you're running in, you know, 6 to 12 regions because you're deployed close to where these devices are running, and maybe you have, like, 40 different client app versions, because maybe it's not just iOS versus Android. It's actually the actual version number as well as the platform that's calling you. So these are the type of dimensions where you want to start to think, like, hey. Oh, it's the search endpoint. Oh, it's Android v2. Oh, and it's in US West. I'm also seeing that Redis is unhappy. It's correlated to this code path. So, anyway, to get that level of dimensionality on, like, how your requests are performing at the edge, say you had some of those numbers I was talking about, a 100 endpoints, 5 status codes, 12 regions, say, and 40 client versions.
That's 240,000 unique time series, and so that's a pretty high cardinality metric that we're talking about, and that's just capturing status codes at the edge of your entire web back end tier, so at the edge of where your environment finishes. But, you know, there's things you can do to make that quick, which is, like, remove dimensions off of it, but then that makes, you know, the alert or the data that you're looking at much less valuable too. And so, also, imagine if, you know, you wanna put in the country where the user is calling from. Perhaps you have some different logic based on the user's country. You know, that's a multiplier of 250. So now you're multiplying 240,000 time series by 250. And so I guess, like, it's very easy to get into these, like, cardinality explosions, and that's kind of, like, 1 of the better concrete examples that I like to talk about.
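As a quick sanity check on those numbers, here is the back-of-the-envelope math from that example as a runnable sketch; the counts are the illustrative figures from the conversation, not measurements:

```python
# Back-of-the-envelope cardinality math from the example above.
# One unique time series exists per distinct combination of label values.
routes = 100           # HTTP routes on the web app
status_codes = 5       # 2xx, 3xx, 4xx, 5xx, ...
regions = 12           # regions the edge tier is deployed in
client_versions = 40   # platform x app-version combinations

series = routes * status_codes * regions * client_versions
print(series)  # 240000 unique time series for a single edge metric

# Adding one more dimension multiplies the whole space again:
countries = 250
print(series * countries)  # 60000000 -- a 60 million series explosion
```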
And a lot of people, I think, have derived a high amount of value from kind of adding these. Like, at some point, it gets a bit ridiculous adding on every nth degree dimension, but some of these are not, you know. Like, we talked about HTTP route, status code, region, a client version, and maybe, like, some data about the market that you're running in. It's not like a crazy ask to have those kind of dimensions. So, yeah, moving a little bit onto, you know, how we think about solving this, it's really about kind of giving users the levers to instrument first, not worry too much about cardinality, and then be able to actually make sense of the data later. So 1 of the big things that Chronosphere and M3 do is, you know, provide, in front of the time series database, a streaming metrics aggregator. And you can think of that similar to platforms like Apache Storm or Flink. And what it's doing is it's transforming the data as it's being emitted, and doing that by, you know, acting on these messages in memory and then computing aggregates and then passing those aggregates on to the time series database. So you can imagine that by default, a lot of this data that I even talked about just then might have, like, a container value on it. Now if you have, like, a 100 containers deployed, now you're taking that 240,000 time series, and you're multiplying it by a 100 again to get into the many millions.
So sometimes people say, I want that level of granularity, but maybe for latency or for other types of data, I don't actually need the container level instrumentation. I want the container level CPU and stuff like that, but I don't need, like, the crazy cardinality on the request metadata. It's really easy in our platform to kinda, like, profile the metric stream as it's coming in. We do weighted sampling, we use reservoir sampling, the weighted algorithm, to kinda show you, like, what the stream of metrics coming in by its dimensions is, and it lets you, like, group by different tags, other dimensions, and then kinda slice and dice by the highly unique ones or the ones that appear on everything with a low frequency of unique values, so you can kind of, like, divvy up the metric stream. And so that lets you kinda, like, see the shape of your data, and then, from a few clicks from there, you can start to create pivots on that. So derived aggregated metrics that at streaming time are pulled in and aggregated and give you much, much faster roll ups on these views, so that you can both alert on them but also, like, you know, look back 30, 60, 90 days or a year, and the graph actually loads instead of having to go process, you know, hundreds of terabytes of raw data across those millions and millions of unique time series to actually give you a response there.
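As a rough illustration of that streaming rollup idea, here is a minimal sketch of dropping a high-cardinality dimension and pre-aggregating before storage. This is not the actual M3 aggregator API, and the label names are hypothetical; it only shows the shape of the technique:

```python
# Minimal sketch of the streaming-aggregation idea described above: strip a
# high-cardinality dimension (here, "container") from incoming samples and
# roll the values up before they ever reach the time series database.
# Purely illustrative; not the M3 aggregator's actual API.
from collections import defaultdict
from typing import Dict, Tuple

DROP_LABELS = {"container"}  # hypothetical rollup rule: drop the container ID

def rollup_key(labels: Dict[str, str]) -> Tuple[Tuple[str, str], ...]:
    """Key a sample by every label except the ones the rule drops."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in DROP_LABELS))

def aggregate(samples):
    """Sum counter increments into one series per rolled-up label set."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[rollup_key(labels)] += value
    return dict(totals)

# 100 containers each reporting the same route/status collapse into 1 series.
incoming = [
    ({"route": "/api/search", "status": "500", "container": f"c{i}"}, 1.0)
    for i in range(100)
]
print(aggregate(incoming))
# {(("route", "/api/search"), ("status", "500")): 100.0}
```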
So a lot of what we think about with, like, M3 and Chronosphere in general is about coming back to that problem of, like, to give meaningful fast results, just like I was talking about on that architectural website I worked on ages ago, this data has to be aggregated to be useful, and you also need to find the slivers, you know, of those time series quickly. It's not fast enough to grep through them like you would with log or structured events. The other thing as well is, like, if there are some industrial use cases you wanna onboard that are high cardinality and you wanna keep the raw data as well, you need to be able to horizontally scale that out quickly. So, you know, if you're kind of, like, pulling data into these silos and different monitoring databases that aren't connected to each other, it becomes really hard to ask a question over high cardinality data where the data lives in all these different systems. So much like the benefit HDFS, you know, brings to big data and, like, a data lake typically brings to businesses by kind of centralizing that data and being able to reason on it in 1 place, we wanted that ability by making, you know, M3 and Chronosphere a scale out story for you, so that, you know, if you did want to quickly double the capacity of time series data you wanted to actually look at, you could. You could also use rules so that you could keep some of them for longer than others, because that was another thing. Like, you really wanna be able to pull levers and say, like, this data, I care about these dimensions.
This other data, I care about keeping raw, but only for x number of days. So all those use cases that I've kind of just talked about there weren't really available on the market. And to this day, you know, I still think of these all as, like, fundamental building blocks to be able to make sense of data at this level of scale. And that's what we're all about, and that's kind of why we're here.
[00:33:58] Unknown:
Digging more into the Chronosphere platform itself, can you talk a bit about how you've architected that and some of the features and capabilities that you've built on in addition to what M3 offers out of the box?
[00:34:11] Unknown:
Chronosphere is kind of solving more of that mission, like I just talked about. M3 really is, you know, a piece of infrastructure that is the building blocks for doing what we're talking about, which is being able to make use of an increasingly higher level of chatty data that can give you much more interesting answers than it could before because it supports arbitrary dimensions. The Chronosphere product in general is about adding smarter rate limiting in front of that data stream. So sometimes you wanna, you know, kind of customize these metric use cases yourself, and then other times, you just wanna kind of use the platform as if it was an unlimited resource. But then when, you know, you do something that is an extreme explosion of data, you kinda want the system to just look after itself. So what Chronosphere Cloud brings to the table is intelligence to kind of describe your organization that's kinda collecting this data. And that can just be as simple as, like, every metric that's coming from team A has team A's tag or label on it, and team B has team B's label on it. But kind of, like, Chronosphere is, like, highly aware of that contextual link that you build and then can essentially say, oh, okay. Well, if 1 of the applications that team A owns is starting to, like, emit way more data than it did on a Friday afternoon when some engineer, you know, deployed it, we're gonna clamp down automatically on just that kind of metric family on that application for team A, and team B is not gonna be interrupted.
Most of team A's applications won't be interrupted either. You know, ideally, we'll just be dropping data from this kind of new cardinality explosion that happened from that team over there. And then, you know, there's a lot of, like, the management side of things that goes on as well. All of the features that Chronosphere gives you are source controllable. So, you know, all of your alerting definitions, all of your aggregations that you're using with our metric streaming aggregator can be defined locally. And we have a Terraform provider, so all your alerts can also be declared in something like Terraform. So a lot of it's about, like, making it a service that fits very neatly into your engineering workflow and developer workflow and just allows everyone to treat it almost like, you know, the Kubernetes of monitoring, essentially.
We have a command line tool and other things out there to kind of, like, help automate a bunch of this stuff that your SREs and other, you know, developers that are helping you set things up do locally. Yeah. A lot of it's around information management. A lot of it's around the ergonomics of, like, running at this level, like, collecting this much data. There's a lot more that we're experimenting with, of course, as well, with kind of showing you a trace view of this data with 1 click from a dashboard and a data point, and, you know, getting you from 0 to somewhere much quicker than is typical.
[00:37:20] Unknown:
Digging more into the actual use case of metrics, coming from somebody with an operations background, my automatic default is thinking about systems and application metrics, which we've been talking about so far. But what are some of the other types of metric sources that people are working with, particularly if they're in a data engineering or data science or machine learning context?
[00:37:42] Unknown:
So some of the more interesting things, you know, that we saw at Uber being used with metrics were things that, I guess, you typically wouldn't have thought would be stored in a system metric store. So for instance, there were folks developing features that kind of measured how long it took to process, like, a user's request. I mean, a given request to, you know, like, call an Uber at an airport, and folks could kind of just decorate their code with, like, timing information and then quickly be able to, like, graph that and give that back to the operations folks, kinda, like, at the company.
You know, typically, like, you could build a feature like that by, you know, capturing how long that dispatch took, saving that to a MySQL database or some other data warehouse, and then kind of, like, doing a more typical MapReduce job. But, a, it doesn't give you the data in a very real time nature, and, also, b, you have to go through the entire process of data modeling for that event and kind of, like, making sure that gets into your local company's data pipeline, whether that's using Kafka. You know, you have to choose, like, a schema if you're using Avro or some other structured format to describe that event. And then, of course, instead of getting a graphing view, a lot of the time you're stuck with, like, getting the SQL output or some kind of MapReduce output of, like, a Hive query for that data. And so the turnaround time on getting some of those answers and then also being able to monitor on that data was just a lot quicker with metrics.
So, for instance, like, the mobile app experiments were a lot easier to kind of monitor in an aggregate form in the metrics and monitoring store than it was in the typical data warehouse for getting, like, responses on, hey, I'm rolling out a new experiment. I just turned it on in these, like, in these geo areas or with these client versions in the beta channel. Like, you know, did things get slower in terms of request latency? Did certain events like help support tickets being created go up or down? So you can kinda, like, measure core KPIs and drill down on a per experiment basis on, like, you know, how is that experiment doing? Did that degrade things? Like, was there likely a bug with that experiment that wasn't showing up in any back end metrics, but you could see from a divergence in the front end, you know, how the product was being used in the mobile app, and do that kind of in real time without waiting for a data pipeline to kind of process all these events and give you anything meaningful a few hours later. So, you know, both of those use cases are fairly interesting. You'd kinda typically think of them more obviously being solved with a more typical, like, pure data stack, but there was a lot of value in replicating that data, or even sometimes starting out in the metrics and monitoring store, before being modeled, you know, to a high level of detail and then stored somewhere else for more, like, deep analytical use cases.
Those are 2 that I think are definitely interesting. The other 1, of course, which you kind of alluded to, was, you know, the metrics and monitoring store being used for kind of measuring the performance of machine learning models. Both the models and the features, you know, tracking things like their availability, capacity, utilization, staleness, checking, like, the online feature serving, how long that took, how much throughput of errors, or other kinds of signals thrown off about how accurately the models themselves thought they were performing. Things like that were really interesting and gave a lot of quick turnaround time for people that were iterating on machine learning models and stuff like that. And then, you know, there's the ability for us to just, like, use a scale out model and just add servers to this kind of consumer grade back end. Right? Like, M3 and Chronosphere are really just about being able to look at everything as something that can just use consumer grade, like, hardware to just scale out quickly and cheaply.
And so that kind of let us run things like monitoring how different surge algorithms would actually perform. And so while surge algorithms were running for different geo hexagons, like, a few different versions of what the surge algorithm could be tweaked to do would run at the same time and then emit their values as a discrete, like, float value into the metrics monitoring system. And, therefore, you could kind of, like, go and see how surge would have been at different points in time using a different algorithm. You know, similar to the machine learning model case, just having that ability to kinda say, okay, I'm gonna, like, purchase a few more, like, compute instances here and collect this data. I only wanna keep it around for, like, 7 days or, like, a few days. But, you know, just being able to scale that out and then scale that back in really easily was super valuable.
So, you know, the kind of, like, surge algorithms we're talking about here ran over any kilometer wide hexagon they could touch. So we're talking about tens of millions of hexagons. And then if you divide, you know, that number by 60, because you're collecting a value every 60 seconds, and there's only 10 or 20% of these hexagons ever computing a value, you get into the hundreds of thousands per second range, which is fairly easy. Like, a single M3 node can do hundreds of thousands of time series per second at the collection interval. So now that you have the ability to, like, quickly scale up and add capacity to collect this kind of data, like, the kind of things you could do and the ways that you could look at it, especially using open source tools like Grafana, become a lot more accessible. Rather than, typically, what you would be doing in other platforms a lot of the time is, instead of being able to graph this stuff, you'd have to write a large job, maintain a pipeline, and do a lot of work to kind of, like, visualize or make sense of those results as well, even once you got it in through that pipeline.
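For a rough sense of that write rate, here is the same back-of-the-envelope arithmetic as a sketch; the exact hexagon count and active fraction are illustrative stand-ins for the "tens of millions" and "10 or 20%" figures mentioned above:

```python
# Rough version of the write-rate arithmetic above (numbers are illustrative).
hexagons = 50_000_000       # "tens of millions" of kilometer-wide hexagons
active_fraction = 0.2       # only 10-20% ever compute a surge value
interval_seconds = 60       # one value emitted per active hexagon per minute

writes_per_second = hexagons * active_fraction / interval_seconds
print(f"{writes_per_second:,.0f} datapoints/sec")  # 166,667 datapoints/sec
```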
[00:44:03] Unknown:
On the side of actually reading the data back out, there's the fact that a lot of the time when you're issuing a read, you're only going to be interested in a small subset of the overall number of metrics that are written. But also, the reading operation isn't necessarily going to have a human behind it, because a lot of the use cases for storing metrics are for monitoring and alerting, where you want to be notified if there are certain patterns in the time series, or if you run over certain thresholds in a given metric, or if there are anomalies in 1 or a grouping of metrics.
And I'm curious how you handle addressing that with low latency and low enough latency that actually getting an alert is meaningful and that you can respond to it in the context and in the timeline that is going to be reasonable for making sure that the system stays healthy and so that an end user either doesn't experience an outage or that their experience of any sort of outage or downtime is minimal?
[00:45:03] Unknown:
Being able to do that, to react to an event quickly or be able to roll back from a problem quickly even when you're deploying extremely complex software, is the major reason why most people, you know, invest in their monitoring and observability infrastructure. You just couldn't really build systems this complex, you know, with this many kind of different accesses and code paths, like, 10 years ago, because, like, you just would have no way of even working out what's going on unless you waded through tons and tons of log data. And even then, like, the tools weren't good enough, so you couldn't even look at any of that data quickly enough or in any meaningful way.
So, yeah, that is the majority of the challenge of kind of, like, finding these subslices of time series you care about and being able to alert on them in very near real time, like seconds after the events happen. And so, you know, a lot of what goes into that for us, it's both, like, solving that at a technological layer, so, like, designing and architecting things so that they can do very high frequency evaluation on this data. And then it's also, of course, about, like, operational excellence. You know, if you have a monitoring system that's going down all the time, then you're really only getting so much coverage.
And, yeah, you're kind of flying blind for significant periods of time. I think of the 2 axes in terms of, like, being able to do it at scale and then also being reliable enough that I can actually be dependent on it, that it will be guaranteed to tell you when there's a problem, even if it's, like, this tiny thing experiencing an issue in a pretty large set of infrastructure. And so, you know, some of the things that make this all possible for us is, obviously, we replicate the data 3 times so that you can lose, you know, multiple instances of all your data and everything keeps humming along. So that doesn't mean that you just have a, you know, a gap in your monitoring. You have to actively lose multiple availability zones in a compute region that's storing your monitoring data to experience any sort of real outage, because the replication takes care of that for you. And then the other 1 is, you know, for actually being able to alert on these quickly, it's about having a fantastic reverse index.
Combined with a streaming aggregator that can really compress those signals that you need to look at. Like, it's not economical to just process, you know, the huge amounts of data that you're ingesting all at, like, query time. For instance, as I just said, like, the example we walked through was you have an internal server error being returned from an edge node to a single mobile phone; you don't really care which container that came from when you just wanna be notified about the problem. Right? Like, you may care about what that container was as soon as you do get notified. But for the notification to get triggered, the internal monitoring system can peel that dimension off of your metrics. And so, you know, what we do is obviously make it possible for, like, looking at the query, working out which dimensions you're actually querying on, and then forming an aggregate view of that, so that at least for determining whether the alert is being tripped or not, we can look at much more aggregated versions of this data, which has far fewer distinct time series that need to be loaded and evaluated very frequently.
And so the very fact that we're doing that in memory and aggregating that on the way into the time series database means that we have a lot of bandwidth on the time series database itself to do a large amount of evaluations. So the other way to aggregate this data is, obviously, like, take all those container ID data points of those containers after they've been stored in the time series database and then kind of squash them together into the aggregate view. But the problem is you actually need to issue a query against the time series database to do that. So it's really about squashing data on the way in to the sets of dimensions you care about when you set up an alert. And so that gives us far fewer distinct series, and then that reduces the load on the time series database, because now all you need to look for is, like, that much smaller set of time series and evaluate against that. And then the fact that we're doing the aggregation on the way in means that you could have, like, hundreds of thousands. And at Uber, we had 150,000 alerts set up that, like, could notify you within seconds of when, you know, an event happened, all set up against this database.
And so those are kind of, like, the major architectural ways that we get to, you know, having a highly reliable system that's always up, that can also collect a lot of these signals at very high cardinality, but then give you an answer about when there's kind of, like, an expression that's tripping. And then from there, you can go and look at the raw time series very easily by just clicking on that alert expression and seeing the raw underlying data and see which, like, container is misbehaving. And then, of course, like, to actually get to those few distinct time series that we're talking about, that reverse index and it being scalable is incredibly important, because it allows you to take a multidimensional set of values, map that into a postings list, and then pull, like, the exact time series for the metric query, rather than having to go and do, like, a scan or a search of, like, a sorted string table or anything like that. So it's kind of like the very opposite of a pure column data store. You're really marrying an inverted index, which is much like what Apache Lucene and Elasticsearch are built on top of, with a column store that has a highly compressed set of the time series, using the actual inverted index to go and quickly find which 1 of those, like, billions of time series to kinda present based on your multidimensional query.
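To illustrate that postings-list lookup in the abstract, here is a toy sketch of an inverted index over series labels; it is purely conceptual (in-memory Python with hypothetical label names) and not M3's actual index implementation:

```python
# Toy illustration of the inverted-index lookup described above: each
# (label, value) pair maps to a postings list of series IDs, and a
# multidimensional query is just an intersection of those lists.
# Conceptual sketch only; not M3's actual index structure.
from collections import defaultdict

class SeriesIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (label, value) -> set of series IDs

    def index(self, series_id: int, labels: dict) -> None:
        for pair in labels.items():
            self.postings[pair].add(series_id)

    def query(self, **labels) -> set:
        """Return series IDs matching every given label exactly."""
        lists = [self.postings[pair] for pair in labels.items()]
        return set.intersection(*lists) if lists else set()

idx = SeriesIndex()
idx.index(1, {"route": "/api/search", "status": "500", "region": "us-west"})
idx.index(2, {"route": "/api/search", "status": "200", "region": "us-west"})
idx.index(3, {"route": "/api/chat", "status": "500", "region": "us-east"})

# "Which series are 500s on /api/search?" -> {1}, without scanning all series.
print(idx.query(route="/api/search", status="500"))
```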
[00:51:02] Unknown:
Yeah. It's definitely an interesting problem domain and a lot of complexity to dig into. It's definitely interesting to see the number of different ways that this problem has been addressed and attempted to be solved and how each time there's been a sort of architectural shift in the ways that people are building and deploying their applications, it leads to another generation of metrics engines and time series engines that are needed to be able to keep up with the growing complexity and the number of different sources of information and consumers of that information.
[00:51:37] Unknown:
Yeah, definitely. How we're solving it is really just trying to give you the magical experience that you get when you first use these tools. But solving some of these fundamental questions we're asking of our software can be achieved in multiple different ways; we're obviously optimizing one experience. Over time, monitoring in general will change, much like civil engineering has certain ways things are done that have been codified over a very long period of time.
Software engineering is codifying how we think about and operate when building systems. Monitoring and observability is fascinating, and it's why I'm happy to be doing this for 10 more years: it is a core pillar of how we build systems, how we will continue to build systems in the future, and one of the most fundamental building blocks for actually building a system that works.
[00:52:41] Unknown:
And in terms of the actual use cases for Chronosphere and M3, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:52:53] Unknown:
We have customers doing things like collecting a whole bunch of telemetry data about physical storage systems, so there are definitely IoT use cases. We also have customers like Tecton, who I believe was on your podcast just a few episodes ago. If you visit our website, you can see their write-up on how they use Chronosphere, which is basically for monitoring the feature store that they run and for providing metrics to the end users of their machine learning platform. That is definitely a fascinating use case.
I've also seen M3 used for things like storing a whole bunch of telemetry data from self-driving vehicles, which I thought was pretty interesting. There are a lot of things you want to inspect about a dataset and make decisions on very quickly that only a system giving you very low latency access to time series data can do. So those are a few of the use cases. But in general, and I'd love to hear your thoughts on this since you're obviously experienced in this space of running software and reasoning about it, I feel there are a lot more higher-level concepts and signals about the very code people are writing today going into things like metrics than there were before.
That's leading to some really interesting things, like automated rollback of systems and telemetry about how things are communicating with each other outside of the data center. Some of those are super exciting; they're more generic software monitoring patterns that are starting to appear. But do you have any thoughts on how that's evolving?
[00:54:53] Unknown:
Yeah, I definitely think that the availability of metrics engines, particularly as a service, has driven a lot of adoption around actually instrumenting applications. With the growth of containers, the corresponding shrinking of service sizes, and the rise of things like OpenTracing, a lot more teams are starting to adopt things like data sampling and request sampling to understand how their systems are communicating with each other and how they are interacting with external dependencies, which used to be more of a black box, and to bring in more information from things like the database to understand what the latencies and query times are and how to optimize the code paths there.
There's definitely a lot of opportunity to bring that information into the development life cycle so that you can improve your product without wasting a bunch of time trying to figure out what is actually happening, why certain code paths are slow, and digging through code. As you said, you can just throw a metric on something, release it, and quickly get feedback on its behavior. There's been a much bigger focus on observability as a first-class concern of building software systems than there was 10 or 15 years ago, and I think that's a positive trend and one that I'm happy to see continuing.
[00:56:24] Unknown:
Yeah, most definitely. One of the major things we're experimenting with now is native trace storage in M3, which I think is really interesting. Typically, back in the day, when a certain user was experiencing a problem, you would start a giant grep on all your logs across your entire distributed system, and finding the logs from each system was painful because you couldn't run a query across all systems for that user's logs. But with things like tracing, you can index certain fields, not everything about a trace, but the things that are important to you, like user ID, and reliably get a trace back from that, because you're using not just plain sampling but some kind of tail-based sampling strategy.
All of this is going to be very meaningful for changing how we actually perform the day-to-day tasks of building systems. I'm definitely excited to see that development.
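As a minimal sketch of the tail-based sampling idea mentioned above, and only as an illustration (the thresholds and data structures here are assumptions, not how M3 or Chronosphere implement it), a collector could buffer a whole trace and decide to keep it only after seeing the tail, for example when any span errored or the end-to-end latency was unusually high:

```go
package main

import (
	"fmt"
	"time"
)

// Span is a minimal stand-in for one span in a distributed trace.
type Span struct {
	Duration time.Duration
	Err      bool
}

// keepTrace makes a tail-based sampling decision: the whole trace is
// buffered until it completes, then retained if any span errored or if
// the total latency exceeded a threshold. Head-based sampling, by
// contrast, has to decide before seeing any of this information.
func keepTrace(spans []Span, slow time.Duration) bool {
	var total time.Duration
	for _, s := range spans {
		if s.Err {
			return true
		}
		total += s.Duration
	}
	return total > slow
}

func main() {
	trace := []Span{
		{Duration: 40 * time.Millisecond},
		{Duration: 900 * time.Millisecond},
	}
	// The slow trace is kept even though no span returned an error.
	fmt.Println(keepTrace(trace, 500*time.Millisecond))
}
```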
[00:57:27] Unknown:
And in terms of your experience building the Chronosphere business, building out M3 and contributing to it, and just running this metrics system, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:44] Unknown:
A lot of it is wanting to do everything under the sun. Obviously, that's impossible, so it's really about deciding what is important, solving those most important things every day, and being religious about solving the things that matter most. It's sometimes easier to work on a problem that's interesting rather than the problem that's most important. At Chronosphere, we're really focused on making sure that everyone is a master of their domain, that all our teams are strong teams that can function independently, and that everyone is empowered to autonomously deliver and work on the system. That has honestly been the most important thing, as much as I still want to get my hands dirty with a little bit of M3.
So it's just been a rigorous prioritization function that we apply every day. Everyone's going through a different world with the pandemic as well, shifting their lives around it, and that's been an interesting thing to navigate while also dealing with a new company and a baby that was born two months into the pandemic, alongside the rest of my family. It's been a very interesting time with tons of challenges, but I couldn't imagine myself doing anything else.
[00:59:21] Unknown:
Congratulations on the new arrival and on the work that you've been doing on the business.
[00:59:26] Unknown:
Thanks, Tobias.
[00:59:27] Unknown:
And for people who are looking at a system to store, analyze, and alert on their metrics, what are the cases where Chronosphere is the wrong choice?
[00:59:38] Unknown:
I would say that right now, most of the folks working with us deal with a minimum volume of tens of thousands of metric samples per second. So anyone running at 10,000 samples per second or less, unless they're planning for very significant growth right around the corner, would probably have a challenge justifying us. The way we view things, a lot of cloud native applications and systems being built today work best with Prometheus-style metrics. While there are plenty of other well-established vendors out there, like Datadog, I think using cloud native technologies such as Prometheus and Kubernetes makes a lot of sense, and we're obviously one of the more compatible options there. So if you're in that stack and you're experiencing growth anywhere from 10,000 metric samples per second and up, it's definitely worth having a chat with us.
[01:00:49] Unknown:
And as you continue to build out the business and the platform, what do you have planned for the future?
[01:00:55] Unknown:
In my mind, we're never going to be finished with solving monitoring for a constantly evolving, complex world. I would also love to get into the space of being able to go from a data point on a graph to a line of code and explore what's going on with that code path across all your applications, and how it relates to other code paths. I think Sourcegraph is a fascinating tool that has a great reverse index on code symbols. And I'd love to see monitoring get to a point where we can be really intelligent about what you're doing and give you Google Now-style suggestions: hey, we noticed your database has a lot of open connections for the request rate you're doing. Do you need that many connections open? Because that could impact performance.
So these more deeply integrated, contextually aware features in the monitoring space, along with being much better integrated into the way you work on your code and your systems in general, both of those areas, I think, will be a large area of investment for us over time.
[01:02:11] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:25] Unknown:
In terms of data management today, I really think about it as the user experience aspects of both producing and then viewing and harnessing that data. One of the largest problems is the very cookie-cutter approach to how people do that task in general. I think schema-free metrics are really empowering on the producer side, but when it comes to actually harnessing that data, there are still leaps and bounds to go. It should feel like a magical experience, and there's no reason it can't be. Much like tools have been progressing every 5 or 10 years, our systems are finally able to move huge amounts of data around; Snowflake can obviously clone an entire table in seconds now. I think there will be more movement on features like that for consumers of the data as well as producers, to really be able to manipulate, transform, categorize, and more natively think about and track that state. There will be huge improvements in this space.
[01:03:39] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Chronosphere and on the M3 database. It's definitely a very interesting problem domain and one that, as I mentioned, I'm very close to in my day-to-day work. So thank you for all of the time and effort you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Likewise. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Rob Skillington: Introduction and Background
Early Career and Experiences at Microsoft
Challenges in Monitoring and Observability at Uber
Building Chronosphere: Motivation and Goals
Handling High Cardinality Metrics
Chronosphere Platform Architecture and Features
Use Cases Beyond System Metrics
Low Latency Alerting and Monitoring
Future of Monitoring and Observability
Interesting Use Cases and Lessons Learned
Future Plans for Chronosphere
Biggest Gaps in Data Management Tooling
Closing Remarks