Building The DataDog Platform For Processing Timeseries Data At Massive Scale

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the project you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode.

With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform.

If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai.

And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.

Go to data engineering podcast.com/linode,

that's

linode,

today to get a $20 credit and launch a new server in under a minute.

And don't forget to thank them for their continued support of this show.

And you listen to this show to learn and stay up to date with what's happening in databases,

streaming platforms, big data, and everything else you need to know about modern data management.

For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.

We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence,

ODSC,

and Data Council.

Upcoming events include the software architecture conference, the Strata Data Conference, and PyCon US.

Go to data engineering podcast.com/conferences

to learn more about these and other events and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macy. And today, I'm interviewing Vadim Semenov about how data engineering works at Datadog. So, Vadim, can you start by introducing yourself? Hi. Hi, everyone.

Thank you, Tobias, for inviting me to speak at, on your podcast.

I've been working with Hadoop and Big Data for the past, like, 7,

8 years. And previously, I was working on scaling distributed,

systems.

And with Datadog, I've been working for the past 4 years as a software engineer slash data engineer,

and I was responsible for lots of things that we've built there. And do you remember how you first got involved in the area of data management? Yeah. Definitely. I remember. So the first company that I was working

at, it was, like, edtech company.

And we were having, like, lots of,

problems

regarding, like,

clients and websites where we're showing ads, and we had to, like, provide, like, consistent proofs, like, figuring out, like, different issues with, like,

who saw our ads,

where we've shown it, and how we compare it, and where is the fraud, and how

everything like,

most of life was figuring out, like, fraudulent traffic, and there was, like, a lot of logs involved. And there was, like, a system where there was built by some different team that I had to dig deep in, and it was built on, like, Hadoop Hive.

Hive was, like, 0 point I don't remember the version, actually. It's, like, 0.6 or something like that. And Hadoop was also 0 point something.

I was riding on on AC 2. I don't think, like, we had EMR back then,

but that's how I got involved. And, like, I was, like, really,

interested in the system, and I was started digging, like, Hadoop. And, eventually,

I thought those responsibilities were

handed over to me to,

like, overall, like, scale the system and, like, push it forward.

So that's how I got involved into the whole, like, big data system.

So you got thrown into the deep end and just had to figure it out as you go as do so many of us in engineering.

Yeah. That was, like, a pretty interesting time. Like, there were not a lot of resources about Hadoop. And Hive was Hive was, like, kinda, like, first, like,

tool easy to manage for, like, lots of engineering, like, because you basically write SQL and, like, it runs. But it was, like, all on, like, Hadoop m r 1 model,

which wasn't super scaled compared to yard.

And there were, like, lots of tricks around that.

And so 1 of the other things that I think is interesting is the fact that you mentioned that your title is software engineer in data. And I think it's interesting

the distinctions that are drawn between software engineers and data engineers and where the sort of primary focus or primary responsibilities

lie. And so I'm wondering if you can talk a bit about

the ways that your role is sort of defined and the types of things that you're working on and thinking about at Datadog, and if there are any

specific roles at Datadog that are more on the data engineering side and sort of where those distinctions and boundaries lie. So that's an interesting question. So over the years, like, the term, like, data engineer

appeared, like, maybe, like, 10 years ago or something like that.

And I was following it closely, and it was, like, really,

software engineering heavy. Like, people were working on, like, lots of scaling problems and overall.

But over time, the term, like, change its meaning. And then we see it,

in a lot of companies, data engineers, like, defines, like, loosely now. So at Datadog,

we started, like, moving a little bit away from it and for every candidate and overall. We tried to explain, like, what we mean as a data engineer.

And we we do actually sorry. We don't actually mean, like, what we mean by data engineer, but we explain, like, what kind of data engineering problems we solve.

And in our case, data engineer is not some someone who just implements, like, pipelines and,

writes codes code for analysts and data scientists. It's someone who tackles, like, the most challenging problems,

and builds, like, building blocks for other teams.

And, overall, at Datadog, we for example, when our data science team

was working on anomaly detection, they had to build models using 30 days of data, and they started breaking our query cache.

And they

had to rebuild, our query cache from Redis

to Cassandra, which is not super

typical for data scientists to work on. Right?

They got obviously help from DevOps, but they wrote, like, some chef cookbooks,

that we used before,

and overall

deployed the system because it was their feature to deliver. And the same, applies to all other engineers that we try to hire. So we try to make sure that a person is not just an expert in their

domain, but can also go, like, to different different fill and figure out, lots of things for themselves. And that's also shapes, like, how we approach data engineering. So it's not someone who just, like, gets requirements from other teams and builds pipelines and

does stuff. No. Our analysts and data scientists, they all build their own pipelines. They write tests.

They manage their own clusters

and and so on.

So data engineering, like, that's why, like,

it's like compared, like, to,

overall of the industry concept of that, we try to, like,

explain a lot about, like, what we basically mean by that. It's not gonna be just, like, writing SQL queries. It's not gonna be just, like, wrangling some pipelines.

We we have to go to

some parts of our query code. We have to optimize JVM. We have to, like, figure figure out, like, how memory

works inside of our own computers and and so on. We we work with binary file formats. We read data from Kafka.

We manage different systems. We think about ops and lots of other things.

So in a lot of ways, it seems like you could kind of describe it as the SRE model,

but applied to data problems where you have somebody who is

a

quote, unquote expert in a broad variety of systems that acts as sort of a an enabler and a consultant to all the other teams who are focused more on the product engineering side of things to help them make sure that they have what is necessary to run what they're trying to build.

And you're just trying to build out the

foundational layers and the building blocks that help them do their job and help them understand how it all fits together.

Overall, yeah. Like, because, like, our our company is geared toward

other engineers and DevOps and SREs and other people.

This approach,

is in is in the core of, like, everyone who we, like, try to, like, bring to data.

But, like, we have, like, different teams that deal with different types of data. So we have, like,

metrics intake team that deals with high volume of data. We have analytics team that deals with, like, lots of variable data.

And we have, like, revenue team that has to provide, like, make sure that the numbers are correctly calculated.

So there's, like, different challenges for all those, like, quote, unquote, data engineers, if you will.

And in different teams, we have, like, different requirements. So it's not just like but in the core, we try to make sure that everybody's, like, kinda SRE.

We who someone who can deal with, like, ops, monitoring,

making sure that, the system is fault tolerant, resilient, and and so on.

And before we get too much further, can you just give a bit of a description about what Datadog is for anybody who isn't familiar with it and some of the types of data that you're dealing with and the types of scale that you're working at? So Datadog is a monitoring service for all kind of data, starting from,

ground level, like hardware monitoring, how your CPU is loaded, how much memory you use, going to ARPA level, like web servers or database layer, how many queries you make,

how many,

500 errors you throw

to completely application level where you can see, like, what kind of queries are slow, why they're slow, and so on. So we try to cover a huge range of, like, problems that a typical engineer

would face, and we help companies to know, like, if they have issues, where the issues appear, and so on. And the data that we deal with is pretty variable,

but we can define, like, categories. It's metrics,

basically numbers.

We have application performance monitoring where lots of, like, numbers and traces

are typed together

and logs, which is your basically

text data.

And the volume of data,

I'm not sure if I can disclose, but it's, like, in terms of, like, points, it's in tens of trillions per day.

So when you're working at that type of scale and you're dealing with time series data that is being surfaced

to other engineers and operators for their mission critical infrastructure. There's definitely a high requirement for

reliability and uptime of your

infrastructure. And so I'm curious what types of components you're relying on for the foundational platform for managing the ingestion and analysis

and surfacing of that data, and some of the types of challenges that you're dealing with, particularly because of the volume and the high uptime requirement?

Lots of people ask me this question, and they think, like, that we use, like, some kinda

standard tools. But the only thing that,

is, like,

standard for us is, like, Kafka. So most of data we get and process

gets put back,

in Kafka, and we have, like, lots of different services that consume

Kafka. And Kafka, we also, like we spread across different data centers.

And we try to make sure that different customers

get grouped together and so on.

So on Kafka level, we have some

resiliency,

fault tolerance because that's the backbone of, like, a lot of things that that we do. And for different services, we have, like, different, like, completely different clusters of Kafka.

And out of that, we have, like, 2 distinct groups of consumers.

The first 1 is real time consumers,

which mostly drive, like, the last, like, 24 hours of data,

lots of alerting that we have. And they don't have to store, like, this data,

in historical plan.

And inside, they're storing, like, completely custom built data stores. And in some places, we use, like, CrocsDB, but it's like embedded database, so

not a lot of things. We have some Cassandra,

which is we use for Coricash,

as I said before.

And and the, and the other, group of consumers that we have is historical.

So from Kafka, we also consume

data, and then we write it to in custom file formats

to call storages. And then the query system can easily get data either from live system or historical system.

That's how we show data overall.

I find it interesting that you're working in a custom file format, particularly for some of the historical data where I'm sure that there are a lot of benefits that you gain as a virtue of having them built for your specific use case, but at the same time, it adds a bit of extra burden for

onboarding engineers because they have to learn a specific tool rather than being able to translate information or experiences that they've had from other companies.

And so I'm curious what your experience has been on those lines as far

as any friction that is caused by having

more custom tooling where you may have been able to take advantage of something off the shelf for a slightly less optimal

performance or capabilities?

Yeah. So the goal of those, like, custom file formats is to

allow customers to see data as soon as possible. So whenever,

on a dashboard you open

and

want to see, like, for example, 7 days of data or, like, 1 month of data. That's a lot of data points, and we have to show them as fast as possible. And we try to make sure that, like, within a minute, you can get data. So those file formats are really optimized for reading data.

And

all for engineers,

as long as you have, like, tools. So for example, it doesn't,

from outside perspective, for it shouldn't matter, like, if it's, like, Parquet or, like, some other file format as long as you get data, like,

with a certain schema. Right? You don't really care, like, as long as, like, that schema is the same.

And we have, lots of tools for that. And overall,

there are, like, some challenges because of the system is so big. There is, like, lots of, like, moving parts, and it's not just like a single file. There's, like, lookup tables, index files, etcetera, etcetera.

But we've we decided that

that's the way to go, unfortunately.

We have some, for some intermediate data. We also use Parqet.

So we're not we're not only just using,

our custom file format. So, for example, for analytics purposes

and for, other

more flexible use cases, we've figured it out that we actually want store in parquet, and we do store in parquet. However, we apply some higher order aggregations there. So we don't store all the volume of data in parquet.

And for and in terms of data retention,

we store

everything with 1 second resolution

for

over 15 months, and customers can request, obviously, like, longer periods.

But, essentially, like, our system is optimized. Like, if you want to see, like, what was happening in your system last Black Friday,

for example, to forecast, like, what's gonna happen this Black Friday, You can easily go on dashboards and check what was happening

down to a second

in your system, and that's what, like, historical data is

useful for.

Yeah. Being able to maintain that 1 second tick resolution for such a long time frame, I'm sure, has some storage challenges. I'm wondering what the compressibility

of the information is given that it's all working along discrete time intervals, and there might be some

similarities in the patterns for being able to compact the actual files on disk. Yeah. So we're we're not doing, like, extra, like, research, extra rigorous research about, like, how we can compress

all the data because, like, as you said, like, data is so,

variable, and

there are, like, different,

different techniques. Right? If you were looking at, like, Parquet, like, they have, like, 4 different thing techniques of encoding, like, data. But we don't do that, and we, achieve about, like, 15 x compression ratio.

So whenever you seal you look at raw data it's probably actually even more because I was comparing,

compressed data from Kafka, and then we group data and compress the actual time series. So, in reality,

it can be, like, close to, like, I don't know, like, 25, 30.

And then another interesting piece is because of the fact that you're using these Kafka clusters running across multiple different data centers for ingesting customer data, I'm curious how you handle

performance

and time to alerting for people who are sending you metrics

and understanding sort of what the locality is of the origin for being able to

a

So if your customers are probably gonna live in, like, a certain cluster unless you, like, have certain agreements with us.

And that's why, that's how we avoid lots of, like, intro traffic. Right?

Because, like, cluster's kinda independent. And within our,

systems,

we

can quickly switch

different parts of our systems between clusters,

and we try to not man maintain state as much.

So for example, for historical data, we only, like, keep last 10 minutes.

And whenever we switch to a new cluster, we just need to replay, like, last 10 minutes of data.

So we completely we don't need our Kafka clusters to talk,

between each other. But within,

within,

a region,

we obviously deploy,

Kafka clusters

in multiple availability zones,

and that's where we get lots of traffic. Unfortunately, I don't really know all the details because it's, like, falls not outside of my work. We have, like, completely separate teams that's work on Kafka. They're called, like, data reliability engineers.

Yeah. So that's a good opportunity to talk a bit more about what the team structure looks at for the people who are working closely with the data at Datadog, and what your particular responsibilities are, and how you work within the organization, particularly given the fact that Datadog has grown in size pretty significantly over the past few years. And so just sort of how you

coordinate

the projects that you're working on across

the team boundaries and across geographical boundaries.

Yeah. So

first of all, like, when I joined 4 years ago, Datadog, we were a 150

people company.

And,

over 4 years, we've grown to

1300 people.

So we

8

hex growth, I guess.

It put, like, lots of, like, pressure on, like, how we'd structure teams and how we do work overall.

At first, like, we had, like, pretty flat organization, and now it's, we add layers.

And we had, like, people who, like, gather requirements, like product managers, directors

from team leads, and we compose, like,

objective key results for every quarter, and we have sprints,

obviously. But besides that, we do a lot of documents. We started doing lots of documents.

So whenever, like, we're working on some big project,

actually, like, no matter what what kind of project, like, it's all, like, case by case, but we created, like, a document where we request comments and lots of different teams can come and

comment, and we request from different teams

what do they think, how it should look like, how it's gonna work with,

other systems and other teams. Obviously, there it's not super perfect. There are still problems.

There's still some challenges, but we're working on them and, like, as any probably, as any growing company,

we're experiencing the same issues, and we're trying to overcome those.

I hope I answered your question.

Yeah. No. That's definitely good information. It's it's always interesting seeing what the team dynamics happen to be and what the breakdown is of responsibilities

across different organizations.

Because

in in broad strokes, we're all working with data. We're all doing what looks to be the same thing at a high level. But as you get closer in, there are a lot of different ways that people

break down the responsibilities

and what the main areas of focus are. And it's interesting how the specifics of the business

influence or dictate

what those boundaries happen to be. It's usually, like, when we start we we start, like, a new project,

and, like, it it is all within a team. Like, couple of people are working on it, and over time, it grows to, like, a bigger project.

And then, like, out of this, a new team gets born. Yeah. But that's usually the dynamics of how we break things.

Usually, like, some small project that has grown to

a certain size and

people break

into new teams.

The funny part about that is, like, how we handle ops.

That's because, like, you have, like, a big system and you have people

and, like, lots of people being, like, kinda rotation

to support the system. And once you break it apart, like, different teams, like, who's responsible for what,

that kind of becomes a little bit interesting to figure out.

And then in terms of

the work that you're doing, who are the main consumers of what you're building? And how do you work in feedback cycles to ensure that their needs are being met and that you're meeting the feature requirements

for the types of systems and primitives that they require to be able to get their job done?

So I work in kind of between, I, deal with

receiving data directly from Kafka

and storing it in custom binary file formats and then also providing a different formats to other teams. So for me, we have,

both, like,

customer facing features and internal

SLAs that we have to support between, like, for example, revenues, team analytics.

So,

up until, like, I would say, like, maybe, like, half a year year ago,

lots of requirements were coming from how, like, we operate the poll, like, product and pipelines that we've built and how well it can scale.

Lots of things that we've built early on were not,

built in mind with, like,

such a high growth in mind,

and they were started breaking apart, and

that's what mostly we were working on for the past, like, 3 years,

in, like, at least, like, in my team teams.

And now we are going into territory where, like, we fixed most of the, like, scaling issues, and we don't really have them. And we actually we were building systems with, like,

10 x growths, which should be enough for another 3 years. And now we can concentrate on other things, other requirements that

now come from different parts of the product.

So for example, we recently,

released,

histograms

for,

for which our data scientists

developed a new sketch algorithm,

and we're trying to make it available for historical data as well.

We also released SLI SLIs,

for our metrics,

and we're also trying to make it work for historical data as well. So now it's more into real of

bringing

features that we have in, like, life systems to

historical data. And lots of people need it, and we also have lots of developing in the data science front where we want to apply machine learning, not just to live data, but historical data on on larger periods of time.

And I know that machine learning is often something that's challenging to be able to run-in a production

context because of the fact that there can be model drift and the there there's not really a easy way to do deterministic testing of the model to make sure that it's operating optimally. And so I'm curious what types of challenges that you're dealing with or platforms that you're leveraging to be able to handle those machine learning models in a production environment for being able to do those historical analysis?

So,

I'm

I'm not really,

aware about all all the things in the machinery

offering that we have. So I might be wrong about some parts. I'm sorry.

But overall,

the main challenge there

is

how you create a general approach for

all kind of different time series data that customers have.

How you generalize that.

And there's only so many things you can do. You can try to build and, also, like, how you can make sure that it's, like, runs fast.

It doesn't have lots of false positives,

and so on. And

we released, like, first, like, anomaly and outlier detection algorithms, like, 3 years ago, and we saw some,

adoption.

But personally, for me,

I tried it.

It's difficult to interpret models. You know? Like, when it shows you that you have, like, a alpha and anomaly here, you're like, why is that? Like, how, like, why model decided

it's, like, a a problem. So

I personally, like,

not sure about how I feel, and I'm actually not sure how, like, overall

this problem can be solved. We have, like, other offerings, for example, Watchdog,

where we try to find related stories. So, for example, whenever, like, you have spike in your database connections, we can relate it to some certain, like, other time series and show you that maybe those are related, and that's where you should look at. We also have, some other machine learning capabilities such as, if you don't have patterns in logs, which is really helpful when you have, like, a constant stream of logs and you want to figure out what's what's happening often and what happens rarely. But in general, like, those, like, applying machine learning on such a big scale, it's really difficult. And companies who have their own, like, things that they have they have, like, internal monitoring solutions, they can really flex there

and because they can, like, build, like, certain

models that really

tied close to, like, their problems.

Yeah. It's definitely interesting

in your particular case of building these models that are feeding into some of the decisions that other operators are making based on the anomalies that can get surfaced and being able to

explain in a quick and accessible manner where those decisions are coming from,

and what types of actions might be useful, or what types of insights you're trying

to convey to them so that they can be able to make the necessary actions or determine whether an action is even necessary at all. Yeah. Definitely true. Overall, like, that's, like, overall, the goal of the the industry and, like, 1 of the things where I joined Datadog is to I wanted to help. Like, we have, like, so much data

and,

being able to correlate it and figure out problems early on would be really nice to have, but it requires, like, tons of tons of computer resources,

fortunately. Oh, fortunately. I don't know. And then the other interesting thing about the problem space you're in is that you're reliant on the end user

delivering data to you and its time series, which there are always issues about the order of deliverability where you might be having somebody who's sending data out of order, or they might have agents that are having networking issues that will maybe bunch up a bunch of data and then deliver it all at once, but deliver it late. And then there's also the fact that because you have end users sending the information,

they're as long as they're using your agents, it's likely that the schema is going to be accurate, but they might decide to start sending extraneous information or misformatted data that doesn't match the schema that you're anticipating. And so I'm curious how you deal with all of those issues of data cleanliness

and some of the challenges that are specific to time series. Yeah. So the first,

the first front is figuring out if data is small form or not, and we quickly

filter it out if it's small form on our end.

And about late and future arriving data, we have pretty,

lose windows. So

everywhere we we we say that, we accept 5th, up to 15 minutes in future. Like, so if your point came with the time stamp that is in the future, we'll accept as long as it's not farther than 15 minutes. And for later arriving points, yes, as you said, like, sometimes, like, service or host, like, are slow. There's, like, some bottleneck problems, or you can send, like, custom completely time series. We accept up to 3 hours

in the past, so we have to wait until we can actually start processing. Yeah. And I know that that can be challenging

for the application logic that you're dealing with because you're trying to maybe

build some windowing or build some insights into the stream of data as it's coming in. And then all of a sudden, you get a whole another batch of data points from 2 hours ago that you then need to recompute some correlations between

2 different metrics that are coming in. And so I'm wondering if you have had to deal with that in your own work as far as being able to

realign the data. Yeah. So,

because, like, we wait, like, 3 hours

in the past and 1 hour in the future, we we we start data in 4 hour chunks, and, basically, we have to use, like, 8 hour windows.

And 8 hour windows

kinda overlap,

and then we have, like, other problem. We also have, like, orgs that can migrate into different charts, chunks,

etcetera, etcetera. And

tying all this data together

is a big problem,

and we've been working hard on it to figure out, like, how to do correct correctly migrations. So for example, if,

an organic a customer grows,

too big for a certain chart and we move it to a new chart, how it's gonna work with overlapping data, how it's, like, late arriving data is gonna go, and so on. I wouldn't say that it's like we solve all the problems. We that's that's that's a lot of thinking went into it. And the problem is, as you said, yeah,

lots of systems also use the same dataset, and they also have to have all those capabilities.

We define some strict rules about doing such migrations,

and we try to follow. And, overall, like, we we still in trying to solve all those problems with, like, overlapping

data,

time windows. And you've been there for a few years now. I'm curious what you are most proud of or what have been some of the most interesting projects that you have been engaged with. And out of those, any of the lessons that you have found to be particularly valuable or unexpected

or just interesting

issues that you've had to confront?

Yeah. I have stories.

So, initially, when I was hired, I was hired, like, to bring Spark support

to Datadog.

Overall,

we, it use we used to use

big as our process and framework,

and bringing Spark support was pretty challenging in terms of what I had to,

we had inter we have internal platform that we use to launch jobs, like, kinda like Q Ball Databricks. And I had to, like, write, like, lots of code there, like, figure out, like,

optimal

settings for all the spark clusters,

then write bunch of tutorials, figure out how we're gonna compile code, how code is gonna be delivered, how it's gonna

signal our workflow management framework, and that work is done, and and so on and so on. There are lots of challenging aspects about that. And then once it was built, like, we started, like, moving, like, some historical

processing toward it, and then the challenge was, like, how we use Spark.

And back then, it was Spark 1.6,

Spark 2.0

was still in development, and we're we're pretty hesitant to use it. And Spark 1.6, we thought, like, yeah, it's it's great. It's reliable. It's stable, but turns out it's not. Like, there are, like, lots of places where it crashed. And over the past, like, 4 years, we've grown,

some expertise around Spark, but it still throws, like, some interesting problems

on our plate. And then, like, while we were working on, like, different, like, systems, we had to dig

deep into, like, how JVM works, how memories outlined, how does word aligning works, how much space our data structures take, how garbage collection works, how how to put everything in off heap and and so on and so on. And we actually found, like, some bug, in Scala itself and Spark.

And we helped to we we didn't help to fix it. We just noted that, yeah, actually, arrays can be bigger than this number and so on. So there are, lots of things that I learned is that, for example, for Spark, turns out, like, not like, at the scale that we use, not a lot of companies use it, actually.

And lots of, like, problems we've,

ran into, nobody have seen before.

And we had to, like, figure out, like, oh, how are we gonna do this and that? And on top of that as well,

we run most of our jobs on spot instances, which means that,

your cloud provider can take them away at any time. And this creates, like, such a violent environment.

And when you have, like, clusters of, like,

5, 000 cores and your instances are constantly dying and Spark runs on it, you realize that, you're probably not everyone's doing that, what we're doing. And whenever you have a problem, you're like, okay. I'm on my own. I need to get all the strike traces, all the logs, and figure out, like, what's up. That's

that's just a small portion of problems

we've run into. And the last 4 years have also been an interesting time in terms of just the overall shift in direction for both data platforms

and operations and infrastructure platforms,

particularly with the rise of

Kubernetes and just overall container based systems

and the proliferation

of

cloud platforms and cloud environments and different services. And so I know that Datadog also supports a number of different direct integrations with things such as Amazon or other cloud providers or third party SaaS platforms. And so I'm curious

what that overall

shift in the technology landscape has meant for you and your work at Datadog and the types of requirements of the systems that you're trying to build? Yeah. Definitely. So overall, like, as you said, like, the cloud is still growing, and then there is, like, 2nd wave, like, where everyone is moving to containers

and Kubernetes and everything. And from my perspective, for example, 1 of the biggest problems that we had to face is that containers live so short period of time. And for each con like, you're gonna have so many container IDs,

and ours some of our systems were not built in place for, like,

churn of, like, tags, basically, and we had to fight that. And it creates, like, different challenging problems. The other part is, like, we,

we made a huge push to move in all the services in Kubernetes.

And as part of,

my job, I had to move lots of services to Kubernetes as well. And that's not always part of, like, what data engineers do, but that's like yeah. That was my problem. We had to do it at Datadog.

So I got a hands on with, like, Kubernetes.

Overall, like,

we use Kubernetes extensively at Datadog, and for some more services,

there are, like, some challenges.

But also

it's still for me, it's difficult to judge, like,

how easy it is to use for, like, a ordinary engineering,

an ordinary

a typical engineer, but,

like, better the benefits of Kubernetes, I can see that, and I can see that I can see the the platform prevailing. And so far, like, we've been releasing lots of reports about cloud adoption,

container systems adoption,

and we've been showing that Kubernetes

is growing and Docker is growing as well. So that's just like like the the life we have to live with. And as you look forward to some of the projects that you've got planned for some of the coming months years, I'm curious,

what are some of the types of technologies

or best practices

or overall patterns and systems designs that you're trying to keep an eye on or that you're hoping to adopt and just some of the overall types of challenges that you're anticipating as you move forward? Yeah. So we at Datadog, we're not really,

hyped about different technologies.

We we've we've done some of them. We've we saw a lot lots. Some problems and as, like, at our scale,

picking up a technology

is a pretty,

we we should be conscious about how we approach that. So we don't really, like, put lots lots of, like, thoughts like, lots of, like, figuring out what kind of technologies we're gonna use. Instead, we put, a lot of efforts in, like, how we approach engineering and built in fault resilient

fault tolerant resilient systems is where,

we want to be. And overall, like,

overall, like, the problem problems that we're gonna face next,

from from my perspective,

are gonna be related to that. And in terms of the actual 1, it's probably ops.

So the number of services is keep keeps growing,

but the number of engineers is not growing as fast. And

overall, like, a human can only have, like, keep so many so many things in their their head.

And you want to make sure that ops is, like, automated, so we need to make sure that we are building a system that can auto recover, can have retries, can have proper

login, monitoring, etcetera, etcetera. We can replay data easily

and so on.

And the other part, which is gonna be huge for us, I think, is migrations.

So at our scale, like, when we develop new systems, we can just

roll them easily. We have to run them for, like, certain period of time, like half a year, for example. And when I when you run huge systems

on such scale, they

burn lots of lots of money. So you have to figure out, like, instruments that will allow you to move some data between

those 2. And then

on the other hand, on the other hand, you have systems that consume those data. You have to make sure that,

those systems work reliably with partially migrated systems.

And that's

that problem,

we haven't seen, like, been solved. Like, there's, like, not a lot of, like, guidance,

and we have to, like, develop our new techniques about that. Are there any other aspects of your work at Datadog or the types of projects that you're building or the platform in general that we didn't discuss yet that you'd like to cover before we close out the show? We have, like, some internal projects about, like, Spark performance,

monitoring.

So

I feel like

lots of companies, lots of people around, like like, their jobs, like, lots of pipelines,

but nobody tracks, like, how performance of those jobs, like, degrade or improve over time or with certain, like, code pushes, etcetera, etcetera. And we're trying to in our case, like, we have, like, hundreds of jobs that we run, and we're trying to see, like, how certain changes

improve

or break, like, jobs.

And we're working on it, but I'm not sure, like, how and when we're gonna release it. And we've researched, like, other open source projects, and we haven't yet seen like

this.

And I had, like, some conversations with other engineers without with other companies. I'm like, I feel like that's, like, 1 of the problems that,

data engineers have

overall in, like, businesses. Like, how how

can you measure how efficient your jobs are? Yeah. That's definitely a challenge because of the fact that there is such variability and seasonality of the data that you're

execution time of a job at 1 point in time isn't necessarily indicative of its overall performance because it's highly dependent on the data it's working with.

So unless you have very consistent datasets that you're processing on a regular basis, it's definitely difficult to be able to say with any high degree of certainty whether a particular tuning is having the desired impact without being able to measure it over an extended period. Yeah. Definitely. So we we not just collecting, like, spark metrics or YAR metrics or, like, system metrics. We also, collect data from the jobs itself. And you have the same code, but you can run, like, as you said, like, on different, like, data inputs, different, like, time periods, with different parameters. And it the same code will be executed, like, different in different,

variations

on the same hardware. And we're trying to, like, figure out how, all those parameters,

relate to each other and whether we actually need to improve our jobs or we are fine. Because, essentially, like, the basic question that we want to answer,

like, a couple. Like, is it is our our,

will our jobs work for the next, like, 3 years, and how much money they will burn,

and when we should start

looking closer at that? Because cost optimization is also a really huge part of, like, what we do. Yeah. It's definitely an interesting problem space. For anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. That's really good question,

Thomas.

See, like, I'm not super, like, in data management.

The space of the data that I work with is pretty limited. This volume is big. But from what I've seen is, like, workflow management is still, like, kinda not solved. We have, like, airflow,

which still has some certain problems.

We have, like, different other workflow managers and, overall, like, rescheduling

pipelines,

rerunning, reprocessing data,

waiting, alerting is like, I I haven't I am yet to see, like,

a a product that solve, like, workflow management to, like, complete degree. Yeah. I I agree that the workflow management space still has room to grow, and it's interesting to see some of the generational

approaches to it where,

you know, maybe 15, 20 years ago, we had things like SSIS,

and a lot of the GUI driven tools such as the,

Jasper soft suite

or, what came out with things like Pentaho. And then there was the sort of second generation,

which were things like Luigi or Airflow where we were moving more towards the workflow defined as code and being able to have it in a more software native approach. And then a lot of that was a bit

too rigid in terms of the way that it was defined and executed. And so now we're starting to hit the 3rd generation where we have tools such as Daxter and Prefect and Kinto that are trying to blend the requirements of data engineers and data scientists,

and there still seem to be some rough edges or

incomplete

execution of some of the overall requirements for this space. And then you've got things like Apache NiFi

that are trying to revisit the GUI oriented type of thing, but work in more of a

data flows type paradigm. So it's interesting to see a lot of the different generational

and paradigmatic

shifts in terms of how people are trying to deal with this as we deal with more data in terms of volume and variety and the different environments that we're dealing with it in and the types of consumers that are trying to be able to gain value from it. Yeah. Absolutely. I totally agree with you. And the other the other 1 that I was thinking about is that I get, like, I go to different conferences. I present in the conference about, like, things we've built with Spark, and lots of people, like, approach me and, start asking questions about, like, why my job is so slow. And if you go over, like, Spark, like, user release, you also, like, get, like, these kind of questions. And, this, like, map reduce model that we, that's been around, like, for 15 years or, like, I don't know, 20.

We still have we use it and,

the people there's none lack of understanding, like like,

why is that, like and still people still, like, puzzled about, like, why jobs are slow. And there's, like, no solution so far. Like, even Databricks haven't, like they don't have, like, magic wand that's gonna show you, oh, this is why your job is slow.

So there's, like, still,

room for, like, building expertise in that field.

Well, thank you very much for taking the time today to join me and share your experiences

of working at Datadog and the types of challenges that you're facing there. It's definitely an interesting platform that you've built and 1 that I use for managing my system. So I appreciate all the work that you've done, and I hope you enjoy the rest of your day. Okay. Thank you, Tobias. Thanks for having me, and I wish you all the best. And everyone, have a nice day.

For

listening. Don't forget to check out our other show, podcast dotinit@pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com

with your story. And to help other people find the show, please leave review on Itunes and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links