Building The DataDog Platform For Processing Timeseries Data At Massive Scale - Episode 113

Summary

DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with?
  • What are the main components of your platform for managing that information?
  • How are the data teams at DataDog organized and what are your primary responsibilities in the organization?
  • What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?
    • What are some of the strategies which have proven to be most useful in overcoming those challenges?
  • Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met?
  • Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information?
  • Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered?
  • What are some of the projects that you have planned for the upcoming months and years?
  • What are some of the technologies, patterns, or practices that you are hoping to adopt?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Vadim Semenov about how data engineering works at DataDog. So, Vadim, can you start by introducing yourself?
Vadim Semenov
0:01:46
Hi, everyone. Thank you, Tobias, for inviting me to speak on your podcast. I've been working with Hadoop and big data for the past seven or eight years. Previously I was working on scaling distributed systems, and at DataDog I've been working for the past four years as a software engineer slash data engineer, and I've been responsible for lots of things that we've built there.
Tobias Macey
0:02:13
And do you remember how you first got involved in the area of data management?
Vadim Semenov
0:02:16
Yeah, definitely, I remember. So the first company that I was working at was an ad tech company, and we had lots of problems regarding clients and websites where we were showing ads, and we had to provide consistent reports: figuring out different issues, like who saw our ads, where we showed them, how things compare, and where the fraud is. A lot of it was figuring out fraudulent traffic and the like, and there were a lot of logs involved. And there was a system that was built by a different team that I had to dig deep into, and it was built on Hadoop and Hive. Hive was version 0.9, I remember, or actually 0.6 or something like that, and Hadoop was also zero point something, running on EC2; I don't think we had EMR back then. That's how I got involved: I was really interested in the system, I started digging into Hadoop, and eventually those responsibilities were handed over to me, to scale the system overall and to push it forward. So that's how I got involved in the whole big data world.
Tobias Macey
0:03:31
So you got thrown into the deep end and just had to figure it out as you go, as do so many of us in engineering.
Vadim Semenov
0:03:36
Yeah, it was a pretty interesting time. There were not a lot of resources about Hadoop, and Hive was kind of the first tool that was easy to manage for lots of engineers, because you basically write SQL against an interface. But it only had the MapReduce v1 model, which wasn't super scalable compared to YARN, and there were lots of tricks around that.
Tobias Macey
0:04:00
And so one of the other things that I think is interesting is the fact that you mentioned that your title is software engineer in data. And I think it's interesting, the distinctions that are drawn between software engineers and data engineers, and where the primary focus or primary responsibilities lie. And so I'm wondering if you can talk a bit about the ways that your role is defined and the types of things that you're working on and thinking about at DataDog, and if there are any specific roles at DataDog that are more on the data engineering side, and where those distinctions and boundaries lie.
Vadim Semenov
0:04:39
So that's an interesting question. The term data engineer appeared maybe 10 years ago, something like that, and I was following it closely. It was really software engineering heavy: people were working on lots of scaling problems. But over time the term changed its meaning, and we see it: at a lot of companies, data engineer is defined loosely now. So at DataDog we started moving a little bit away from it, and for every candidate we try to explain what we mean by data engineer, or rather, what kind of data engineering problems we solve. In our case, a data engineer is not someone who just implements pipelines and writes code for analysts and data scientists. It's someone who tackles the most challenging problems and builds building blocks for other teams, and for DataDog overall. For example, when our data science team was working on anomaly detection, they had to build models using 30 days of data, and they started breaking our query cache, and they had to rebuild our query cache from Redis to Cassandra, which is not a typical thing for data scientists to work on, right? And they obviously got help, but they wrote some Chef cookbooks like the ones we used before, and overall deployed the system, because it was their feature to deliver. And the same applies to all the other engineers that we try to hire. We try to make sure that a person is not just an expert in their domain, but can also go into a different field and figure out lots of things for themselves. And that all shapes how we approach data engineering. So it's not someone who just gets requirements from other teams and builds pipelines and such.
Now our analysts and data scientists all build their own pipelines, they write tests, they manage their own clusters, and so on. That's why, compared to the overloaded industry concept of data engineering, we try to explain a lot about what we mean by it. It's not going to be just writing SQL queries; it's not going to be just running some regular pipelines. We have to go into some parts of our core code, we have to optimize the JVM, we have to figure out how memory works inside our computers, and so on. We work with binary file formats, we read data from Kafka, we manage different systems, we think about ops, and lots of other things.
Tobias Macey
0:07:29
So in a lot of ways, it seems like you could describe it as the SRE model, but applied to data problems, where you have somebody who is a quote-unquote expert in a broad variety of systems who acts as an enabler and a consultant to all the other teams, who are focused more on the product engineering side of things, to help them make sure that they have what is necessary to run what they're trying to build. And you're just trying to build out the foundational layers and the building blocks that help them do their job and help them understand how it all fits together.
Vadim Semenov
0:08:03
Overall, yeah, because our company is geared toward other engineers, DevOps, SREs, and other such people, this approach is at the core of everyone we try to bring to DataDog. But we have different teams that deal with different types of data. We have a metrics intake team that deals with a high volume of data, we have an analytics team that deals with lots of variable data, and we have a revenue team that has to make sure the numbers are correctly calculated. So there are different challenges for all those, quote-unquote, data engineers, if you will, and different teams have different requirements. But at the core, we try to make sure that everybody is someone who can deal with ops and monitoring, making sure that the system is fault tolerant and resilient.
Tobias Macey
0:09:01
And before we get too much further, can you just give a bit of a description about what DataDog is, for anybody who isn't familiar with it, and some of the types of data that you're dealing with and the types of scale that you're working at?
Vadim Semenov
So DataDog is a monitoring service for all kinds of data, from the ground level, like hardware, monitoring how your CPU is loaded and how much memory is used, to a higher level, like web servers or the database layer, how many queries you make, how many 500 errors you throw, up to the application level, where you can see what kinds of queries are slow, why they're slow, and so on. So we try to cover a huge range of problems that a typical engineer would face, and we help companies know whether they have issues, where the issues appear, and so on. The data that we deal with is pretty variable, but we can define categories: metrics, which are numbers; application performance monitoring, where lots of numbers and traces are tied together; and logs, which are basically text data. The volume of data I'm not sure I can disclose, but in terms of points, it's in the tens of trillions per day.
Tobias Macey
So when you're working at that type of scale, and you're dealing with time series data that is being surfaced to other engineers and operators for their mission-critical infrastructure, there's definitely a high requirement for reliability and uptime of your infrastructure. And so I'm curious what types of components you're relying on for the foundational platform for managing the ingestion and analysis and surfacing of that data, and some of the types of challenges that you're dealing with, particularly because of the volume and the high uptime requirement.
Vadim Semenov
0:10:54
Lots of people ask me this question, and they think that we use some kind of standard tools, but the only thing that is standard for us is Kafka. Most of the data we get and process gets put back into Kafka, and we have lots of different services that consume from Kafka. We also spread Kafka across different data centers, and we try to make sure that different customers get grouped together, and so on. So at the Kafka level we have some resiliency and fault tolerance, because that's the backbone of a lot of the things that we do, and for different services we have completely different Kafka clusters. Out of that, we have two distinct groups of consumers. The first one is real-time consumers, which mostly drive the last 24 hours of data and a lot of the alerting that we have. They don't store this data in the historical plane; they store it in completely custom-built data stores. In some places we use RocksDB, but as an embedded database, so not a lot of things, and we have some Cassandra, which we use for the query cache, as I said before. The other group of consumers is historical: from Kafka we also consume data and then write it in custom file formats to cold storage, and another system can easily get data either from the live system or from the historical system. That's how we show data, overall.
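One way to picture the two consumer groups reading the same stream is the sketch below. It's an illustrative in-memory stand-in, not DataDog's actual code and not a real Kafka client: the point it demonstrates is that in Kafka each consumer group tracks its own committed offsets, so the real-time alerting path and the historical writer can consume the same topic at independent paces without interfering with each other.

```python
from collections import defaultdict

class Topic:
    """A tiny stand-in for a Kafka topic: an append-only log that any
    number of consumer groups can read at their own pace."""
    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, record):
        self.log.append(record)

    def poll(self, group, max_records=10):
        start = self.offsets[group]
        batch = self.log[start:start + max_records]
        self.offsets[group] += len(batch)  # "commit" the group's new offset
        return batch

topic = Topic()
for point in [("cpu", 0.42), ("cpu", 0.57), ("mem", 0.81)]:
    topic.produce(point)

# The same three points reach both pipelines, in order, independently.
live_batch = topic.poll("realtime-alerting")    # drives alerting on recent data
historical_batch = topic.poll("historical-writer")  # feeds the cold-storage writer
```

Because each group's offset is separate, a slow historical writer never holds back the real-time consumers, which is what makes the one-backbone, many-consumers layout workable.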
Tobias Macey
0:12:36
I find it interesting that you're working with custom file formats, particularly for some of the historical data. I'm sure there are a lot of benefits that you gain by virtue of having them built for your specific use case, but at the same time, it adds a bit of extra burden for onboarding engineers, because they have to learn a specific tool rather than being able to translate the experiences that they've had from other companies. And so I'm curious what your experience has been along those lines, as far as any friction that is caused by having more custom tooling, where you may have been able to take advantage of something off the shelf for slightly less optimal performance or capabilities.
Vadim Semenov
0:13:20
Yeah, so the goal of those custom file formats is to allow customers to see data as soon as possible. Whenever you want a dashboard to open and you want to see, for example, seven days of data or one month of data, that's a lot of data points, and we have to show them as fast as possible; we try to make sure that within a minute you can get the data. So those file formats are really optimized for reading data. And for engineers, as long as you have the tools, from an outside perspective it shouldn't matter whether it's Parquet or some other file format, as long as you get data with a certain schema, right? You don't really care, as long as that schema is the same, and we have lots of tools for that. Overall there are some challenges, because the system is so big and there are lots of moving parts: there's not just a single file, there are lookup tables, index files, et cetera, et cetera. But we decided that's the way to go. And for some intermediate data we also use Parquet, so we're not only using our custom file formats. For example, for analytics purposes and other, more flexible use cases, we figured out that we actually want Parquet, and we do store Parquet, and we apply some higher-order aggregations there, so we don't store the whole volume of data in Parquet. In terms of data retention, we store everything at one-second resolution for over 15 months, and customers can obviously request longer periods. Essentially, our system is optimized so that if you want to see what was happening in your system last Black Friday, for example, to forecast what's going to happen this Black Friday, you can easily go to your dashboards and check what was happening, down to the second, in your system. That's what the historical data is useful for.
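The lookup-tables-and-index-files idea can be sketched in miniature. This is purely illustrative — DataDog's actual format is not public, and the record layout and block size here are invented — but it shows why a small offset index lets a read-optimized file serve a time-range query without scanning the whole file:

```python
import struct

# Hypothetical layout: the data file holds fixed-size (timestamp, value)
# records; a separate index maps each block's first timestamp to its byte
# offset, so a range read starts at the right block instead of at byte 0.
RECORD = struct.Struct("<qd")  # int64 epoch-seconds, float64 value
BLOCK_RECORDS = 2              # tiny blocks, just for the demo

def write_series(points):
    data = b"".join(RECORD.pack(ts, v) for ts, v in points)
    index = [(points[i][0], i * RECORD.size)
             for i in range(0, len(points), BLOCK_RECORDS)]
    return data, index

def read_range(data, index, start_ts, end_ts):
    # Seek to the last block whose first timestamp is <= start_ts.
    begin = 0
    for first_ts, offset in index:
        if first_ts <= start_ts:
            begin = offset
    out = []
    for off in range(begin, len(data), RECORD.size):
        ts, v = RECORD.unpack_from(data, off)
        if ts > end_ts:
            break  # records are time-ordered, so we can stop early
        if ts >= start_ts:
            out.append((ts, v))
    return out

points = [(100, 1.0), (101, 2.0), (102, 3.0), (103, 4.0)]
data, index = write_series(points)
recent = read_range(data, index, 102, 103)  # touches only the second block
```

A real format would add compression, per-block statistics, and a lookup table from series name to file, but the read path — index first, then a bounded scan — is the same shape.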
Tobias Macey
0:15:31
Yeah, being able to maintain that one second tick resolution for such a long timeframe, I'm sure has some storage challenges. I'm wondering what the compressibility of the information is, given that it's all working along discrete time intervals, and there might be some similarities in the patterns for being able to compact the actual files on disk.
Vadim Semenov
0:15:52
Yeah, so we haven't done rigorous research into how we can compress all the data, because, as you said, the data is so variable and there are different techniques, right? If you look at Parquet, they have four different techniques for encoding data. We don't do that, and we achieve about a 15x compression ratio. Against raw data it's probably actually even more, because I was comparing against already-compressed data from Kafka, and then we group data and compress the actual time series, so in reality it can be closer to, I don't know, 25 or 30.
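To see why regularly spaced timeseries compress so well, here's a toy experiment of my own — not DataDog's encoding. Delta-encoding one-second timestamps turns them into long runs of identical values, and a repeating metric shape gives the value bytes a short period, so even a generic compressor like zlib shrinks the whole thing dramatically:

```python
import struct
import zlib

def compress_series(timestamps, values):
    """Delta-encode the timestamps, then zlib the packed bytes.
    Regular one-second ticks become runs of identical deltas, which
    generic compression handles extremely well."""
    deltas = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    raw = struct.pack(f"<q{len(deltas)}q", timestamps[0], *deltas)
    raw += struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(raw, 9)

# One hour of one-second readings with a repeating shape.
ts = list(range(1_700_000_000, 1_700_000_000 + 3600))
vals = [0.5 + 0.1 * (i % 10) for i in range(3600)]

raw_size = len(ts) * 8 + len(vals) * 8       # 57,600 bytes uncompressed
compressed_size = len(compress_series(ts, vals))
ratio = raw_size / compressed_size
```

Real metric values are noisier than this synthetic series, which is one reason production ratios like the 15x he quotes are lower than what a toy like this achieves; specialized encodings (delta-of-delta timestamps, XOR-ed floats, as in Facebook's Gorilla paper) push further still.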
Tobias Macey
0:16:36
And then another interesting piece is that, because you're using these Kafka clusters running across multiple different data centers for ingesting customer data, I'm curious how you handle routing of that inbound traffic to reduce latencies and ensure that you're getting the optimal performance and time-to-alerting for people who are sending you metrics, and understanding what the locality of the origin is for being able to determine what those optimal routes are.
Vadim Semenov
0:17:07
Yeah, so we don't allow customers to go across clusters. If you're a customer, you're probably going to live in a certain cluster unless you have special agreements with us, and that's how we avoid a lot of inter-cluster traffic, right? The clusters are pretty much independent, and within our systems we can quickly switch different parts of our systems between clusters, and we try not to maintain state as much as possible. For example, for historical data we only keep the last 10 minutes, and whenever we switch to a new cluster, we just need to replay the last 10 minutes of data, so we don't need our clusters to talk to each other. But within a region we obviously deploy a couple of clusters in multiple availability zones, and that's where we get a lot of traffic. Unfortunately, I don't really know all the details, because it falls outside of my work; we have completely separate teams that work on Kafka, called data reliability engineers.
Tobias Macey
0:18:19
Yeah. So that's a good opportunity to talk a bit more about what the team structure looks like for the people who are working closely with the data at DataDog, what your particular responsibilities are, and how you work within the organization, particularly given the fact that DataDog has grown in size pretty significantly over the past few years, and just how you coordinate the products that you're working on across team boundaries and across geographical boundaries.
Vadim Semenov
0:18:46
Yeah, so first of all, when I joined DataDog four years ago, we were a 150-person company, and over four years we've grown to 1,300 people, so that's roughly 8x growth, I guess. And that put a lot of pressure on how we structure teams and how we do work overall. At first we had a pretty flat organization, and now we've added layers. We have people who gather requirements, like product managers, directors, and team leads, and we compose objectives and key results for every quarter, and we have sprints, obviously. But besides that, we've started doing lots of documents. Whenever we're working on some big project, or actually no matter what kind of project, we create a document where we request comments, and lots of different teams can come and comment, and we ask the different teams what they think: how it should look, how it's going to work with other systems and other teams. Obviously it's not perfect; there are still problems and some challenges, but we're working on them, and, like probably any growing company, we run into these issues and we're trying to overcome them. Hope that answers your question.
Tobias Macey
0:20:16
Yeah, no, that's definitely good information. It's always interesting seeing what the team dynamics happen to be and what the breakdown of responsibilities is across different organizations, because in broad strokes we're all working with data, we're all doing what looks to be the same thing at a high level. But as you get closer in, there are a lot of different ways that people break down the responsibilities and what the main areas of focus are, and it's interesting how the specifics of the business influence or dictate what those boundaries happen to be.
Vadim Semenov
0:20:49
Yeah, when we start a new project, it's all within a team: a couple of people are working on it, and over time it grows into a bigger project, and then out of this a new team gets born. That's usually the dynamic of how we break things up: some small project grows to a certain size, and people break off into new teams. The funny part about that is how we handle ops, because you have a big system and lots of people on a kind of rotation to support the system, and once you break it apart into different teams, who's responsible for what becomes a little bit interesting to figure out.
Tobias Macey
0:21:41
And then, in terms of the work that you're doing, who are the main consumers of what you're building, and how do you work in feedback cycles to ensure that their needs are being met, and that you're meeting the future requirements for the types of systems and primitives that they require to be able to get their job done?
Vadim Semenov
0:22:01
So I work kind of in between. I deal with receiving data directly from Kafka and storing it in custom binary file formats, and then also providing it in different formats to other teams. So for me, we have both customer-facing features and internal SLAs that we have to support between, for example, the revenue team and analytics. Up until maybe half a year or a year ago, a lot of the requirements came from how we operate all the production pipelines that we've built and how well they can scale. Lots of things that we built early on were not built with such high growth in mind, and they started breaking apart, and that's mostly what we were working on for the past three years, at least in my teams. Now we're getting into territory where we've fixed most of the scaling issues and we don't really have them anymore; we've been building systems for 10x growth, which should be enough for another three years. And now we can concentrate on other things, and the requirements now come from different parts of the product. For example, we recently released histograms, for which our data scientists developed a new sketch algorithm, and we're trying to make that available for historical data as well. We also released SLOs and SLIs for our metrics, and we're trying to make those work for historical data too. So now it's more a matter of bringing features that we have in the live systems to historical data, and a lot of work is needed. We also have lots of development on the data science front: we want to apply machine learning not just to live data, but to historical data over larger periods of time.
Tobias Macey
0:24:07
And I know that machine learning is often something that's challenging to run in a production context, because of the fact that there can be model drift, and there's not really an easy way to do deterministic testing of the model to make sure that it's operating optimally. And so I'm curious what types of challenges you're dealing with, or platforms that you're leveraging, to be able to handle those machine learning models in a production environment for doing those historical analyses.
Vadim Semenov
0:24:38
So I'm not really aware of everything in the machine learning offering that we have, so I might be wrong about some parts, sorry. But overall, the main challenge there is how you create a general approach for all the different kinds of time series data that customers have, how you generalize that. There are only so many things you can try and build, and you also have to make sure that it runs fast, doesn't have lots of false positives, and so on. We released our first anomaly and outlier detection algorithms about three years ago, and we saw some adoption. But personally, when I tried it, I found it difficult to interpret the models, you know? When it shows you that you have an outlier or an anomaly here, you're like, why? Why did the model decide it's a problem? So personally I'm not sure how well that lands, and I'm actually not sure how this problem can be solved overall. We have other offerings, for example Watchdog, where we try to find related stories. So, for example, whenever you have a spike in database connections, we can relate it to some other time series and show you that maybe those are related, and that's where you should look. We also have some other machine learning capabilities, such as finding patterns in logs, which is really helpful when you have a constant stream of logs and you want to figure out what happens often and what happens rarely. But in general, applying machine learning at such a big scale is really difficult, and companies who have their own internal monitoring solutions can really flex there, because they can build certain models that are tied closely to their own problems.
Tobias Macey
0:26:38
Yeah, it's definitely interesting, in your particular case, building these models that feed into some of the decisions that other operators are making based on the anomalies that get surfaced, and being able to explain in a quick and accessible manner where those decisions are coming from, and what types of actions might be useful, or what types of insights you're trying to convey to them, so that they can take the necessary actions or determine whether an action is even necessary at all. Yeah, definitely true.
Vadim Semenov
0:27:09
Overall, yeah, that's the goal of the industry, and one of the reasons I joined DataDog: I wanted to help with that. We have so much data, and being able to correlate it and figure out problems early on would be really nice to have. But it requires tons and tons of compute resources for shuffling all of it around.
Tobias Macey
0:27:33
And then the other interesting thing about the problem space you're in is that you're reliant on the end user delivering data to you, and it's time series, where there are always issues about the order of delivery: you might have somebody who's sending data out of order, or they might have agents with networking issues that bunch up a batch of data and then deliver it all at once, but deliver it late. And then there's also the fact that, because you have end users sending the information, as long as they're using your agents it's likely that the schema is going to be accurate, but they might decide to start sending extraneous information or misformatted data that doesn't match the schema that you're anticipating. And so I'm curious how you deal with all of those issues of data cleanliness, and some of the challenges that are specific to time series.
Vadim Semenov
0:28:24
yeah. So the first, the first front is figuring out if data is small farm or not, and we quickly filter it out if it's small forum on our end, and about late and future Ivan data, we have 14 lose windows. So everywhere we say that we accept up to 15 minutes in future like so if your point came with a timestamp says in future will accept as long as not farther than 15 minutes. And for later, Ivan points Yes, as you said like sometimes like services So host like a slow, there's like some bottleneck problems, or you can send like custom completely time series, we accept up to three hours in the past. So we have to wait until we can actually start processing.
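The acceptance window described here (up to 15 minutes in the future, up to three hours in the past) amounts to a simple timestamp filter at ingest. A minimal sketch, with illustrative constants and function names rather than DataDog's actual code:

```python
import time

# Bounds taken from the conversation; names are illustrative.
MAX_FUTURE_SECONDS = 15 * 60       # accept up to 15 min ahead of now
MAX_PAST_SECONDS = 3 * 60 * 60     # accept up to 3 h behind now

def accept_point(point_ts, now=None):
    """Return True if a point's timestamp falls inside the ingest window."""
    now = time.time() if now is None else now
    if point_ts > now + MAX_FUTURE_SECONDS:
        return False  # too far in the future
    if point_ts < now - MAX_PAST_SECONDS:
        return False  # too late to accept
    return True

now = 1_000_000
print(accept_point(now + 10 * 60, now))   # 10 min ahead → True
print(accept_point(now + 20 * 60, now))   # 20 min ahead → False
print(accept_point(now - 2 * 3600, now))  # 2 h late → True
print(accept_point(now - 4 * 3600, now))  # 4 h late → False
```

The consequence of the wide past bound is exactly what Vadim notes next: downstream processing has to hold back until the window closes.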
Tobias Macey
0:29:12
Yeah. And I know that that can be challenging for the application logic you're dealing with, because you're trying to build some windowing, or some insights into the stream of data as it's coming in, and then all of a sudden you get a whole other batch of data points from two hours ago, and you need to recompute some correlations between two different systems or two different metrics. So I'm wondering whether you have had to deal with that in your own work, as far as being able to rely on the data?
Vadim Semenov
0:29:41
Yeah. So because we wait three hours into the past and one hour into the future, we store data in four-hour chunks, and basically we have to use eight-hour windows, and those eight-hour windows overlap. And then we have other problems: we also have orgs that can migrate to different shards, chunks, etc., and tying all this data together is a big problem. We've been working hard to figure out how to do migrations correctly. For example, if a customer grows too big for a single shard and we move them to a new shard, how is that going to work with the overlapping data, where is the late-arriving data going to go, and so on. I wouldn't say we've solved all the problems, but a lot of thinking went into it. And the problem is, as you said, lots of systems use the same data set, and they all have to have those capabilities too. We defined some strict rules about doing such migrations, and we try to follow them, but overall we're still trying to solve all those problems with overlapping data and time windows.
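The chunk-and-window arithmetic can be made concrete. A minimal sketch, assuming the four-hour chunks and overlapping eight-hour windows described above (all names are hypothetical):

```python
CHUNK_SECONDS = 4 * 60 * 60  # four-hour storage chunks

def chunk_index(ts):
    """Index of the four-hour chunk a timestamp falls into."""
    return ts // CHUNK_SECONDS

def chunks_in_window(window_start):
    """An eight-hour processing window covers two adjacent four-hour
    chunks, so consecutive windows overlap by one chunk; a point
    arriving up to ~3 h late is still processed alongside its
    neighbours."""
    first = window_start // CHUNK_SECONDS
    return [first, first + 1]

print(chunk_index(5 * 3600))       # timestamp at 05:00 → chunk 1
print(chunks_in_window(0))         # window starting 00:00 → [0, 1]
print(chunks_in_window(4 * 3600))  # window starting 04:00 → [1, 2]
```

The overlap is visible in the last two lines: chunk 1 is covered by both windows, which is what lets late points be reconciled, and also what makes shard migrations awkward, since a moving org's data can straddle two windows at once.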
Tobias Macey
0:30:59
You've been there for a few years now. I'm curious what you are most proud of, or what have been some of the most interesting projects you have been engaged with? And out of those, are there any lessons you have found to be particularly valuable or unexpected, or just interesting issues you've had to confront?
Vadim Semenov
0:31:18
Oh, yeah. Initially, when I was hired, the job was to bring Spark support to DataDog. We used to use Pig as our processing framework, and bringing in Spark support was pretty challenging. We have an internal platform that we use to launch jobs, kind of like Qubole or Databricks, and I had to write lots of code there: figure out optimal settings for all the Spark clusters, write a bunch of tutorials, figure out how we were going to compile code, how code was going to be delivered, how it would signal to our workflow management framework that work is done, and lots of other challenging aspects. Once it was built, we started moving some historical processing towards it. The challenge was the version of Spark we used back then: it was 1.6, Spark 2.0 was still in development, and we were pretty hesitant to use it. We thought Spark 1.6 was great, reliable and stable, but it turns out it's not; there are lots of places where it crashed. Over the past four years we've grown some expertise around Spark, but it still throws some interesting problems on our plate. And while we were working on these different systems, we had to dig deep into how the JVM works, how memory is allocated, how much space our data structures take, how garbage collection works, how to put everything off-heap, and so on. We actually found some bugs in Scala itself and in Spark. We didn't fix them ourselves; we just noticed that, yeah, actually arrays can be bigger than this number, and so on. One of the things I learned is that, at the scale we use Spark, not a lot of companies actually use it, and lots of the problems we've run into nobody had seen before.
So we had to figure out on our own how we were going to handle this and that. On top of that, we run most of our jobs on spot instances, which means your cloud provider can take them away at any time. This creates such a volatile environment. When you have clusters of 5,000 cores and your instances are constantly dying while Spark runs on them, you realize that probably not everyone is doing what we're doing. And whenever you have a problem, you're like, okay, I'm on my own. I need to get all the stack traces and all the logs and figure out what's up. And that's just a small portion of the problems you run into.
Tobias Macey
0:34:04
And the last four years have also been an interesting time in terms of the overall shift in direction for both data platforms and operations and infrastructure platforms, particularly with the rise of Kubernetes and container-based systems generally, and the proliferation of cloud platforms, cloud environments, and different services. I know that DataDog also supports a number of direct integrations with things such as Amazon, other cloud providers, and third-party SaaS platforms. So I'm curious what that overall shift in the technology landscape has meant for you and your work at DataDog, and for the requirements of the systems you're trying to build.
Vadim Semenov
0:34:49
Yeah, definitely. So overall, as you said, the cloud is still growing, and there's a second wave where everyone is moving to containers and Kubernetes and everything, right? From my perspective, one of the biggest problems we had to face is that containers live for such a short period of time, and for each one you're going to get so many container IDs. Some of our systems were not built for that churn of tags, basically, and we had to tackle various challenging problems there. The other part is that we made a huge push to move all our services to Kubernetes, and as part of my job I had to move lots of services to Kubernetes as well. That's not always part of what data engineers do, but that was a problem we had to solve at DataDog, so I got hands-on with Kubernetes. Overall, we use Kubernetes extensively at DataDog. There are some challenges, and it's still difficult for me to judge how easy it is for an ordinary, typical engineer to use, but I can see the benefits of Kubernetes, and I can see the platform prevailing. We've been releasing lots of reports about cloud adoption and container adoption, and we've been showing that Kubernetes is growing and Docker is growing as well. So that's just the reality we have to live with.
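The tag-churn problem can be made concrete: each short-lived container contributes a fresh `container_id` tag value, so the number of distinct timeseries for a metric grows with churn rather than with fleet size. A small illustrative sketch (the counting model and names are assumptions, not DataDog internals):

```python
from itertools import count

def distinct_series(fleet_size, restarts_per_host, tag_by_container_id=True):
    """Count distinct timeseries for one metric under container churn.

    With a stable `host` tag the series count equals the fleet size;
    tagging by an ephemeral container ID multiplies it by the churn.
    """
    ids = count()  # stand-in for unique container IDs
    series = set()
    for host in range(fleet_size):
        for _ in range(restarts_per_host):
            tag = f"container-{next(ids)}" if tag_by_container_id else f"host-{host}"
            series.add(tag)
    return len(series)

print(distinct_series(100, 50, tag_by_container_id=False))  # → 100
print(distinct_series(100, 50, tag_by_container_id=True))   # → 5000
```

A 100-host fleet with 50 container restarts per host yields 5,000 series instead of 100, which is why systems designed around long-lived hosts struggle once containers arrive.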
Tobias Macey
0:36:25
As you look forward to some of the projects you've got planned for the coming months and years, I'm curious: what are some of the technologies, best practices, or overall patterns and system designs that you're trying to keep an eye on or hoping to adopt? And what are some of the challenges you're anticipating as you move forward?
Vadim Semenov
0:36:48
Yeah. So at DataDog we're not really hyped about different technologies. We've tried some of them and we saw problems, and at our scale, bringing in new technology is risky; we have to be conscious about how we approach it. So we don't put a lot of effort into figuring out what kinds of technologies we're going to use. Instead, we put a lot of effort into how we approach engineering, and building fault-tolerant, resilient systems is where we want to be. Overall, the problems we're going to face next, from my perspective, are related to that. The immediate one is probably ops. The number of services keeps growing, but the number of engineers is not growing as fast, and a human can only keep so many things in their head. You want to make sure that ops is automated. So we need to build systems that can auto-recover, that have retries, that have proper logging, monitoring, alerting, and so on, and where we can replay data easily. The other part, which I think is going to be huge for us, is migrations. At our scale, when we develop new systems, we can't just roll them out easily; we have to run old and new side by side for a certain period of time, half a year for example. And when you run huge systems at such scale, they burn lots of money. So you have to figure out instruments that let you move data between the two. And then, on the other hand, you have systems that consume this data, and you have to make sure those systems work reliably against partially migrated systems. We haven't seen that problem being solved anywhere; there's not a lot of guidance, and we have to develop all our own techniques for it.
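The "ops should be automated" point, systems that auto-recover and retry instead of paging a human, can be sketched with a standard retry-with-backoff wrapper. This is a generic pattern, not DataDog's internal tooling:

```python
import random
import time

def run_with_retries(job, max_attempts=5, base_delay=1.0):
    """Run a job, retrying transient failures with exponential backoff
    plus jitter, so the system recovers on its own instead of paging
    a human."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, base_delay))

# A job that fails twice and then succeeds recovers automatically.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"

print(run_with_retries(flaky, base_delay=0.01))  # → done
```

The jitter matters at scale: without it, many failed jobs retry in lockstep and hammer the recovering dependency all at once.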
Tobias Macey
0:38:52
Are there any other aspects of your work at DataDog, the types of projects that you're building, or the platform in general that we didn't discuss yet that you'd like to cover before we close out the show?
Vadim Semenov
0:39:02
We have some internal projects around job performance monitoring. I feel like lots of companies, lots of people, run their jobs and pipelines, but nobody tracks how the performance of those jobs degrades or improves over time, or with certain code pushes, etc. In our case, we have hundreds of jobs that we run, and we're trying to see how particular changes improve or break those jobs. We're working on it, but I'm not sure how or when we're going to release it. I've researched other open source projects and haven't seen anything like this, and I've had conversations with engineers at other companies. I feel like that's one of the problems data engineers, and businesses overall, have: how can we measure how efficient your jobs are?
Tobias Macey
0:40:01
Yeah, that's definitely a challenge, because there is such variability and seasonality in the data that you're dealing with. The execution time of a job at one point in time isn't necessarily indicative of its overall performance, because it's highly dependent on the data it's working with. So unless you have very consistent data sets that you're processing on a regular basis, it's difficult to say with any high degree of certainty whether a particular tuning is having the desired impact, without being able to measure it over an extended period.
Vadim Semenov
0:40:33
Yeah, definitely. So we're not just collecting Spark metrics or system metrics; we also collect data from the jobs themselves. You have the same code, but, as you said, you can run it on different data inputs, different time periods, with different parameters, and the same code will execute differently in different variations on the same hardware. We're trying to figure out how all those parameters relate to each other, and whether we actually need to improve our jobs. Because essentially, the basic questions we want to answer are: will our jobs still work for the next three years, how much money will they burn, and when should we start looking closer at them? Cost optimization is also a really huge part of what we do.
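One common way to make runs on different days comparable, in the spirit of what's described here, is to normalize wall-clock time by input size. A minimal sketch with hypothetical field names:

```python
def throughput(runs):
    """Normalize each run's wall-clock time by its input size, so runs
    on different days (and different data volumes) can be compared.

    `runs` is a list of dicts with hypothetical keys:
    'date', 'records_in', 'seconds'.
    """
    return [
        {"date": r["date"],
         "us_per_record": 1e6 * r["seconds"] / r["records_in"]}
        for r in runs
    ]

runs = [
    {"date": "2020-01-01", "records_in": 2_000_000, "seconds": 600},
    {"date": "2020-01-02", "records_in": 4_000_000, "seconds": 1150},
]
for row in throughput(runs):
    print(row["date"], round(row["us_per_record"], 1))
# → 2020-01-01 300.0
# → 2020-01-02 287.5
```

Note the second run took almost twice as long on the wall clock yet was slightly faster per record, which is exactly why raw runtime alone is a misleading performance signal.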
Tobias Macey
0:41:24
Yeah, it's definitely an interesting problem space. For anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Vadim Semenov
0:41:42
That's a really good question, Tobias. See, I'm not super deep into data management; the space of data that I work with is pretty limited, even though it's big. But from what I've seen, workflow management is still kind of not solved. We have Airflow, which still has certain problems, and we have various other workflow managers, but around rescheduling pipelines, rewriting and reprocessing data, retrying, and alerting, I have
0:42:13
yet to see a
0:42:15
product that solves workflow management completely.
Tobias Macey
0:42:20
Yeah, I agree that the workflow management space still has room to grow. And it's interesting to see some of the generational approaches to it, where, you know, maybe 15-20 years ago we had things like SSIS and a lot of GUI-driven tools, such as the Jaspersoft suite, or what came out with things like Pentaho. Then there was the second generation, things like Luigi or Airflow, where we moved towards the workflow defined as code, in a more software-native approach. A lot of that was a bit too rigid in the way it was defined and executed, so now we're hitting the third generation, with tools such as Dagster, Prefect, and Kedro, that are trying to blend the requirements of data engineers and data scientists, and there still seem to be some rough edges or incomplete execution of some of the overall requirements in this space. And then you've got things like Apache NiFi that are trying to revisit the GUI-oriented approach, but working in more of a dataflow paradigm. So it's interesting to see all the different generational and paradigmatic shifts in how people are trying to deal with this, as we deal with more data in terms of volume and variety, the different environments we're dealing with it in, and the types of consumers that are trying to gain value from it.
Vadim Semenov
0:43:48
Yeah, absolutely, I totally agree with you. The other gap I was thinking about: I go to different conferences and present about the things we've built with Spark, and lots of people approach me and start asking questions like, why is my job so slow? If you go through the Spark user mailing list, you also get these kinds of questions. This MapReduce model has been around for 15, or I don't know, 20 years; we still use it, and there's a real lack of understanding of why jobs are slow. There's no solution so far; even Databricks doesn't have a magic wand that's going to show you, oh, this is why your job is slow. So there's still room for building expertise in that area.
Tobias Macey
0:44:40
Well, thank you very much for taking the time today to join me and share your experiences of working at DataDog and the types of challenges you're facing there. It's definitely an interesting platform that you've built, and one that I use for managing my own systems, so I appreciate all the work that you've done, and I hope you enjoy the rest of your day.
Vadim Semenov
0:44:56
Okay, thank you, Tobias. Thanks for having me. I wish you all the best, and everyone have a nice day.
Tobias Macey
0:45:08
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.