Summary
In this episode of the Data Engineering Podcast Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large scale data processing and her insights on the future trajectory of the supporting technologies
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the ways that operating at large scale changes the ways that you need to think about the design of data systems?
- When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large-scale data systems that demand automation?
- How can those large-scale automation principles be down-scaled to the systems that the rest of the world is operating?
- A perennial problem in data engineering is that of data quality. The past 4 years have seen a significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high volume data flows, what are the elements of data validation that are still unsolved?
- Generative AI has taken the world by storm over the past couple years. How has that changed the ways that you approach your daily work?
- What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?
- What are the most interesting, innovative, or unexpected ways that you have seen solutions to large-scale data management designed?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data management across axes of scale?
- What are the ways that you are thinking about the future trajectory of your work?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey. And today, I'm talking to Tulika Bhatt about her experiences working on large scale data processing at Netflix and her insights on the future trajectory of the supporting technologies. So, Tulika, can you start by introducing yourself?
[00:01:02] Tulika Bhatt:
Sure. Hey, everyone. I'm Tulika, and I'm currently a senior software engineer at Netflix. And I work in the impression space, basically creating, like, you know, data services and datasets that power the impressions work at Netflix. Before that, I used to work at BlackRock, and I worked on a lot of, like, you know, mission critical financial applications. And towards the end, I was working on the data science platform. And I also spent some time at Verizon building applications.
[00:01:37] Tobias Macey:
And do you remember how you first got started working in data and what it is about the space that keeps you interested?
[00:01:43] Tulika Bhatt:
I would say that I kind of, like, accidentally ventured into data. It wasn't a very intentional start, but then, like, you know, every role sort of brought me closer and closer to data engineering. So for example, at BlackRock, I worked in a variety of different teams and roles. And towards the end, I started working on data science platform tools. So it was just like, you know, creating customized Jupyter notebooks and, like, cron jobs for data scientists and, like, you know, creating those libraries so that they could access data, like, easily without having to figure out the behind the scenes infrastructure issues. And, you know, with the platform we built at, like, BlackRock, we decided to sort of, like, sell it outside, and then there was this idea of creating, like, a usage based billing system, which would, like, you know, create events, and, you know, you would process and crunch those events to generate bills. So I felt like that was, like, sort of, like, moving more towards, like, you know, data. It was eventually, like, you know, modeling those events, creating events, and then sort of, like, crunching them and generating bills.
And I enjoyed my work, but I wanted to do something with, like, really large scale systems. And I got the opportunity to work at Netflix, and I jumped at it. And, yeah, I'm here now sort of working in the impression space.
[00:03:10] Tobias Macey:
With the focus on impressions, obviously, that's a very specific category of data. Netflix, as an organization, is very large, has a very disparate set of data processing systems, various requirements around those different systems. Wondering if you can give some framing about the characteristics of the platforms that you're building and some of the requirements as far as latency, uptime, etcetera, and how that frames the ways that you think about the work to be done.
[00:03:41] Tulika Bhatt:
Sure. So I'll first of all define impression so that we are on the same page. So when you, like, log in to Netflix, you, like, see the home page, and you see a bunch of, like, images on the home page. So we call those images impressions. And those are, like, you know, sort of the gateway of you discovering the product and interacting with it and eventually leading to plays. So as you see, like, you know, impressions are, like, you know, really fundamental for discovery. So it's a really important dataset at Netflix. So we use this piece of data in a variety of forms. We use it for personalization.
So to see, like, which content or impression are you interacting with, and is it leading to plays; that gives us a signal, okay, these are the contents that you like. Then we also use it for, like, you know, actually constructing the home page. So it's sort of, like, used for a variety of purposes, like, you know, the business use case of how the home page is created. So as you can, like, you know, hear from these use cases, it's kind of fairly clear that it's needed both in the batch world as well as, like, in the online services. So if you're creating home pages, obviously, the latency has to be, like, as low as possible. Like, it has to be real time. And we also, like, create, like, you know, aggregated datasets on impressions for, like, a variety of personalization and, like, you know, model training use cases.
[00:05:17] Tobias Macey:
Working on large scale systems, there are numerous technologies that have been built specifically for high speed, high volume. I'm thinking in particular about the Kafkas and Flinks of the world. I'm wondering, with the requirements around the type of data that you're working with, the speed at which it's coming through, and the latencies at which you're trying to deliver actionable insights to the downstream consumers, how much of the technology that you're working with are you able to pull off the shelf, whether open source projects or commercial projects, versus having to think about it from, you know, greenfield architecture design principles of: this is what I need to do, these are the primitives that I need to be able to build from, and because I have a very bespoke need, I need to build a custom system. And, you know, what the gradations are along those axes of just pulling off the shelf to building something from whole cloth.
[00:06:24] Tulika Bhatt:
Yeah. So I think, like, you know, whenever we are evaluating, like, technology, obviously, the first decision is, like, you know, whether you're gonna get something from open source or whether you need to build it. As much as possible, I think the first step we do is we evaluate, like, you know, if there is already a solution available. So, like, you know, for most purposes, like, you know, Spark or, like, you know, Flink or Kafka are kind of, like, very good at what they're doing. So we don't just go about and, like, reinvent, like, those things again. We just use them, like, for their purposes. But there might be certain, like, you know, use cases where we are hitting the boundaries, and then we need to think of, like, you know, a customized solution in order to, like, you know, achieve that.
For example, there was this one use case where, like, you know, I was crunching something using Spark and Iceberg, and then I wanted to populate it to an online data store like Cassandra. And, you know, that was something where, like, you know, there was just no tooling available for us to just do it directly. You could write a script going through, like, you know, all the rows and, like, you know, just batching it and writing it to Cassandra row by row directly, but it was just, like, you know, impossible to kind of, like, tune at scale. It took a lot of time, and it would overwhelm Cassandra a lot, and it wasn't working. So we kind of worked with the platform team, and they built this custom solution of actually, like, you know, taking the data in Iceberg and just converting it to, you know, SST files and just directly sort of, like, you know, copying that in. Actually, we didn't eventually send that to Cassandra. We ended up, like, copying that into RocksDB and serving the online use case. But that was something where, like, you know, there was no sort of, like, direct tooling available, so we had to be a little bit creative about what needs to be done to solve this particular problem.
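For context on the approach that didn't scale, here is a minimal sketch of that kind of script: scan an Iceberg table with Spark and push rows into Cassandra in throttled batches. All table, keyspace, and host names are hypothetical examples, and this illustrates the naive path Tulika describes rather than the file-based bulk load the platform team eventually built.

```python
# Naive Iceberg -> Cassandra copy: stream rows off the Spark driver and write
# them in capped-concurrency batches. This is the pattern that was too slow
# and too hard on Cassandra at Netflix's volumes.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-to-cassandra").getOrCreate()

# Read the precomputed aggregates out of Iceberg (catalog/table names are examples).
rows = spark.read.table("prod_catalog.impressions.yearly_aggregates") \
            .select("profile_id", "aggregate_blob") \
            .toLocalIterator()   # stream rows instead of collect() to limit driver memory

cluster = Cluster(["cassandra-host-1", "cassandra-host-2"])   # example contact points
session = cluster.connect("impressions_ks")
insert = session.prepare(
    "INSERT INTO yearly_aggregates (profile_id, aggregate_blob) VALUES (?, ?)"
)

batch, BATCH_SIZE = [], 500
for row in rows:
    batch.append((row["profile_id"], row["aggregate_blob"]))
    if len(batch) == BATCH_SIZE:
        # Concurrency is capped to avoid overwhelming Cassandra; even so, this
        # row-by-row path struggles once the table is large enough.
        execute_concurrent_with_args(session, insert, batch, concurrency=50)
        batch.clear()
if batch:
    execute_concurrent_with_args(session, insert, batch, concurrency=50)
```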
[00:08:34] Tobias Macey:
One of the benefits that you get by leaning on existing tools such as Spark, Flink, etcetera, is that they have had a lot of investment from the broader community, particularly around things like deployment, observability, being able to pull together different components that are designed for various use cases. Whereas the benefit that you get from building custom is that you're able to design it specifically to suit your need. You don't have to worry about bringing in a whole bunch of extra functionality that's just dead weight and extra complexity for the problem you're trying to solve. And, obviously, when you're at a company like Netflix, you have enough resources to be able to invest in custom development, but you also don't have infinite time. And I'm wondering, at least within your team, what the general heuristic is as far as the build versus buy and how far are you willing to push one of these off the shelf systems before you decide that you actually need to split out from that and build something on your own or build some additional component that, solves the need that you want within that framework?
[00:09:40] Tulika Bhatt:
So I think this is a really interesting question. There's always this talk about build versus buy. So, like I said, like, you know, for most of the use cases, we first evaluate, like, how much we can push the original, like, you know, solution which is out there, the open source one. We do have a flavor of, like, Spark internally, but it's not exactly the open source Spark. There's a platform team which is kind of, like, wrapping our requirements, our needs over it, which is, like, solving all our authentication and other, like, you know, problems so you don't have to do everything from scratch.
Then, I think I'm extremely lucky working at Netflix because it's a place where you can always, sort of, like, innovate. There's always a, like, you know, very open innovation culture. So if you feel like, you know, there is a better solution out there and, you know, you believe in it, at Netflix, I think we are very, like, you know, eager to actually experiment and, you know, kind of, like, build something on our own if we feel that it suits our need. And this, I feel, is different because, like, you know, in a lot of other, like, you know, companies, there's not that much freedom; in a way, you kinda have to stay within, like, you know, company sanctioned technologies or something. So I don't know if I answered your question.
[00:11:16] Tobias Macey:
No. That's definitely helpful. And another aspect of that problem is that in order to even have an informed decision about when and whether to break away from the existing frameworks, you have to have enough domain knowledge about the problem space, about the technologies, and about the core primitives and software principles about how these systems work, which obviously requires a lot of experience and on the job learning as well as whatever education you have coming into it. And while it's definitely very possible to become an expert in Spark or in Flink because of the resources that are available, it's much more complicated to get that breadth of knowledge, and it's not something that you necessarily have a road map to go down to say, okay, these are all of the things I need to know, and these are how I obtain all of them. It's usually a very ad hoc experience. And from your career path and your experiences of working at BlackRock and now at Netflix and Verizon prior to BlackRock, what do you see as the opportunities for that learning that have been most useful and some of the ways that you've had to stretch your knowledge and gain new skills in order to be able to tackle these problems as they arise, and just some of the challenge that that poses as an engineer to be able to have that breadth and depth of understanding.
[00:12:41] Tulika Bhatt:
Yeah. Definitely. I feel like, you know, working in the software world is just, like, you know, immense. It's just like an ocean of knowledge. You cannot, like, claim to be an expert in anything because there's just so much out there that you don't know. So, like, for example, while I was working at BlackRock and we were sort of, like, you know, experimenting with Kubernetes, I felt like, you know, having a good breadth of foundational knowledge is good. So, for example, if you're evaluating which cloud data technology to use or which database to use, I think having the foundational knowledge of, like, you know, how this database works, what are, like, you know, the positives, negatives, what is your particular use case, is it, like, write heavy, read heavy, what is the latency requirement, how are you gonna partition it, like, all of those fundamentals are really important.
And from there, you kind of, like, you know, go about looking for, like, you know, different options available online and, like, you know, comparing them. And I think you can only choose to be an expert in certain, like, you know, tools. So for example, you can choose to be an expert in, like, say, Flink or Spark or, you know, a certain database. But, yeah, I wouldn't imagine, like, anybody to, like, you know, have the whole breadth of knowledge as well as, you know, expertise in, like, you know, everything. And that's kind of, like, I guess, sort of impossible.
[00:14:13] Tobias Macey:
In terms of the team that you're working with, what are some of the ways that you as a group lean on each other to accommodate the knowledge gaps and understanding that each of you have and ways that you're able to work together to be able to understand the entire breadth of the problem space and help level each other up, particularly if somebody's coming in where they say, I got hired on because I'm an expert in Spark, but now we're dealing with Iceberg tables or now we're dealing with Flink stream processing, and I'm way out of my depth here.
[00:14:44] Tulika Bhatt:
We definitely lean on each other for that sort of, like, you know, help. We also have dedicated, like, you know, platform teams who are, like, you know, experts on certain things, and their job, literally, is to have, like, you know, an eye on the market and keep evaluating what new technologies are out there. And then they also kind of, like, you know, bring up reports where they evaluate something which is, like, out there and say, oh, we think this is something good, and we can think about, like, you know, adopting this technology sort of, like, you know, internally.
Or whether they think that this might not suit our company so much. So, like, you know, I'm very lucky to have that resource. They do the first line of groundwork for us when we are sort of evaluating if we need to go after something else, and we lean on them, get their expertise. If we do not agree with them, or if we feel like our use case is very niche, then we can also, like, kind of, like, you know, put forward a counterargument that, hey, I think this, like, you know, works for us and this suits us. So as long as you have enough good arguments and data points, like, you know, backing your use case, you can just go and use anything that you want over here.
[00:16:07] Tobias Macey:
And then particularly along the axis of reliability, observability, and data quality management, when you are building custom components or extensions to some of those off the shelf frameworks, what are some of the principles or core libraries that you've developed to be able to manage the visibility into those custom implementations so that you don't lose context or lose visibility into the overall flow and quality of information that you're processing?
[00:16:42] Tulika Bhatt:
This is interesting. So I think observability and reliability are, like, you know, sort of, I would say, in a reasonably okay place for us. But I think, like, data quality is something where there's not, like, you know, a straightforward, solved answer. We have a lot of, like, you know, custom tooling. But to be honest, I don't think we have something comprehensive which is, like, you know, taking care of everything. We have, like, a bunch of different tools which are, like, you know, discrete, and each of them is pretty good at one particular thing, and we have to kind of use, like, a combination of everything to make sure, okay, the data quality is, like, you know, fine.
So that's an interesting and, like, you know, unsolved space for us right now. For example, in data quality, I feel like we have solved, like, you know, schema issues. You can have, like, you know, schema checkers and, like, you know, all of that. All of that is working fine. We have, like, you know, volume based auditing and everything. That also works pretty well for us. You can gauge it if something is, you know, not going as it is supposed to. But one thing that's missing in our tooling is, like, you know, semantic auditing, I guess.
So, for example, if some event happens and you log something, your log output might be correct according to the schema. The volume might be correct according to the trends you are seeing, but maybe you have logged the wrong thing. And there's not a really good way to, you know, actually catch those sorts of errors.
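As a rough illustration of what a semantic audit could look like beyond schema and volume checks, here is a minimal PySpark sketch; the table, columns, invariants, and thresholds are all hypothetical and not a description of Netflix's internal tooling.

```python
# Semantic audit sketch: schema and volume can both look fine while the content
# is wrong, so assert a couple of domain invariants and fail loudly if the
# violation rate crosses a tolerance.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("impressions-semantic-audit").getOrCreate()
df = spark.read.table("prod_catalog.impressions.events")   # hypothetical table

checks = {
    # A play attributed to an impression should never precede the impression itself.
    "play_before_impression": F.col("play_ts").isNotNull()
                              & (F.col("play_ts") < F.col("impression_ts")),
    # Row position on the home page should be a small non-negative integer.
    "row_index_out_of_range": (F.col("row_index") < 0) | (F.col("row_index") > 200),
}

total = df.count()
for name, predicate in checks.items():
    violations = df.filter(predicate).count()
    rate = violations / max(total, 1)
    print(f"{name}: {violations} rows ({rate:.4%})")
    if rate > 0.001:  # illustrative tolerance; real thresholds need tuning per dataset
        raise ValueError(f"semantic audit failed: {name} at {rate:.4%}")
```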
[00:18:28] Tobias Macey:
Another element of data observability is that it will also typically be at least somewhat correlated with observability of the operational infrastructure where the data didn't get delivered because one of the nodes crashed, and we have a split brain situation or we lost quorum, and so we're not able to actually move forward. And I'm wondering, what are some of the techniques that you have found helpful to be able to thread together that operational visibility, the operational characteristics of the underlying platforms with the actual data observability and data quality management to be able to more quickly understand what is the actual root cause, particularly when it's something operational and not a logical bug?
[00:19:17] Tulika Bhatt:
Operational bugs, at least at Netflix, are, like, you know, easier to detect. So for example, like, you know, we have a, like, you know, robust workflow orchestrator, which is called Maestro; I think it's an open source product too. So if, like, you know, something goes wrong, it's, like, you know, integrated with our alerting system, PagerDuty. So you would automatically sort of get alerted that, hey, there is, you know, something wrong with this workflow; it threw, like, a Spark error or something.
Now if you're not the direct, like, you know, owner of the workflow, if you're, like, some sort of consumer, we have, like, different sorts of alerts. Like, you know, we have, like, you know, a time to complete alert. So we can set alerts, like, this workflow normally takes, like, you know, three hours to complete. But if some day it took, like, you know, four or five hours, it just automatically, like, fires. And then, like, you know, we can easily go upstream and sort of, like, you know, check what happened in the entire, like, you know, processing pipeline. Was there, like, you know, some failure or something?
Then we have alerts like, you know, freshness alerts. So, like, we call it valid-to timestamp, VTTS. So whenever data is written, audited, and, you know, sort of, like, you know, processed, we release, like, you know, a flag saying, like, you know, this data is fresh and correct as of this particular timestamp. So you can have those alerts too. So if we feel like, oh, this data has not been, like, you know, freshened up for a while, it's, like, you know, been, like, five hours, and we didn't get any fresh data from that particular source, it automatically kind of, like, triggers another alert, and then it is routed to, like, you know, Slack, PagerDuty, or whatever you set up, and you just get automatically notified that there's something wrong over there, and you can go and investigate.
So I would say, like, operationally, I think it's in an okay situation right now. We're also sort of trying a little bit of, like, you know, self resolving alerts, where if we feel like, you know, we were supposed to receive, suppose, 10,000 records and we only received, like, 8,000 and there might be a late arriving data issue, then there are, like, you know, self healing pipelines where, like, an auto backfill sort of, like, kicks off and automatically kind of backfills the data. So we're, like, sort of, like, slowly incorporating that. It's not, like, 100% there, but, yeah, that's also one way of how we sort out operational issues.
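A minimal sketch of the freshness-plus-self-healing pattern described above, under stated assumptions: the watermark, row count, paging hook, and backfill trigger are all stand-ins for whatever the orchestrator and alerting system actually expose, and the SLA numbers are illustrative.

```python
# Compare a dataset's "valid-to" watermark and recent volume against expectations:
# page if it is stale, and kick off a backfill when the volume looks short.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=5)
EXPECTED_MIN_ROWS = 10_000

def check_and_heal(dataset: str, watermark: datetime, row_count: int,
                   page, trigger_backfill) -> None:
    """Freshness and volume check with a self-healing branch for late data."""
    now = datetime.now(timezone.utc)
    if now - watermark > FRESHNESS_SLA:
        page(f"{dataset} has not been refreshed since {watermark.isoformat()}")
    if row_count < EXPECTED_MIN_ROWS:
        # Likely late-arriving data: re-run the recent window instead of paging someone.
        trigger_backfill(dataset, start=watermark - timedelta(hours=1), end=watermark)

# Example wiring with stand-in callbacks; a real setup would call PagerDuty/Slack
# and the workflow orchestrator here.
check_and_heal(
    "impressions_hourly_rollup",
    watermark=datetime.now(timezone.utc) - timedelta(hours=6),
    row_count=8_000,
    page=lambda msg: print("ALERT:", msg),
    trigger_backfill=lambda ds, start, end: print(f"backfilling {ds} from {start} to {end}"),
)
```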
[00:22:11] Tobias Macey:
And one of the aspects of operating large, high uptime systems is that they require automation to be able to actually sustain them. Whereas if you're running at a smaller scale, you're running lower volumes of data, you can maybe get away with manual processes for error detection, error correction. And I'm curious what you see as lessons from working in these high uptime, high scale environments that are translatable to some of those smaller scales, where you can say, oh, it's easy to automate this thing, so it should just be part of the standard operating procedure no matter what your scale, and what are the aspects of large scale automation principles that don't translate as well to medium to small scale systems.
[00:23:04] Tulika Bhatt:
I know, like, in, like, you know, smaller scale systems, you can definitely get away with having a lot of, like, manual processes, like, you know, alerting, backfilling, or whatever. But I do think that it's important to keep these things in mind while you are designing these kinds of, like, you know, systems, because you never know when, like, you know, in the blink of an eye, your small scale system just ends up in a, like, a place where you're suddenly getting much more events than you originally planned for. So, for example, I feel like even though we have a lot of things that are automated at Netflix, having a good alerting strategy is still painful.
Like, we either under alert or over alert, both of which lead to their, like, you know, own set of, like, problems. So I think, like, at least in my experience, whenever I'm thinking of, like, designing a new data pipeline, I always kind of, like, you know, think about, okay, what are we measuring? Who is my consumer? What's the impact? What should the alerting strategy be? I think that's a good, I would say, thing to do even if you're starting with, like, you know, a smaller pipeline. Having, like, you know, a good runbook, even just for, like, you know, practice, for when something goes wrong. This is my runbook. These are all the details over there.
Do I need anything, like, you know, outside of my runbook to solve this particular problem? And the answer should be no. Your runbook should be complete. That's a good exercise to do. Sounds easy, but I'll tell you, more often than not it's not easy. But just getting in the habit, like, you know, right from the very beginning, I think, would be a, like, you know, useful thing to do.
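One lightweight way to make those questions concrete before the first deploy is to capture the pipeline's alerting contract next to its code. Every name, threshold, and route below is a made-up example, not a Netflix convention; the point is that the questions (what do we measure, who consumes it, how do we get paged, where is the runbook) get answered up front.

```python
# Illustrative alerting contract for a hypothetical pipeline.
PIPELINE_ALERTING = {
    "pipeline": "impressions_hourly_rollup",
    "consumers": ["homepage-ranking", "offline-model-training"],
    "signals": {
        "time_to_complete_hours": {"expected": 3, "alert_above": 5},
        "row_count": {"expected": 10_000_000, "alert_below_fraction": 0.8},
        "freshness_hours": {"alert_above": 5},
    },
    "routing": {"warn": "slack:#impressions-oncall", "critical": "pagerduty:impressions"},
    "runbook": "https://wiki.example.com/runbooks/impressions_hourly_rollup",
}
```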
[00:25:02] Tobias Macey:
And, obviously, the topic that has taken everybody's attention for the past couple of years is AI and all of its various applications. As data engineers and software engineers and engineers working in technical systems, there are definitely ways that we can use generative AI to automate and accelerate our work. But there is also a challenge of being able to feed it the right context, feed it the right understanding about the problem that you're trying to solve. And I'm wondering how you're seeing that factored into the work that you're doing at Netflix, and in particular, given the size and complexity of the systems that you're operating, being able to feed enough of that architectural knowledge into the AI systems to be able to get them to give you useful outputs without just having to fight with it and go through an untold number of rounds of prompting.
[00:26:06] Tulika Bhatt:
To be honest, I would say that we have had, like, mixed results at Netflix. So for example, like, we do have, like, an internal version of, like, you know, a search which is kind of trained on our, like, you know, internal documentation, and because that bot has, like, sort of, like, you know, an overall view of the documentation, it's definitely doing better. We can also, like, sort of, as fun projects, create our own Slack bots trained on our, like, you know, support questions, or train it on our runbook and help it to answer, like, you know, questions. I think those have had mixed results, like, not super great. Sometimes it just, like, feels like there's more effort in, like, you know, training the bot and making sure it's answering the right thing than just going and searching the answer by myself and answering it.
But we definitely, like, you know, use gen AI tools, those bots, for, like, you know, developer productivity. We use it for, like, auto completion in notebooks, just, like, you know, I guess, fancy auto complete. And then we definitely do integrate it with our pull requests to give, like, you know, some sort of, like, suggestions on, like, you know, code quality and all of that. I would say six out of 10 times, it gives pretty okay comments. Otherwise, sometimes it just doesn't work.
The other use case that's pretty good is, like, you know, we kind of integrate it with, like, our build boards or, like, you know, our workflow boards. So if there is any error that happens, it automatically kind of reads the log and bubbles it up, so you don't have to. And most of the time, it gets it right. So it's saving clicks, I would say, instead of you going through Spark or, like, you know, going through Jenkins and getting the logs to see, okay, what happened. It, like, bubbles it up, so that makes things faster.
Yeah. I think that's the, I guess, breadth of use cases we have right now for LLMs.
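A minimal sketch of the "read the failure log and bubble it up" helper described above. The summarization and Slack functions are hypothetical stand-ins passed in as callbacks, not a real internal API; the value is simply grabbing the tail of a failed run's log so nobody has to click through Jenkins or the Spark UI before seeing a one-line summary.

```python
# Summarize the tail of a failed job's log and post it to the on-call channel.
def report_failure(job_name: str, log_path: str, summarize_with_llm, post_to_slack) -> None:
    with open(log_path, "r", errors="replace") as f:
        tail = "".join(f.readlines()[-200:])   # the last lines usually contain the stack trace

    summary = summarize_with_llm(
        "Summarize the root cause of this failed data pipeline run in one sentence:\n" + tail
    )
    post_to_slack("#impressions-oncall", f"{job_name} failed: {summary}")

# Example wiring with print-based stand-ins:
# report_failure("impressions_hourly_rollup", "/tmp/run.log",
#                summarize_with_llm=lambda text: text.splitlines()[-1],
#                post_to_slack=lambda channel, msg: print(channel, msg))
```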
[00:28:19] Tobias Macey:
And as you have been building systems, working at Netflix, tackling the various data challenges that are faced by the scale at which you're operating, what are some of the ways that you foresee, or would like to see, some of the off the shelf technologies or some of the adopted best practices evolve to incorporate some more of this AI automation and intelligence into the actual processing layer to allow human operators to move even faster and not have to worry as much about the low level minutiae of the bits and bytes and work at a higher level of problem solving?
[00:29:04] Tulika Bhatt:
I think, like, maybe, as things get better with LLMs, I think it could be, like, you know, more helpful in, like, you know, the actual, sort of, like, obviously, like, you know, code completion, writing code, or something. Actually, let me take a step back. So when I'm designing, like, you know, any large scale system over here, I think the first step is actually designing the data model, coming up with it, the criteria of when that event is gonna fire, how that event looks on, like, different clients, client devices, and all of that. I think for that, I don't know if there's a good way to sort of, like, get it solved by LLMs.
It probably, like, you know, just needs so much context, and it's, like, going to be very difficult to provide all of that context and expect, like, great answers. It can be a good, like, you know, starting step, but there's just so much work. So I don't envision it being, like, you know, useful in that aspect. But once we have, like, say, a data contract or something, probably we could, like, you know, be integrating it with, like, you know, these bots and sort of, like, you know, generating code from that. Also, like, you know, it can help a little bit on, like, you know, planning on the architecture side. So for example, if I need a Flink job, I could just feed it, okay, I'm expecting this much, like, you know, volume, and, like, you know, each event is gonna be this size. Like, can you propose, like, you know, what's the starting cluster size I can, like, start off with to accommodate all of this? So probably if it could, like, you know, take all of this input and create, like, you know, the first initial cluster for me, I think that would be great if I can just get to that point.
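As a back-of-envelope version of the sizing exercise described above: given an expected event rate and average event size, you can estimate ingress throughput and a starting parallelism. The per-slot throughput figure below is an assumption for illustration, not a Flink guarantee, and the volumes are invented.

```python
# Rough Flink cluster sizing estimate from assumed inputs.
import math

events_per_sec = 500_000        # assumed peak event rate
event_size_bytes = 1_200        # assumed average serialized event size
mb_per_slot = 20                # assumed sustainable MB/s per task slot (workload dependent)

ingress_mb_per_sec = events_per_sec * event_size_bytes / 1_000_000
parallelism = math.ceil(ingress_mb_per_sec / mb_per_slot)
print(f"~{ingress_mb_per_sec:.0f} MB/s ingress -> start with parallelism of about {parallelism}")
# prints: ~600 MB/s ingress -> start with parallelism of about 30
```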
[00:31:01] Tobias Macey:
And another aspect of the ecosystem is that a lot of the tools that have become the widely adopted standards are getting to be old in technological terms where they've been around for a decade plus. Spark in particular was built in response to Hadoop and the challenges that people were facing with Hadoop. There have been various generations of successive technologies that are taking aim at Spark and Flink. I'm wondering what you see as the forward looking viability of Spark and Flink as the primary contenders in the ecosystem now that there has been enough understanding of their characteristics, the benefits that they provide, and the shortcomings in terms of their technical architectures and some of the ways that newer systems are designed to address some of those challenges.
[00:31:56] Tulika Bhatt:
I don't think, like, I see something which is going to be a replacement for, like, you know, Spark or Flink as of now. I do think, like, there's a lot of things that need to be better with, like, you know, Spark and Flink, for example. I think we just started auto scaling on Flink. I mean, I don't know, like, what version it was released in, but I think we just recently started, internally, with auto scaling of Flink. This is one of the bigger problems that we have in the data engineering world, where, like, you know, we didn't have, like, you know, systems that would automatically scale, like, you know, in software engineering, where it's kind of, like, stateless, and if there's more traffic, automatically, you can set up, like, you know, traffic guards or something and then, like, auto scalers, and they scale it. For a Flink job, we always had to manually scale it whenever we would see, oh, there's a lot of events coming in, consumer lag increasing. Even for stateless jobs, we had to do it. So I think they started with that. I don't think we have a solution for stateful jobs as of yet. So that's going to be an interesting problem to solve. As things become more and more real time, you obviously don't want a human in there to be actually, like, you know, operating your process size, and if these things can be taken care of, it'd be great. And, also, for example, for Spark, even though we have, like, optimizations on Spark, it doesn't really actually work all the time.
Also, like, you know, if there is something, like, you know, a hot partition or, like, you know, skewness or something, it requires a lot of, like, you know, manual, I would say, intervention. The regular, like, you know, parameters are not working. It just, like, fails, and then you have to sort of retune it and rerun it. So I think that is also another, like, you know, bigger problem which has not been solved in Spark. So even though these technologies have been here for a long time, there's still a lot of, like, you know, improvements that need to, you know, happen so that we can keep up with a future where, like, you know, more and more real time data needs will arise.
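One common manual mitigation for the hot-partition/skew problem mentioned above is key salting: spread a hot key across many tasks for a partial aggregation, then merge the partial results. This is a generic Spark pattern sketched under assumed table and column names, not a description of Netflix's internal tooling.

```python
# Salted aggregation to spread a skewed key across tasks.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-example").getOrCreate()
events = spark.read.table("prod_catalog.impressions.events")   # hypothetical table

SALT_BUCKETS = 32
salted = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# First aggregate per (title_id, salt) so one hot title is handled by many tasks,
# then collapse the salt to get the final counts.
partial = salted.groupBy("title_id", "salt").agg(F.count("*").alias("partial_count"))
final = partial.groupBy("title_id").agg(F.sum("partial_count").alias("impression_count"))
final.write.mode("overwrite").saveAsTable("prod_catalog.impressions.title_counts")
```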
[00:34:06] Tobias Macey:
And in your experience of working in this space of high volume, high speed data, and high uptime requirements at Netflix, what are some of the most interesting or innovative or unexpected ways that you have seen either your team or adjacent teams address the design and implementation of solutions to those large scale data management problems?
[00:34:32] Tulika Bhatt:
This is an interesting question. I think the first one that comes to my mind is in the impressions project, so it just immediately came to mind. So there was this use case of providing, like, you know, a year's worth of impressions for certain models. And, you know, providing a year's worth of a person's impressions in real time is just, like, you know, an impossible ask, because I think one person would see around, like, you know, maybe hundreds of thousands or even more impressions in a year, so it's just, like, a really big ask. So we were thinking about the ways we could, like, you know, achieve that, and with the raw data, that was an impossible answer, like, you know, getting that in real time at, like, you know, reasonable latency. So we were like, okay, what if we can aggregate it? What if we can reduce it somehow? So that's how we came to, like, you know, using something called, like, you know, EMEs, just, like, you know, taking impressions and converting them into numbers using, like, a formula. So now from objects, we went to numbers, but still, like, you know, one year is a lot. And, also, like, you know, you gotta have some job which processes one year of impressions and, like, you know, converts them into numbers. So we came up with, like, you know, having a Spark job that would do the crunching and store it into an Iceberg table. Now, all is good and fine. We have data in Iceberg, but we cannot expose Iceberg to a gRPC service. So from there, we need, like, you know, something else, like an online data store.
So this is the project I told you about, where we actually had to come up with a new technique of how we can, like, you know, take this entire gigantic universe of data and kind of, like, you know, upload it into an online data store. So we sort of devised a clever technique where what we do is, like, you know, we upload this data sort of, like, you know, weekly. And then we have another sort of real time service which takes these impressions, does some, like, you know, online crunching for the current week, takes the precomputed week old data, and just, like, combines them together and sort of, like, you know, provides a real time one year's worth of impressions. So I felt like that was an interesting use case because, normally, we have used Spark only for, like, you know, serving analytics purposes, and it's, like, after the fact. But this time, we used Spark and Iceberg in the actual, like, you know, operational flow, and we are powering, like, you know, a use case in a, like, you know, real time system, like a gRPC service.
And we have, like, Spark and, you know, Iceberg in the equation. So I felt like this was an interesting project, like, where you used both software engineering and data engineering, kind of combined together, and built, like, one product. I guess there are other examples too. Whatever really fulfills our requirement, we are always open for innovation and going beyond, like, you know, the normal ways of getting there.
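A rough sketch of the lambda-style read path described above: a weekly, bulk-loaded snapshot of roughly a year of per-title impression counts, stitched together at request time with a small real-time tail covering the most recent days. Both fetch functions are hypothetical accessors over the online store and the streaming aggregate, and the shapes are simplified for illustration; a gRPC handler would call the merge function directly.

```python
# Merge the precomputed yearly snapshot with the recent online-aggregated tail.
from typing import Callable, Dict

def get_yearly_impressions(
    profile_id: str,
    fetch_weekly_snapshot: Callable[[str], Dict[str, int]],
    fetch_recent_events: Callable[[str], Dict[str, int]],
) -> Dict[str, int]:
    snapshot = fetch_weekly_snapshot(profile_id)   # bulk-loaded by the Spark/Iceberg job
    recent = fetch_recent_events(profile_id)       # aggregated online since the last snapshot

    merged = dict(snapshot)                        # title_id -> impression count
    for title_id, count in recent.items():
        merged[title_id] = merged.get(title_id, 0) + count
    return merged
```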
[00:37:35] Tobias Macey:
And in your work of operating in such a high demand environment and learning more about the various primitives involved in building these data systems and maintaining reliability, what are some of the most interesting or challenging lessons that you learned personally?
[00:37:51] Tulika Bhatt:
I think, to be honest, I feel like, at least in this world, technical problems are, like, you know, easier to solve. It's more the organizational problems that are harder to navigate, and that's been my lesson. So, for example, like, you know, when you are dealing with impressions, now, impressions are going to be created on, like, you know, client devices. And by client devices, I mean, like, you know, it's going to be web or TV or your phone or something. So, like, you know, it's being generated over there. Now, all of these client devices, they'll have their own limitations on how they can generate that event: like, you know, how much, like, you know, logging facility that particular client has on the device, how much it can capture, and, like, you know, how much capability the manufacturer is providing them. So all of these, like, limitations kind of become really important to know when you are sort of designing an event, because now you have all of these constraints that you need to keep in mind, which, like, you know, maybe earlier you didn't need to know. You just needed to know, like, you know, this is my API or whatever. Now I need to know, okay, what are the constraints of their systems?
Then, what is their release cycle? How are they, like, you know, testing their logging artifacts? How are they doing canaries? How are they doing A/B testing? So, like, you know, there's just so much context required to do work in this space. So this is what has been, like, an interesting observation for me personally.
[00:39:26] Tobias Macey:
And as you continue to work in this space, invest in the reliability and capabilities of the data systems that you're responsible for, what are some of the lessons that you're looking to grow in, some of the resources that you rely on staying up to date and adding to your understanding of the space, and just general advice that you have for folks who want to be able to move into a similar position?
[00:39:53] Tulika Bhatt:
For me, personally, I am really invested in learning, like, you know, what's up with, like, you know, data quality. Can we finally get, like, you know, one solution that fits all and can solve most of my problems? So that is what I'm, like, you know, personally invested in. Like I said, currently, my project is to go to the producers, sit and understand their use cases, like, you know, what constraints they're operating in in order to produce an event, how the whole dev life cycle goes. So I think this is a good exercise, and this is something I would encourage others also to do. Like, you know, I don't know if it is just me, but I feel like, you know, it's easier to understand how data is being used by consumers, whereas the producer aspect of it becomes, like, you know, more black boxy. So, like, you know, just digging in over there and understanding what's happening in that world, that could help you, like, you know, have a better strategy for your, you know, data quality. You can actually stop bad data from going in if you are more, like, you know, plugged in with how your producers are doing their testing and their whole dev life cycle. So that would be one piece of advice: be more plugged in with the data production process.
And regarding how I keep up, I think I keep up with things normally as, like, any other folk in the field is keeping up, like, you know, just, like, reading the tech blogs of other, like, you know, companies, newsletters, podcasts, and, like, you know, even, like, conferences, just, like, listening to what's happening. I know there is, like, you know, some work going on in the data contract space. There was an open source project, so I'm curious, like, you know, what that will lead to. Yeah. And I think LinkedIn has open sourced, sort of, like, a data quality platform. I'm following that too, and I'm seeing if that can be something that we can actually adopt and use for our use cases.
[00:41:51] Tobias Macey:
Are there any other aspects of the work that you're doing, the lessons that you've learned working on large scale data processing systems, or the predictions or wishes that you have for the future of data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:42:01] Tulika Bhatt:
I think, working in, you know, the small scale world as well as moving to, like, a large scale world, one thing that has become clear to me is you need to definitely have good fundamentals. So, I mean, in the small scale world, probably, you can, you know, get away with not, like, you know, thinking about the architecture that much. You're not thinking about, like, you know, reliability that much or, you know, alerting or whatever. If there are, like, you know, inefficiencies, you can just cover them up with, like, you know, manually fixing things and all of that. But all of that really explodes when you go to a large scale system. Like, if you don't think about the design carefully, even, like, you know, minutes of, like, you know, downtime can mean, like, millions of events are now gone and can have an impact. So, like, you know, it becomes very imperative when you are, like, you know, designing any system to think about all the challenges.
Also, like, you know, not only, like, you know, about reliability, but also from a cost standpoint, like, you know, think about that. Like, okay, I'm going with this technology. Like, how much is it gonna cost? Also, like, really negotiating the amount of data that you need to have, that you need to store, like, you know, do you really need that data? Because all of this, again, is gonna cost you to, you know, process and store. So all of these decisions just become really important as your scale kind of increases. That would be my advice. Like, don't forget the fundamentals when designing data applications.
[00:43:44] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:44:00] Tulika Bhatt:
Yeah. Definitely. I think I already talked about some of it. We need a better solution for data quality, or for semantic, like, you know, answers for data quality. That's one. We definitely need, like, you know, more reactive tooling for real time purposes. So, like, you know, more autoscalers for Flink stateful jobs. Then definitely we need more sort of, like, you know, investment in performance optimization for Spark and, like, you know, less manual tuning and manual feedback. I think that is, like, another unsolved space that we have for now.
[00:44:42] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences of working at Netflix and being responsible for high speed, high volume, and high uptime data systems. It's definitely a very interesting problem space to be working in, and it has a lot of valuable lessons to be learned from it. So thank you again for taking the time to share that with us, and I hope you enjoy the rest of your day.
[00:45:12] Tulika Bhatt:
Yeah. Thank you so much. It was really nice talking with you. And, yeah, we had a lot of, like, you know, great conversation, a lot of questions that I would also think about later in my day. Thank you so much for taking this.
[00:45:25] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
Your host is Tobias Macey. And today, I'm talking to Tulika Bhatt about her experiences working on large scale data processing at Netflix and her insights on the future trajectory of the supporting technologies. So, Tulika, can you start by introducing yourself?
[00:01:02] Tulika Bhatt:
Sure. Hey, everyone. I'm Tulika, and I'm currently a senior software engineer at Netflix. And I work in the impression space, basically creating, like, you know, data services and datasets that power the impressions impressions work at Netflix. Before that, I used to work at BlackRock, and I worked on a lot of, like, you know, mission critical financial applications. And towards the end, I was working on data science platform. And I also spent some time in Verizon building some sort of applications.
[00:01:37] Tobias Macey:
And do you remember how you first got started working in data and what it is about the space that keeps you interested?
[00:01:43] Tulika Bhatt:
I would say that I kind of, like, accidentally ventured into data. It wasn't very intentional start, but then, like, you know, every role sort of brought me, closer and closer to data engineering. So for example, in BlackRock, I worked in a variety of different teams and roles. And, towards the end, I started working, on a data, data science platform tools. So it was just like, you know, creating customized Jupyter notebooks and, like, cron jobs for data scientists and, like, you know, creating, like, those libraries that they could access data, like, easily without having to figure out the behind the scenes infrastructure issue with that. And, you know, the project when we look at, like, BlackRock and we decided to sort of, like, sell it outside, then there was this, idea of creating, like, a usage based billing system, which should, like, you know, create events and, you know, you would process and crunch those events to generate bills. So I felt like that was, like, sort of, like, more moving towards, like, you know, data. It was eventually, like, you know, creating those modeling those events, creating events, and then sort of, like, crunching them and generating builds.
And I enjoyed my work, but I wanted to do something with, like, really large scale systems. And I got the opportunity to work at Netflix, and I jumped at it. And, yeah, I'm here now sort of working in the impression space.
[00:03:10] Tobias Macey:
With the focus on impressions, obviously, that's a very specific category of data. Netflix, as an organization, is very large, has a very disparate set of data processing systems, various requirements around those different systems. Wondering if you can give some framing about the characteristics of the platforms that you're building and some of the requirements as far as latency, uptime, etcetera, and how that frames the ways that you think about the work to be done.
[00:03:41] Tulika Bhatt:
Sure. So I'll first of all define impression so that we are on the same page. So when you, like, log in to Netflix, you, like, see home pages and it creates, you see a bunch of, like, images on the home pages. So we call those, you know, like, you know, images impressions. And, those are, like, you know, sort of gateway of you discovering the product and interacting with it and eventually leading to plays. So as you see, like, you know, impressions are, like, you know, really fundamental for discovery. So it it's a really important dataset at Netflix. So we use this piece of data in variety of forms. We use it for personalization.
So to see, like, which content are you, or impression are you interacting with and its leading to place that gives us a signal, okay, these are the contents that you like. Then we also use it for, like, you know, actually constructing the home page. So it's sort of, like, using, variety of purposes of, like, you know, business use case of how home page is created. So as you can, like, you know, hear about these use cases, it's kind of fairly clear that you needed both in the batch world as well as, like, in the online services. So if you're creating home pages, obviously, the latency has to be, like, as low as possible. Like, it has to be real time for, we also, like, create, like, you know, aggregated datasets on impressions for, like, a variety of personalization and, like, you know, model training use cases.
[00:05:17] Tobias Macey:
Working on large scale systems, there are numerous technologies that have been built specifically for high speed, high volume. I'm thinking in particular about the Kafkas and Flinks of the world. I'm wondering with the requirements around the type of data that you're working with, the speed at which it's coming through, and the latencies at which you're trying to deliver actionable insights to the downstream consumers. How much of the technology that you're working with are you able to pull off the shelf from, whether open source projects or commercial projects versus having to think about it from a, greenfield, you know, architecture design principles of this is what I need to do. These are the primitives that I need to be able to build from. And because I have a very bespoke need, I need to build a custom system and, you know, what the gradations are along those axes of just pulling off the shelf to build something from whole cloth.
[00:06:24] Tulika Bhatt:
Yeah. So I think, like, you know, whenever we are evaluating, like, technology, obviously, it's like the first, decision is, like, you know, whether you're gonna get it from, like, something from open source or do you need to build it as much as possible. I think the first step we do is we evaluate, like, you know, if there is already a solution available. So, like, you know, for most purposes, like, you know, Spark or, like, you know, Flink or Kafka is kind of, like, very good at what they're doing. So we don't just go about and, like, reinvent, like, those things again. We just use them, like, for their purposes. But there might be certain, like, you know, use cases where we are hitting the boundaries, and then we need to some think of, like, you know, customized solution in order to, like, you know, achieve that.
For example, there was one use case where I was crunching something using Spark and Iceberg, and then I wanted to populate it to an online data store like Cassandra. There was just no tooling available for us to do that directly. You could write a script going through all the rows, batching them, and writing them to Cassandra row by row, but it was basically impossible to tune and scale. It took a lot of time, it would overwhelm Cassandra, and it just wasn't working. So we worked with the platform team, and they built a custom solution that takes the data in Iceberg, converts it to SST files, and copies those directly. Actually, we didn't eventually send that to Cassandra; we ended up copying it into ProxDB and serving the online use case from there. But that was a case where there was no direct tooling available, so we had to be a little bit creative about what needed to be done to solve this particular problem.
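To make the row-by-row baseline she's contrasting against concrete, here is a minimal sketch in PySpark with the DataStax Python driver: read the Iceberg table and write it out in small batches from each partition. The table, keyspace, and column names are hypothetical, and, as described above, at this kind of volume a direct write path tends to overwhelm the cluster, which is why a file-based bulk load was built instead.

```python
# Baseline sketch (hypothetical names): Iceberg -> Cassandra via batched writes.
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-to-cassandra").getOrCreate()

# Assumes an Iceberg catalog named `prod` is already configured on the session.
df = spark.table("prod.analytics.impression_aggregates")

def write_partition(rows):
    # One connection per Spark partition; batch inserts to cut round trips.
    cluster = Cluster(["cassandra-host"])
    session = cluster.connect("impressions_ks")
    insert = session.prepare(
        "INSERT INTO profile_aggregates (profile_id, bucket, value) VALUES (?, ?, ?)"
    )
    batch, pending, batch_size = BatchStatement(), 0, 50
    for row in rows:
        batch.add(insert, (row.profile_id, row.bucket, row.value))
        pending += 1
        if pending == batch_size:
            session.execute(batch)
            batch, pending = BatchStatement(), 0
    if pending:
        session.execute(batch)
    cluster.shutdown()

# Every executor writes its own partitions; at very large volume this is the
# part that becomes hard to tune without hammering the datastore.
df.foreachPartition(write_partition)
```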
[00:08:34] Tobias Macey:
One of the benefits that you get by leaning on existing tools such as Spark, Flink, etcetera, is that they have had a lot of investment from the broader community, particularly around things like deployment, observability, and being able to pull together different components that are designed for various use cases. Whereas the benefit that you get from building custom is that you're able to design it specifically to suit your need. You don't have to worry about bringing in a whole bunch of extra functionality that's just dead weight and extra complexity for the problem you're trying to solve. And, obviously, when you're at a company like Netflix, you have enough resources to be able to invest in custom development, but you also don't have infinite time. I'm wondering, at least within your team, what the general heuristic is as far as build versus buy, and how far you're willing to push one of these off-the-shelf systems before you decide that you actually need to split out from it and build something on your own, or build some additional component that solves the need you have within that framework.
[00:09:40] Tulika Bhatt:
I think this is a really interesting question. There's always this talk about build versus buy. Like I said, for most use cases we first evaluate how far we can push the original solution that's out there, the open source one. We do have a flavor of Spark internally, but it's not exactly open source Spark; there's a platform team wrapping our requirements and needs over it, solving authentication and other problems so you don't have to do everything from scratch.
Then, I think I'm extremely lucky to be working at Netflix, because it's a place where you can always innovate; there's a very open innovation culture. So if you feel there's a better solution out there and you believe in it, we're very eager to experiment and build something on our own if we feel it suits our need. And this, I feel, is different, because at a lot of other companies there isn't that much freedom; in a way, you kind of have to stay within company-sanctioned technologies. So I don't know if I answered your question.
[00:11:16] Tobias Macey:
No, that's definitely helpful. Another aspect of that problem is that, in order to even make an informed decision about when and whether to break away from the existing frameworks, you have to have enough domain knowledge about the problem space, about the technologies, and about the core primitives and software principles of how these systems work, which obviously requires a lot of experience and on-the-job learning, as well as whatever education you came in with. And while it's definitely possible to become an expert in Spark or in Flink because of the resources that are available, it's much more complicated to get that breadth of knowledge, and it's not something where you necessarily have a road map that says: these are all of the things I need to know, and this is how I obtain all of them. It's usually a very ad hoc experience. From your career path and your experiences of working at Verizon, then BlackRock, and now at Netflix, what do you see as the learning opportunities that have been most useful, some of the ways that you've had to stretch your knowledge and gain new skills in order to tackle these problems as they arise, and some of the challenge that poses as an engineer in having that breadth and depth of understanding?
[00:12:41] Tulika Bhatt:
Yeah, definitely. I feel like working in the software world is just immense; it's an ocean of knowledge. You cannot claim to be an expert in anything, because there's just so much out there that you don't know. For example, while I was working at BlackRock and we were experimenting with Kubernetes, I found that having a good breadth of foundational knowledge is what matters. If you're evaluating which cloud data technology to use or which database to use, having the foundational knowledge of how the database works, what its positives and negatives are, what your particular use case is, whether it's write heavy or read heavy, what the latency requirement is, how you're going to partition it, all of that, the fundamentals, is really important.
And from there, you go about looking at the different options available and comparing them. You can choose to be an expert in certain tools, say Flink or Spark or a certain database. But I wouldn't expect anybody to have the whole breadth of knowledge as well as expertise in everything; that's more or less impossible.
[00:14:13] Tobias Macey:
In terms of the team that you're working with, what are some of the ways that you as a group lean on each other to accommodate the knowledge gaps in understanding that each of you have, and ways that you're able to work together to understand the entire breadth of the problem space and help level each other up, particularly if somebody comes in and says, I got hired on because I'm an expert in Spark, but now we're dealing with Iceberg tables or Flink stream processing, and I'm way out of my depth here?
[00:14:44] Tulika Bhatt:
We definitely lean on each other for that sort of help. We also have dedicated platform teams who are experts on certain things, and their job, literally, is to keep an eye on the market and keep evaluating what new technologies are out there. They also bring up reports where they evaluate something that's out there and say, we think this is good, and we can think about adopting this technology internally.
Or whether they think it might not suit our company. So I'm very lucky to have that resource. They do the first line of groundwork for us when we're evaluating whether we need to go after something else, and we lean on them and get their expertise. If we don't agree with them, or if we feel our use case is very niche, then we can also put forward a counterargument: hey, I think this works for us and this suits us. As long as you have enough good arguments and data points backing your use case, you can go and use anything that you want here.
[00:16:07] Tobias Macey:
And then, particularly along the axis of reliability, observability, and data quality management, when you are building custom components or extensions to some of those off-the-shelf frameworks, what are some of the principles or core libraries that you've developed to manage the visibility into those custom implementations, so that you don't lose context or lose visibility into the overall flow and quality of information that you're processing?
[00:16:42] Tulika Bhatt:
This is interesting. I think observability and reliability are still, I would say, a bit of a sore spot for us. And data quality is something where there isn't a straightforward, solved answer. We have a lot of custom tooling, but to be honest, I don't think we have something comprehensive that takes care of everything. We have a bunch of different tools that are discrete, each pretty good at one particular thing, and we have to use a combination of all of them to make sure the data quality is fine.
So that's an interesting and unsolved space for us right now. For example, in data quality, I feel like we have solved schema issues. You can have schema checkers and all of that, and it's working fine. We have volume-based auditing and everything; that also works pretty well for us. You can gauge if something is not going the way it's supposed to. But one thing that's missing in our tooling is semantic auditing.
For example, if some event happens and you log something, your log output might be correct according to the schema, and the volume might be correct according to the trends you're seeing, but maybe you have logged the wrong thing. And there's not really a good way to actually catch those sorts of errors.
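As an illustration of the gap being described, here is a small, hypothetical sketch of what a semantic audit rule could look like: an event that passes schema and volume checks but still gets flagged because its payload doesn't make sense against the page that was actually rendered. The field names and rules are invented for the example, not Netflix's tooling.

```python
from dataclasses import dataclass

@dataclass
class ImpressionEvent:
    profile_id: str
    title_id: str
    row_index: int
    rendered_titles: frozenset  # titles actually shown on the rendered page

def semantic_issues(event: ImpressionEvent) -> list:
    """Checks beyond schema and volume: does the event *mean* something valid?"""
    issues = []
    # The schema only says title_id is a non-empty string; semantically it must
    # be a title that was actually on the page the member saw.
    if event.title_id not in event.rendered_titles:
        issues.append("impression logged for a title that was never rendered")
    # A row index can be a perfectly valid integer and still be nonsense.
    if not (0 <= event.row_index <= 100):
        issues.append("row_index outside any plausible home-page layout")
    return issues

# Schema-valid, volume looks normal, but semantically wrong:
event = ImpressionEvent("p1", "title_42", 3, frozenset({"title_7", "title_9"}))
print(semantic_issues(event))  # ['impression logged for a title that was never rendered']
```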
[00:18:28] Tobias Macey:
Another element of data observability is that it will also typically be at least somewhat correlated with observability of the operational infrastructure where the data didn't get delivered because one of the nodes crashed, and we have a split brain situation or we lost quorum, and so we're not able to actually move forward. And I'm wondering, what are some of the techniques that you have found helpful to be able to thread together that operational visibility, the operational characteristics of the underlying platforms with the actual data observability and data quality management to be able to more quickly understand what is the actual root cause, particularly when it's something operational and not a logical bug?
[00:19:17] Tulika Bhatt:
Operational bugs, at least with metrics, are easier to detect. For example, we have a robust workflow orchestrator called Maestro, which I think is an open source product too. If something goes wrong, it's integrated with our alerting system, PagerDuty, so you automatically get alerted that, hey, there's something wrong with this workflow; it threw a Spark error or something.
Now, if you're not the direct owner of the workflow, if you're some sort of consumer, we have different sorts of alerts. For example, we have time-to-complete alerts that we can set. Say this workflow normally takes three hours to complete; if on some day it takes four or five hours, the alert automatically fires, and we can easily go upstream and check what happened in the entire processing pipeline, whether there was some failure or something.
Then we have freshness alerts. We call it the valid-to timestamp, VTTS. Whenever data is written, audited, and processed, we release a flag saying this data is fresh and correct as of this particular timestamp. You can have alerts on that too. So if this data hasn't been freshened up for a while, say it's been five hours and we didn't get any fresh data from that particular source, it automatically triggers another alert, and it's routed to wherever you've set it up, Slack, PagerDuty, or whatever, and you just get automatically notified that there's something wrong over there and you can go investigate.
So I would say, operationally, it's in an okay situation right now. We're also trying a little bit of self-resolving alerts: if we were supposed to receive, say, 10,000 records and we only received 8,000, and it might be a late-arriving data issue, then there are self-healing pipelines where an auto-backfill kicks off and automatically backfills the data. We're slowly incorporating that. It's not 100% there, but that's also one way we sort out operational issues.
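A rough sketch of how the freshness and self-healing checks described here could fit together, assuming a table exposes a valid-to timestamp and an expected record count. The thresholds, function names, and routing are placeholders for illustration, not Netflix's actual tooling.

```python
import time

FRESHNESS_SLA_SECONDS = 5 * 3600   # e.g. alert if no fresh data for ~5 hours
MIN_COMPLETENESS = 0.9             # tolerate some late-arriving data before paging

def check_partition(valid_to_ts: float, received: int, expected: int):
    """Freshness + completeness checks; returns actions rather than performing them."""
    actions = []
    if time.time() - valid_to_ts > FRESHNESS_SLA_SECONDS:
        actions.append(("alert", "data not refreshed within the freshness SLA"))
    if expected and received / expected < MIN_COMPLETENESS:
        # Likely late-arriving data: kick off an auto-backfill instead of paging.
        actions.append(("backfill", f"received {received} of {expected} records"))
    return actions

# Hypothetical wiring: route 'alert' to Slack/PagerDuty, 'backfill' to a workflow.
for action, detail in check_partition(valid_to_ts=time.time() - 6 * 3600,
                                      received=8_000, expected=10_000):
    print(action, "->", detail)
```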
[00:22:11] Tobias Macey:
And one of the aspects of operating large, high-uptime systems is that they require automation to actually sustain them, whereas if you're running at a smaller scale with lower volumes of data, you can maybe get away with manual processes for error detection and correction. I'm curious what you see as the lessons from working in these high-uptime, high-scale environments that are translatable to smaller scales, where you can say, it's easy to automate this thing, so it should just be part of the standard operating procedure no matter what your scale, and what the aspects of large-scale automation principles are that don't translate as well to medium and small scale systems.
[00:23:04] Tulika Bhatt:
In a smaller scale system, you can definitely get away with having a lot of manual processes, like alerting, backfilling, or whatever. But I do think it's important to keep these things in mind while you're designing these kinds of systems, because you never know when, in the blink of an eye, your small scale system ends up in a place where you're suddenly getting many more events than you originally planned for. For example, even though we have a lot of things automated at Netflix, having a good alerting strategy is still painful.
We either under-alert or over-alert, and both of those lead to their own set of problems. So, at least in my experience, whenever I'm designing a new data pipeline, I always think about: what are we measuring? Who is my consumer? What's the impact? What should the alerting strategy be? I think that's a good thing to do even if you're starting with a smaller pipeline. And having a good runbook, even just as practice for when something goes wrong: this is my runbook, these are all the details in there.
Do I need anything outside of my runbook to solve this particular problem? The answer should be no; your runbook should be complete. That's a good exercise to go through. It sounds easy, but I'll tell you, more often than not it's not easy. But just getting into that habit right from the very beginning, I think, is a useful thing to do.
[00:25:02] Tobias Macey:
And, obviously, the topic that has taken everybody's attention for the past couple of years is AI and all of its various applications. As data engineers and software engineers working on technical systems, there are definitely ways that we can use generative AI to automate and accelerate our work. But there's also the challenge of being able to feed it the right context and the right understanding about the problem that you're trying to solve. I'm wondering how you're seeing that factor into the work that you're doing at Netflix, and in particular, given the size and complexity of the systems that you're operating, being able to feed enough of that architectural knowledge into the AI systems to get them to give you useful outputs without just having to fight with them and go through an untold number of rounds of prompting.
[00:26:06] Tulika Bhatt:
To be honest, I would say we've had mixed results at Netflix. For example, we do have an internal version of search that's trained on our internal documentation, and because that bot has an overall view of the documentation, it's definitely doing better. We can also, as fun projects, create our own Slack bots trained on our support questions or on our runbooks and have them answer questions. Those have had mixed results, not super great. Sometimes it feels like there's more effort in training the bot and making sure it's answering the right thing than in just going and searching for the answer myself.
But we definitely use gen AI tools and those bots for developer productivity. We use it for auto-completion in notebooks, I guess a fancy autocomplete. And we definitely integrate it with our pull requests to give suggestions on code quality and all of that. I would say six out of ten times it gives pretty okay comments; otherwise, sometimes it just doesn't work.
One other use case that's pretty good is that we integrate it with our build boards and our workflow log boards. If any error happens, it automatically reads the log and bubbles it up, and most of the time it gets it right. So it's saving clicks: instead of you going through Spark or Jenkins and pulling the logs to see what happened, it bubbles it up, and that makes things faster. What else?
Yeah, I think that's the breadth of use cases we have right now for LLMs.
[00:28:19] Tobias Macey:
And as you have been building systems, working at Netflix, and tackling the various data challenges that come with the scale at which you're operating, what are some of the ways that you foresee, or would like to see, the off-the-shelf technologies or adopted best practices evolve to incorporate more of this AI automation and intelligence into the actual processing layer, to allow human operators to move even faster and not have to worry as much about the low-level minutiae of the bits and bytes, and work at a higher level of problem solving?
[00:29:04] Tulika Bhatt:
I think, as things get better with LLMs, they could be more helpful in the actual work, obviously for code completion, writing code, and so on. Actually, let me take a step back. When I'm designing any large-scale system here, the first step is actually designing the data model: coming up with it, the criteria for when an event is going to fire, how that event looks on different clients and client devices, and all of that. For that, I don't know if there's a good way to get it solved by LLMs.
It just needs so much context, and it's going to be very difficult to provide all of that context and expect great answers. It can be a good starting step, but there's just so much work, so I don't envision it being useful in that aspect. But once we have, say, a data contract or something, we could probably integrate it with these bots and generate code from that. It can also help a little bit with planning on the architecture side. For example, if I need a Flink job, I could just feed it: I'm expecting this much volume, and each event is going to be this size; can you propose a starting cluster size I can start off with to accommodate all of this? If it could take all of that input and create that first initial cluster for me, I think that would be great, if I can just get to that point.
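The kind of starting point she's describing can also be framed as a back-of-envelope calculation. Here is a naive sketch with made-up throughput assumptions (events per second a single task slot can handle, plus a headroom factor); it illustrates the arithmetic, not a real sizing formula.

```python
import math

def initial_flink_sizing(events_per_sec: int, bytes_per_event: int,
                         per_slot_events_per_sec: int = 20_000,
                         headroom: float = 1.5) -> dict:
    """Naive first guess at parallelism and ingress bandwidth for a streaming job."""
    parallelism = math.ceil(events_per_sec * headroom / per_slot_events_per_sec)
    ingress_mb_per_sec = events_per_sec * bytes_per_event / 1e6
    return {"parallelism": parallelism,
            "ingress_mb_per_sec": round(ingress_mb_per_sec, 1)}

# Example: 1M events/sec at ~500 bytes per event.
print(initial_flink_sizing(events_per_sec=1_000_000, bytes_per_event=500))
# {'parallelism': 75, 'ingress_mb_per_sec': 500.0}
```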
[00:31:01] Tobias Macey:
And another aspect of the ecosystem is that a lot of the tools that have become the widely adopted standards are getting to be old in technological terms where they've been around for a decade plus. Spark in particular was built in response to Hadoop and the challenges that people were facing with Hadoop. There have been various generations of successive technologies that are taking aim at Spark and Flink. I'm wondering what you see as the forward looking viability of Spark and Flink as the primary contenders in the ecosystem now that there has been enough understanding of their characteristics, the benefits that they provide, and the shortcomings in terms of their technical architectures and some of the ways that newer systems are designed to address some of those challenges.
[00:31:56] Tulika Bhatt:
I don't see anything that's going to be a replacement for Spark or Flink right now. I do think there are a lot of things that need to get better with Spark and Flink, though. For example, I think we just started autoscaling on Flink; I don't know what version it was released in, but internally we only recently started with autoscaling of Flink. This is one of the bigger problems we have in the data engineering world: we didn't have systems that would automatically scale the way stateless services do in software engineering, where if there's more traffic you can set up load balancing and autoscalers and they just scale. For a Flink job, we always had to manually scale it whenever we saw a lot of events coming in and consumer lag increasing; even for stateless jobs, we had to do it. So I think they've started with that, but I don't think we have a solution for stateful jobs yet, and that's going to be an interesting problem to solve. As things become more and more real time, you obviously don't want a human in there manually operating your processing, so if these things can be taken care of, that would be great.
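As a minimal sketch of the lag-driven scaling decision she's describing for a stateless streaming job: when consumer lag grows, work out how many task slots are needed to both keep up with the input rate and drain the backlog within a recovery window. The per-task throughput and other numbers are illustrative assumptions, not Netflix's actual policy.

```python
import math

def target_parallelism(input_rate: float, consumer_lag: float,
                       per_task_rate: float, recovery_window_s: float,
                       current: int, max_parallelism: int = 256) -> int:
    """Rate the job must sustain = steady-state input + rate needed to drain the lag."""
    required_rate = input_rate + consumer_lag / recovery_window_s
    target = math.ceil(required_rate / per_task_rate)
    # Never scale down while there is still a backlog to drain.
    if consumer_lag > 0:
        target = max(target, current)
    return min(target, max_parallelism)

# Example: 200k events/sec input, 60M events of lag, 20k events/sec per task,
# and we want the backlog drained within 10 minutes.
print(target_parallelism(200_000, 60_000_000, 20_000, 600, current=10))  # 15
```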
Also, for example, with Spark, even though we have optimizations, they don't actually work all the time. If there's a hot partition or skewness or something, it requires a lot of manual intervention; the regular parameters don't work, the job just fails, and then you have to retune it and rerun it. So I think that's another big problem that hasn't been solved in Spark. Even though these technologies have been here for a long time, there's still a lot of improvement that needs to happen so that we can keep up with a future where more and more real-time data needs will arise.
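For the hot-partition and skew problem she mentions, two common mitigations are Spark's adaptive skew-join handling and manually salting a hot key; a brief PySpark sketch of both follows, with hypothetical table and column names.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-mitigation").getOrCreate()

# Option 1: let adaptive query execution split oversized join partitions (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Option 2: salt a hot key so one giant key fans out across many tasks.
impressions = spark.table("impressions")        # hypothetical fact table
titles = spark.table("title_metadata")          # hypothetical dimension table
NUM_SALTS = 16

salted_impressions = impressions.withColumn(
    "salt", (F.rand() * NUM_SALTS).cast("int"))
# Replicate the small side once per salt value so every salted key finds a match.
salted_titles = titles.crossJoin(
    spark.range(NUM_SALTS).withColumnRenamed("id", "salt"))

joined = salted_impressions.join(salted_titles, on=["title_id", "salt"], how="left")
```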
[00:34:06] Tobias Macey:
And in your experience of working in this space of high volume, high speed data, and high uptime requirements at Netflix, what are some of the most interesting or innovative or unexpected ways that you have seen either your team or adjacent teams address the design and implementation of solutions to those large scale data management problems?
[00:34:32] Tulika Bhatt:
This is an interesting question. The first one that comes to mind is in the impressions project, so it immediately jumped to mind. There was a use case for providing a year's worth of impressions for certain models. Providing a year's worth of a person's impressions in real time is basically an impossible ask, because one person might see around a hundred thousand or even more impressions in a year; it's a really big ask. So we were thinking about ways we could achieve that. With the raw data, getting that in real time at a reasonable latency was impossible. So we said, okay, what if we can aggregate it? What if we can reduce it somehow? That's how we came to using something we call EMEs: taking impressions and converting them into numbers using a formula. Now we've gone from objects to numbers, but still, one year is a lot, and you've got to have some job that processes one year of impressions and converts them into numbers. So we came up with a Spark job that would do the crunching and store it into an Iceberg table. Now, all is good and fine, we have the data in Iceberg, but we cannot expose Iceberg to a gRPC service, so from there we need something else, like an online data store.
This is the project I told you about, where we had to come up with a new technique for taking this gigantic universe of data and uploading it into an online data store. We devised a technique where we upload this data weekly, and then we have another real-time service that takes the impressions, does some online crunching for the current week, takes the precomputed week-old data, combines the two together, and provides a year's worth of impressions in real time. I felt that was an interesting use case because, normally, we had used Spark only for serving analytics purposes, after the fact. But this time we used Spark and Iceberg in the actual operational flow, and we're powering a use case in a real-time system, a gRPC service.
So we have Spark and Iceberg in the equation. I felt this was an interesting project where you used both software engineering and data engineering, combined them together, and built one product. I guess there are other examples too. Whatever really fulfills our requirement, we're always open to innovation and going beyond the normal ways of getting there.
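A condensed sketch of the serving-side idea described above: combine a weekly precomputed aggregate (crunched by the Spark job and bulk-loaded into the online store) with a live aggregate over the most recent days, so the service can answer with roughly a year of history at low latency. The store and streaming-aggregate APIs here are hypothetical stand-ins, not the actual Netflix services.

```python
from datetime import datetime, timezone

def year_of_impressions(profile_id: str, snapshot_store, online_agg) -> dict:
    """Merge the weekly precomputed aggregate with a live aggregate over the tail window."""
    # Written by the weekly Spark + Iceberg job, bulk-loaded into the online store.
    snapshot = snapshot_store.get(profile_id)  # e.g. {"count": 98_000, "as_of": datetime(...)}
    # Computed by the streaming service over events newer than the snapshot.
    tail_count = online_agg.count_since(profile_id, since=snapshot["as_of"])
    # The caller gets ~one year of history without scanning a year of raw events.
    return {"as_of": datetime.now(timezone.utc),
            "count": snapshot["count"] + tail_count}
```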
[00:37:35] Tobias Macey:
And in your work of operating in such a high-demand environment and learning more about the various primitives involved in building these data systems and maintaining reliability, what are some of the most interesting or challenging lessons that you've learned personally?
[00:37:51] Tulika Bhatt:
To be honest, I feel like, at least in this world, technical problems are easier to solve. It's more the organizational problems that are harder to navigate, and that's been my lesson. For example, when you're dealing with impressions, the impressions are going to be created on client devices, and by client devices I mean the web, or TVs, or your phones, or whatever. They're generated over there. Now, all of these client devices have their own limitations on how they can generate that event: how much logging capability that particular client has on the device, how much it can capture, and how much capability the manufacturer provides. So all of these limitations become really important to know when you're designing an event, because now you have all of these constraints to keep in mind, which maybe earlier you didn't need to know. You just needed to know, this is my API or whatever. Now I need to know, what are the constraints of their systems?
Then, what is their release cycle? How are they testing their logging artifacts? How are they doing canaries? How are they doing A/B testing? There's just so much context required to do work in this space. So that's been an interesting observation for me personally.
[00:39:26] Tobias Macey:
And as you continue to work in this space and invest in the reliability and capabilities of the data systems that you're responsible for, what are some of the lessons that you're looking to grow in, some of the resources that you rely on for staying up to date and adding to your understanding of the space, and just general advice that you have for folks who want to move into a similar position?
[00:39:53] Tulika Bhatt:
For me personally, I'm really invested in learning what's up with data quality. Can we finally get one solution that fits all and solves most of my problems? That's what I'm personally invested in. Like I said, my current project is to go to the producers, sit with them, and understand their use cases: what constraints they're operating under in order to produce an event, and how the whole dev life cycle goes. I think this is a good exercise, and something I would encourage others to do too. I don't know if it's just me, but I feel it's easier to understand how data is being used by consumers, whereas the producer aspect of it becomes more of a black box. So digging in over there and understanding what's happening in that world can help you have a better strategy for your data quality. You can actually stop bad data from going in if you're more plugged in with how your producers are doing their testing and their whole development life cycle. So that would be one piece of advice: be more plugged in with the data production process.
And regarding how I keep up, I think I keep up the way any other folk in the field does: reading the tech blogs of other companies, newsletters, podcasts, and even conferences, just listening to what's happening. I know there's some work going on in the data contracts space; there was an open source project, so I'm curious what that will lead to. And there's some other work; I think LinkedIn has an open source data quality platform. I'm following that too and seeing if it's something we can actually adopt and use for our use cases.
[00:41:51] Tobias Macey:
Are there any other aspects of the work that you're doing, the lessons that you've learned working on large scale data processing systems, or the predictions or wishes that you have for the future of data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:42:01] Tulika Bhatt:
I think, having worked in the small-scale world as well as moving to the large-scale world, one thing has become clear to me: you definitely need to have good fundamentals. In the small-scale world, you can probably get away with not thinking about the architecture that much, not thinking about reliability that much, or alerting, or whatever. If there are inefficiencies, you can just cover them up by manually fixing things. But all of that really explodes when you go to a large-scale system. If you don't think about the design carefully, even minutes of downtime can mean millions of events are gone, and that can have an impact. So it becomes very imperative, when you're designing any system, to think about all the challenges.
Also, not only about reliability, but also from a cost standpoint: think about that too. Okay, I'm going with this technology; how much is it going to cost? Also, really negotiate about the amount of data that you need to have and store. Do you really need that data? Because all of it, again, is going to cost you to process and store. All of these decisions just become really important as your scale increases. That would be my advice: don't forget the fundamentals when designing data applications.
[00:43:44] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:44:00] Tulika Bhatt:
Yeah, definitely. I already talked about some of it. We need a better solution for data quality, or for semantic answers on data quality; that's one. We definitely need more reactive tooling for real-time purposes, like autoscalers for Flink stateful jobs. And then we definitely need more investment in performance optimization for Spark, with less manual tuning and manual feedback. I think that's another unsolved space that we have for now.
[00:44:42] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences of working at Netflix and being responsible for high speed, high volume, and high uptime data systems. It's definitely a very interesting problem space to be working in, and it has a lot of valuable lessons to be learned from it. So thank you again for taking the time to share that with us, and I hope you enjoy the rest of your day.
[00:45:12] Tulika Bhatt:
Yeah, thank you so much. It was really nice talking with you. We had a great conversation, and a lot of the questions are ones I'll also think about later in my day. Thank you so much.
[00:45:25] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Tulika's Journey into Data Engineering
Understanding Impressions at Netflix
Technology Choices: Build vs Buy
Learning and Knowledge Sharing in Data Engineering
Observability and Data Quality Challenges
Automation in High Uptime Systems
AI's Role in Data Engineering
Future of Data Processing Technologies
Innovative Solutions in Large Scale Data Management
Lessons Learned in High Demand Environments
Closing Thoughts and Future Directions