Summary
Every business needs a pipeline for its critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing your definition of a data pipeline?
- At what point in the life of a project or organization should you start thinking about building a pipeline?
- In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
- What metrics/use cases should you be optimizing for at this point?
- What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
- How do the design requirements for a data pipeline change as you reach this stage?
- What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
- What are some of the changes that are necessary as you move to a large scale data pipeline?
- At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
- In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
- When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
- Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
- What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
- How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
- What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
- What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
- What are your plans for improving your current pipeline at Grubhub?
- What are some references that you recommend for anyone who is designing a new data platform?
Contact Info
- @sirchristian on Twitter
- Blog
- sirchristian on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Scaling ETL blog post
- GrubHub
- Data Warehouse
- Redshift
- Spark
- Hive
- Amazon EMR
- Looker
- Redash
- Metabase
- A Primer on Enterprise Data Curation
- Pub/Sub (Publish-Subscribe Pattern)
- Change Data Capture
- Jenkins
- Python
- Azkaban
- Luigi
- Zendesk
- Data Lineage
- AirBnB Engineering Blog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them, so check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey, and today I'm interviewing Christian Heinzmann about how data pipelines evolve as your business grows. So, Christian, could you start by introducing yourself? Sure. I'm Christian Heinzmann. I'm the director of data
[00:01:11] Unknown:
engineering for data warehousing at Grubhub. I've been there for about two years. What I've really been working on is our ETL pipelines. We have two data warehouses: one is more traditional, and one is our big data pipeline. That's what I do day to day. I've learned a lot of things along the way, made some right decisions, and made some wrong decisions. Alright. And do you remember how you first got involved in the area of data management? Yeah. I started my career as more of a traditional software engineer, but a lot of my work was heavy data processing. There was a lot of scraping of websites, cleaning up that data, and storing it. As I was going through that, I got very interested in the business side, in what the business needs, and pivoted my career a little bit more towards startups.
And in the startup world, I started building more of our analytical data warehouses, which let me interface with all the areas of the business, which was something I was interested in. I got really interested in business and data warehousing and how you deal with all this data, started moving up through the startup world, and ended up going to Grubhub, where we have large amounts of interesting data problems. And speaking of interesting problems,
[00:02:34] Unknown:
the fact that you have two different data warehouses for, I'm assuming, slightly different purposes, I'm sure poses some unique challenges as far as how you're processing the data and making sure that everything stays in sync. So I'm wondering if you can just briefly talk about the main use cases that each storage location serves in terms of the broader business needs. They're trying to serve
[00:02:58] Unknown:
very similar use cases. In some ways it's really about how we do data warehousing at the scale that we are. We do our more traditional data warehousing in Redshift, which is really good for fast ad hoc queries as long as the server is not under too much load. But where it really had challenges was scaling the write part, the loading part. That was really what started the move to more of a big data stack. In our big data stack, a lot of our ETL is done with Spark and Hive on top of Amazon EMR. We can spin clusters up and down, which basically gives us almost infinite amounts of compute, and that really helps our write scale. That's probably the big difference with that data warehouse. And there are definitely challenges keeping them in sync. Sometimes we try to sync from one to the other, sometimes we don't. It's a work in progress, it sounds like. Yep. It's definitely one of those things. Some of the challenges I wrote about in the scaling ETL post came from the fact that we weren't able to scale with Redshift. There were a lot of other choices in how we could have done it that probably would have made this transition a little bit easier. And so given your experience
[00:04:24] Unknown:
at startups, and now moving to Grubhub and trying to scale the capabilities for data processing there, you ended up writing an interesting blog post that inspired me to reach out to you and talk a bit more about your experiences of building ETL and data processing pipelines for these different scales of organization and data volume. So I'm wondering if you can just start by sharing what your definition is of a data pipeline and how you think about it.
[00:04:55] Unknown:
Sure. So when I say data pipeline, I actually have a very simplistic definition. I would say any time you're going to move or transform data from one place to another, that's your data pipeline. It could be something very simple; you could just have a cron job that runs a SQL query, and that's your data pipeline. It could be as complex as a dedicated scheduler pulling data from streams and from multiple data sources, combining them together, with multiple steps in your transformation jobs. I would call all of those data pipelines. I think a lot of people end up thinking of a data pipeline as the latter, but I would say that any time you're moving data, you're building a data pipeline.
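To make that definition concrete, here is a minimal sketch of the simplest kind of pipeline he describes: a cron-scheduled script that runs one SQL query to move data from a transactional table into a reporting table. The connection string and table names are illustrative rather than Grubhub's actual schema, and the sketch assumes a Postgres database reachable with psycopg2.

```python
#!/usr/bin/env python3
"""Nightly cron job: summarize yesterday's orders into a reporting table."""
import psycopg2  # assumes a Postgres transactional database

# Hypothetical connection string; point this at a replica if you have one.
DSN = "dbname=app user=reporting host=db.internal"

SUMMARIZE = """
    INSERT INTO daily_order_summary (order_date, orders, revenue)
    SELECT created_at::date, count(*), sum(total)
    FROM orders
    WHERE created_at >= current_date - interval '1 day'
      AND created_at < current_date
    GROUP BY created_at::date;
"""

def main():
    # One query, one destination table: that is already a data pipeline.
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(SUMMARIZE)

if __name__ == "__main__":
    # A crontab entry like "0 5 * * * /opt/etl/daily_summary.py" schedules it.
    main()
```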
[00:05:46] Unknown:
And so given that very broad definition, any time that you need to deal with data at all, you can start thinking about it in terms of pipelining operations. In the beginning of the post that you wrote, you were describing that when you're first building an application or starting to build out an organization, your pipeline should be very simplistic and mostly manual. So I'm wondering if you can discuss some of the approaches that you take at the early stages of a business, with small scale data, and some of the design characteristics that you should be targeting for that type of pipeline? Yeah. Right. In a word: simple.
[00:06:27] Unknown:
Try to make it as simple as possible. At that stage, a lot of times you don't even know if you have product market fit, so all or most of your engineering resources should be going into actually figuring out product market fit, whether that's tweaking the product or figuring out who to talk to. That's where a lot of your time should be spent, and less time on building up any sort of scalable pipeline. And besides just not spending engineering time on it, especially at the start of a project, if you're following more of a lean or agile methodology, wherever you're pulling the data from is going to change a lot. So having something simple and lean that you can change quickly is probably the better way to go. And I wouldn't even be looking at any sort of complex metrics at that point either.
I would look at some really high level metrics, something like how many customers you have. I maybe wouldn't even worry about conversion rates. Revenue I'd probably get. And this is all stuff that you can probably pull right from whatever production or transactional system you have. So at that stage, you wouldn't even need a dedicated data warehouse. At that scale, I'm sure there is some time in the day when your database is free enough that you can issue a couple of SQL statements to get some data out of it. So: simple, in a word. Yeah. And as you're saying, just being able to run a few SQL statements and dump the results out to some CSVs
[00:07:54] Unknown:
and just do your processing in Excel or whatever spreadsheet program you use should be sufficient. That has the added benefit of saving the engineering resources that could be spent building the product, and also of not bogging down anyone else in the business who is trying to gain some insights from that data by having to train them up on whatever tool you're leveraging to create the reports, because pretty much everyone knows how to use a spreadsheet and can do their own analyses. At this stage, there isn't really enough complexity or enough different transition points that you have to worry too much about having some sort of golden master of the data, where different people get different insights from the same resources, because you're all probably going to be in the same room and can just talk it over. Right. Yeah. And you don't even have to worry about breaking any limits of Excel or Google Sheets; you won't have that much data.
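As a rough illustration of that spreadsheet-era workflow, the sketch below pulls a handful of high level metrics straight out of a hypothetical production database and writes them to a CSV that anyone can open in Excel or Google Sheets. The queries, table names, and connection string are all placeholders.

```python
"""Pull a few high level metrics from the production database into a CSV."""
import csv
import psycopg2

DSN = "dbname=app user=reporting host=db.internal"  # hypothetical

# Simple aggregate queries run during a quiet window; adjust to your schema.
QUERIES = {
    "total_customers": "SELECT count(*) FROM customers;",
    "orders_last_7_days":
        "SELECT count(*) FROM orders WHERE created_at >= now() - interval '7 days';",
    "revenue_last_7_days":
        "SELECT coalesce(sum(total), 0) FROM orders WHERE created_at >= now() - interval '7 days';",
}

with psycopg2.connect(DSN) as conn, conn.cursor() as cur, \
        open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["metric", "value"])
    for name, sql in QUERIES.items():
        cur.execute(sql)
        writer.writerow([name, cur.fetchone()[0]])
```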
And so as you start to build out the application or build out the business and gain more customers or more data, what are some of the indicators that you look for to signal that you're starting to reach that next order of magnitude in terms of scale of complexity, scale of data, and scale of the organization, where you need to redesign the requirements for your data pipeline, and what are some of your considerations in how you would approach that rearchitecture?
[00:09:21] Unknown:
Yeah. That's a good question. I think it's probably going to be hard to see if you're not looking for it, but we talked a little bit about product market fit. As you're starting to gain some traction, that is definitely the time when you should start paying a little bit more attention: alright, we're gaining some traction, we have some users, we have some idea of how the business runs, at least today. You'll start getting more insights into what your product is doing, how people are interfacing with your product, and, if you're doing anything with logistics, how your systems are operating. People will be able to ask more sophisticated questions. They'll start wanting to optimize a little bit more. They'll see inefficiencies, and that's really the time when it's good to start laying down a little bit more of a dedicated analytical system. You'll also start seeing different access patterns on your data. If you had some basic, very simple pipeline, people are going to start asking for more, and your SQL queries are going to get a little bit more complex,
start being a little bit more sophisticated. These are all signs that maybe you should be building up a more dedicated analytical system, particularly since the queries you'll be building will more than likely be aggregate in nature, and your production databases will more likely be transactional in nature. So having them separate, for separate use cases, starts to make sense around that time. And as you're starting to approach that medium scale, even if you don't have a dedicated data warehouse in place yet, one potentially beneficial next step beyond the spreadsheet approach is to start employing some sort of business intelligence
[00:11:05] Unknown:
tool, whether it's something like Metabase or Redash or Looker, so that everyone has one view of the data. They're all using the same queries instead of everybody crafting their own aggregates. That way you at least have some commonality in terms of the information that people are seeing, and you can store some of those computed aggregates within the business intelligence platform before you get to the full scale of having a
[00:11:31] Unknown:
data warehouse or a data lake. Yeah. And actually, you touched on something that's very dear to me. I listened to an episode of your podcast, I forget with who, talking about curating data, and that's one of the things that's very dear to me as well. I think that's sort of the starting point: with these common queries and access patterns, you start knowing what to look at to curate.
[00:12:01] Unknown:
So as you start to reach a more medium scale in terms of the organization and the data volumes that you're working with, what have you found to be some of the complexities and challenges that begin to present themselves as you start to build a more production grade data pipeline and run these jobs on a more frequent basis?
[00:12:20] Unknown:
Yes, there are definitely challenges. Any sort of data pipeline is going to have some amount of brittleness, and brittleness is maybe not the perfect word, but as data is moving there's a lot of potential for change. And once you're at this sort of medium scale, you probably don't want to take the time to decouple everything completely, which means things are a little bit more tightly integrated, so a change in one system may cascade into bigger failures. And then, just in terms of what to build, what is the right amount of stuff that you should be putting inside the data warehouse? I've definitely seen cases where, organizationally, people get very excited: okay, we're finally going to have more of an analytical data warehouse, and they want everything in there.
But it takes time to get everything in there, and they may not actually look at all of it. So really, in some ways it's about taking the learnings from the earlier forms of the queries and dashboards and spending your time on the really important pieces of data that people are looking at. And then, as you're building it out, you're going to have your issues with: wait, now it's a production system. This is something I've seen people skip or not go as deep on, and it's an easy part to miss: the monitoring of your pipelines tends to be something that just falls off sometimes. It's one of those things: it's working, people are analyzing the data.
People don't quite realize that it's actually a production grade system. At some point there's an interesting transition, once it goes from, okay, we're just pulling together SQL scripts, and it's important but not a production system, to: this is actually turning into a production system, and we should have production grade controls on it. Yeah. And I think one good
[00:14:20] Unknown:
metric to measure how much of a production system it is, is how often people complain when things stop working. Right. At the early stages, it'll be: oh, it's broken, nobody noticed, okay, this is fine. But as more people start to realize that there is a system in place, that it is valuable, and that they can gain some useful information from it, then they'll be more likely to let you know when things stop working. And then you start to realize: oh wait, I need to put some more quality controls and metrics and alerting in place to make sure that this stays running when I'm not looking at it actively.
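A hedged sketch of the kind of lightweight monitoring being described here: a freshness check that compares the latest load timestamp in the warehouse against a lag threshold and emails the team when it looks stale. The table, column, addresses, and threshold are invented for the example, and it assumes the load timestamp is stored as a timezone-aware value.

```python
"""A minimal freshness check: alert when the warehouse load looks stale."""
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage
import smtplib

import psycopg2  # warehouse connection; table and column names are made up

WAREHOUSE_DSN = "dbname=warehouse user=etl_monitor host=warehouse.internal"
MAX_LAG = timedelta(hours=6)  # how stale is too stale for this table

def latest_load_time():
    # Assumes loaded_at is a timestamptz column, so psycopg2 returns an
    # aware datetime that can be compared against UTC now.
    with psycopg2.connect(WAREHOUSE_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT max(loaded_at) FROM fact_orders;")
        return cur.fetchone()[0]

def alert(message):
    msg = EmailMessage()
    msg["Subject"] = "[ETL] fact_orders load is stale"
    msg["From"] = "etl-alerts@example.com"
    msg["To"] = "data-team@example.com"
    msg.set_content(message)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    last = latest_load_time()
    if last is None or datetime.now(timezone.utc) - last > MAX_LAG:
        alert(f"Last successful load was {last}; expected one within {MAX_LAG}.")
```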
[00:14:51] Unknown:
And in some ways your fixes end up being a little bit harder to deploy as well. Before, if you had just a SQL script: oh, it didn't work, let me tweak my SQL script and just put it back onto the server. Once you have a little bit more of a dedicated system, there's usually a more robust release process, which may take a little bit longer, and you'll probably have a lot more people looking at the code. So there are definitely PRs, reviews, and that sort of thing. And another problem that can start
[00:15:22] Unknown:
to make itself known even at the medium scale, but particularly as you get to a larger scale and more complicated analyses, is trying to minimize the impact on the source systems that you're pulling the data from to be able to populate these data lakes and data warehouses. So I'm wondering if you have found any particular strategies that are useful for trying to prevent any sort of production impact on the applications that are using those data sources as you're building these aggregates and doing these extractions?
[00:15:56] Unknown:
Yeah. It can be a big problem. Some of these queries can be quite intensive, and you don't want to take down your production system or even slow it down. In some ways it depends on the stack of your production system, but let me backtrack a little bit. One thing that I find is very important, and this goes back to my software engineering background, is thinking about things in terms of who has what responsibility and how you can appropriately decouple things. If you can start putting in some sort of well defined interface between your production system and your analytical system, whether that's putting well defined data onto shared storage, or some sort of streaming platform, or any sort of pub/sub architecture that the analytical system can plug into, that well defined interface between the two systems, if you put it in early, is going to help with not impacting the production system, because the production side usually has a very small amount of work to do: it just pushes once, and then it's done.
And then when you're reading it, you don't impact production at all. That said, there are other strategies you can use. Depending on the type of production or transactional system you have, if it's more of a relational database, something that works really well is just using the replication capabilities: have a replication node and point your analytics at the replica. That works really well. As you get a little bit more into NoSQL land, we've pulled stuff in from backups and other things that aren't actually talking to the production system, or another process of the production system pushes the data somewhere, since it knows its own access patterns better. But I'll repeat that if you have some sort of streaming system, that helps a lot. Yeah. And I like your point too about having a defined interface for being able to pull the data from, because that can help reduce some of the
[00:18:13] Unknown:
brittleness in sourcing the data. Because if you have a defined API, particularly if it's well versioned, then you can predictably have the same shape and structure of data each time you're running these jobs, rather than having to worry about any underlying database migrations that might occur as part of the application life cycle and then having that break your data loads and transformations because there's either an extra column you didn't account for, or a column has been renamed, or a data type has been changed. With that API, you're more likely to maintain consistency, and you're more likely to have a discussion, particularly as you start to split your teams between software engineering and data engineering, about the established interface that couples those two systems and those two organizational teams. Yep. Completely.
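One possible shape for that "push once" interface is the production application publishing versioned events onto a topic that the analytical pipeline consumes, so analytics never queries the transactional database directly. This sketch assumes a Kafka-style broker and the kafka-python client; the topic name and event schema are made up for illustration.

```python
"""Production side of a pub/sub interface between the app and analytics."""
import json
from kafka import KafkaProducer  # kafka-python client; broker is assumed

producer = KafkaProducer(
    bootstrap_servers=["broker1.internal:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def publish_order_event(order):
    # The event schema is the contract between the two teams; version it
    # explicitly so downstream consumers can detect changes.
    event = {
        "schema_version": 1,
        "order_id": order["id"],
        "total": str(order["total"]),
        "created_at": order["created_at"].isoformat(),
    }
    producer.send("orders.v1", value=event)
    producer.flush()  # small, once-per-event work on the production side
```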
And, as you mentioned, streaming systems. Another approach would be to use something like change data capture, which reintroduces some of the potential for brittleness as the structure of the database changes, but helps to reduce the overall impact on the source systems, because you're not using up computational resources at the web layer via some sort of API. It does increase the complexity and the challenge on the data engineering team of being able to reconstruct the data from those change data records. Yep. And so, as you again start to go beyond that medium scale and into another order of magnitude, into so called large scale and big data systems, and start to integrate multiple data sources together beyond just what your applications are producing, I'm wondering again if you have any sort of indicators that signal that you are starting to reach that next order of magnitude, and some of the ways that you start to consider redesigning your data pipelines, and some of the approaches that you would take to be able to build more high level and complex
[00:20:09] Unknown:
aggregations and metrics and analyses on top of those different data sources? In some ways, it's going to look similar to when you went from small to medium. Some parts of your system will get more stressed. You won't necessarily have similar queries, but you'll have people asking for the same metrics over and over again; I would say that's one big indicator that your product or organization is becoming mature. If you've done some of the medium scale pieces, I think getting to large scale becomes a lot easier. But if you haven't, you'll definitely hit some limits in terms of processing; you won't be able to keep up with the volumes of data when your data increases by millions or billions of records a day. These are all things that indicate you probably want a little bit more of a large scale system. In some ways, even just the number of different sources you want to integrate with is an indicator. As organizations grow, I've found the tools aren't necessarily standardized across teams. Your sales team could be broken up into different sorts of sales, and they may be using different CRM systems. Your marketing team may be doing different sorts of marketing, and they may be tracking those forms of marketing in different systems.
Your operations team may be looking to interface with some other tool that lets them really understand how the business is operating. And I think once you start seeing just the number of these things, that's another indication that you're getting into more of a larger scale.
[00:21:42] Unknown:
And so particularly when you reach the large scale of data and organization, but even potentially at medium to small scale, there's been a much bigger focus on using data lakes in place of data warehouses or as a supplement to them. So I'm wondering what your thoughts are on that overall approach and how you see data lakes fitting into the overall analytical infrastructure for a given organization?
[00:22:10] Unknown:
I really like data lakes, particularly when used appropriately. When I first heard the concept of data lakes, and some of my peers first heard the concept, some were really excited: they didn't have to do any work anymore, they could just dump everything in the data lake and be done. Others were afraid: well, if all the data's there, how are people going to analyze it? I think both of those reactions are sort of valid. Having a data lake really helps decouple things inside your data pipeline. The data lake can be a staging area for a lot of different data. It lets you have persistent storage sort of in the middle of what I would call a full ETL pipeline. So if a query breaks for some sort of curated data, you probably don't have to go all the way back into the transactional system; the data is already captured in the data lake. It lets you debug and develop solely in the analytical environment. But the data lake has to be managed. It can't just be a dumping ground. You still need to make sure you have organization in there, and you have to make sure you have access patterns. It's an important piece, and it helps a lot with scale. As long as it's not treated as a dumping ground, it works really well. Yeah. You don't want it to turn into a swamp.
Correct.
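A small sketch of the staging-area idea, assuming a Spark-on-EMR style stack like the one described earlier: raw extracts are landed in object storage partitioned by load date, so downstream transformations can be debugged and backfilled without touching the source systems. The bucket and paths are placeholders, not Grubhub's actual layout.

```python
"""Land raw extracts in the lake before any transformation is applied."""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("land_raw_orders").getOrCreate()

# Raw extract exactly as it came out of the source, no cleanup applied yet.
raw = spark.read.json("s3://example-data-lake/incoming/orders/2018-11-01/")

# Partition by load date so individual days can be reprocessed or backfilled
# cheaply without re-reading the transactional system.
(raw
 .withColumn("ds", F.lit("2018-11-01"))
 .write
 .mode("overwrite")
 .partitionBy("ds")
 .parquet("s3://example-data-lake/raw/orders/"))
```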
[00:23:32] Unknown:
And, as you mentioned, a lot of times they're used as sort of a staging ground for the data after it's been extracted from the transactional systems or from third party data sources, and the transformations can then be performed off of that staging ground. That can help minimize some of the potential loss of fidelity or loss of information from either bad transformations or ill considered transformations. So I'm wondering if you can talk a bit about some of your approaches on that front to try and reduce the impact of transformations on the quality and efficacy of the data that you're processing?
[00:24:15] Unknown:
Yeah. That's definitely one thing that's nice about having a data lake: it's a safety net in some ways. You're not going to lose anything, especially if you're going to do certain other transformations. Sometimes in the transformations you'll intentionally want to discard certain information. Maybe you have outliers that you want to clean up for certain workloads, or there's some form of record that you know is usually some sort of test pattern, or maybe it's somebody doing something weird with your transactional system, and this isn't the table that cares about that person doing something weird; it really wants to show how actual real users are using the system. So you'll actually throw those records away. But when people are using that table, questions can come up, and you can always go back to the more raw source data in your data lake, analyze that, and say: okay, this is why we've discarded these records, and maybe we need to tweak our logic. And then you can always run a backfill of the downstream ETL, whatever that table was, to clean it up.
In other ways, having validation frameworks when you're doing ETL and transformations also helps. That's another thing that needs balance and trade-offs. I've seen some validation checks be a little too aggressive; sometimes something weird is actually something normal, and if you're too aggressive then you'll start failing when things shouldn't actually be failing, but if you're too loose then you're going to have bad data. So I'd say validation frameworks are important, they just have to be used appropriately as well. But, again, back to the point: it's nice having that safety net of knowing my source data is there. I didn't lose anything, and I'm not going to take down production, because I have a data lake on some big storage.
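To illustrate the balance he describes between overly aggressive and overly loose checks, here is a deliberately tolerant validation rule that only fails a load when today's volume deviates far from a trailing baseline and merely warns on smaller swings, so a Thanksgiving-style dip warns rather than fails. The thresholds and numbers are illustrative.

```python
"""A loose row-count validation: fail only on large deviations from baseline."""

def validate_row_count(today_count, trailing_counts, max_deviation=0.5):
    """Return "fail" only if today's volume is more than max_deviation (50%)
    away from the trailing average; return "warn" on smaller swings."""
    if not trailing_counts:
        return "ok"  # nothing to compare against yet
    baseline = sum(trailing_counts) / len(trailing_counts)
    deviation = abs(today_count - baseline) / baseline
    if deviation > max_deviation:
        return "fail"
    if deviation > max_deviation / 2:
        return "warn"
    return "ok"

# A holiday dip of ~30% below the trailing average warns instead of failing.
print(validate_row_count(70_000, [100_000, 98_000, 103_000]))  # -> "warn"
```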
[00:26:15] Unknown:
And in terms of the actual workflow engine and the tools that you're using for performing these different stages of the pipeline, I'm wondering what you have found to be useful selection criteria in terms of the technology that's being used and the way that it fits into the organization and the team that's leveraging it.
[00:26:36] Unknown:
Yeah. I'd say when researching the technologies and the engines that we'll use, I really want a balance of ease of use, the features, and how flexible it is. I really want to make sure that it's something that fits inside of our environment. At Grubhub, we have certain tools that we standardize on, so whatever we pick should be able to integrate with Jenkins. Our data team is a Python shop, so people should be able to interact with it in Python. Making sure it fits within the skill set and the tooling of the organization is probably one of the more important things I would look at. Then there are other features I look at, like dependency management: anything that helps with managing dependencies between jobs. The dependency chain can end up being quite complicated, so having a tool that handles that is really nice. Having a tool that lets me debug stages in the pipeline is nice, so you have some UI that shows where things failed.
Being able to start and stop different steps and jobs is really nice. And making sure that it's not too hard for developers to get up and running with it, so it's something that people will find to be a value add. And so what are the tools that you're using now and have used previously
[00:27:55] Unknown:
that you have been most satisfied with?
[00:27:59] Unknown:
Yeah. Right now we're using Azkaban. It has some really nice things: the UI is pretty nice, and it lets us deploy things out pretty easily. We've built some custom things on top of it that let us incorporate continuous integration into it. In the past, other tools that I've used include Luigi. Luigi had similar things. I don't remember being able to quite as easily start and stop different stages in a job, but I really liked its visualization layer, and I found I was able to write more customized things with Luigi. Those are probably the two big ones that I've used. And, I mean, I've used crontab in the past too, but there's not much nice about that.
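For readers who have not used one of these schedulers, a minimal Luigi sketch of the dependency management discussed above: the aggregate task declares the extract task as a requirement, and the scheduler works out the run order and skips tasks whose output already exists. The file paths and logic are placeholders, not a description of Grubhub's actual jobs.

```python
"""Two Luigi tasks wired together through requires()/output()."""
import luigi

class ExtractOrders(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/orders_{self.date}.csv")

    def run(self):
        # A real job would pull from the source system; this just writes a stub.
        with self.output().open("w") as f:
            f.write("order_id,total\n")

class DailyRevenue(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # The dependency chain lives here; Luigi runs ExtractOrders first.
        return ExtractOrders(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/marts/revenue_{self.date}.csv")

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            rows = raw.readlines()[1:]
            total = sum(float(line.split(",")[1]) for line in rows)
            out.write(f"{self.date},{total}\n")

# Run with: luigi --module this_module DailyRevenue --date 2018-11-01 --local-scheduler
```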
[00:28:49] Unknown:
I don't really have anything to say on that front. It's great when it works; often it doesn't. And in terms of your preference of build versus buy for the tools that you're using, both for the workflow engine and the different storage and processing layers, how has that changed over the course of your career and at different scales of operation?
[00:29:15] Unknown:
I would say my opinion on build versus buy is that whatever is critical to your business, whatever is your core business, you should always build. Nobody's going to know the business as well as you do. Anything that's ancillary to that, you should buy. And I don't think that stance really changes at different scales, but what becomes critical to your business can change at scale. I would definitely say any sort of logic for pulling data in from source systems: the source systems have some criticality inside your organization or your business, and those have usually been customized. So if you're support heavy and you're using something like Zendesk to manage some sort of ticket flows, and you have custom things built inside of there, you might want to build the API extraction from Zendesk into your data warehouse yourself, just because you might need to know what some of those custom data points mean. But something like a workflow scheduler doesn't really affect the business, or even just cluster management; we use EMR at Grubhub, and cluster management is not necessarily core, so we can buy that. And the definition of buy in this context has even become fuzzy with the proliferation of different open source tools and frameworks. So I'm wondering what your thoughts are on
[00:30:41] Unknown:
what qualifies as buying versus building these days, because that line is getting pretty blurry. Yeah, it is. Sometimes
[00:30:49] Unknown:
open source is great, especially in the big data world, for having a lot of the tools that are necessary already built. But I would say sometimes it doesn't necessarily have all the polish of what you would get if you buy something. So it really depends on how critical this is to your business and where the failing points are. A good example would be something like Redash, which I think you mentioned earlier. It's open source, which is really good for getting up and running; you can give people access to the data using Redash without putting too much effort into any sort of sales cycle or evaluation period, but it's missing a lot of features that a lot of business users, and I myself, would actually like. So sometimes buying may be beneficial there. And it doesn't have to be an all or nothing solution either. You can mix and match.
[00:31:48] Unknown:
And in your current role in particular, but also in your past experiences, what are some of the types of dead ends or edge cases that you've had to deal with in terms of building and managing and growing these different data pipelines? So I'd say
[00:32:04] Unknown:
one sort of mistake that I've seen is that people can end up being siloed. I talked a little bit about how you can decouple your data pipeline, and you have your transactional system, your data lake, and your curated data assets; people can end up being siloed into those and not necessarily thinking about the data pipeline as a whole. It's very natural. You have a software engineer working on a transactional system, and if you don't have any sort of well defined interface between that transactional system and the analytics system, it's very easy for the software engineer to just not think about the analytical system, because it's not something they need to. And similarly on the analytical side, it's very easy to say: well, I'm loading data into the data lake, and that's all I really need to look at, and then forget that people are going to need to pull data out of it for different use cases.
Even when you're building curated assets, it's important to note why you're building them. Sometimes it can be easy to overlook, and my view is that the reason we're building this at all, one of the most pressing use cases, is really to make sure that you can analyze and track your business. You can start adding value back into the business too, once you start talking about data science models and being able to do feedback loops into the transactional system. But you have to make sure that people understand that at every stage you're building towards a holistic pipeline. If you lose sight of that, sometimes you'll do duplicate work or extra work, or things can be brittle or break. So I would say that's one thing I'd be wary about. And
[00:33:51] Unknown:
what have you found to be some of the common
[00:33:54] Unknown:
edge cases that you have run up against or overlooked aspects of building these data pipelines? I'd say some edge cases are just in terms of scale; I can give some examples of things that have happened. This wasn't directly on my team, and I forget some of the specifics, but there was definitely an ETL process where, with how it was processing things, we ended up failing because we hit the max int number. That was definitely something we didn't account for. Then we have other sorts of edge cases in terms of the business. I think I was talking about validation frameworks before; in the Grubhub business, on Thanksgiving,
our volume drops off quite a bit. We have had, in the past, all sorts of validation warning bells going off, and it was normal, because people were just eating dinner at home. And the other thing, which I wouldn't say is an edge case particular to Grubhub, is just an airing of my grievances with time zones in general: I'm not a fan. They cause problems everywhere, particularly for these kinds of use cases. I mean, it's easy to store UTC, but we deal with times everywhere. When does your job kick off? If you have a time based scheduler, when does it kick off? That changes mid-year. But then, if you're storing everything in UTC, not everybody is going to want to look at it in UTC, and how do you make sure that you're exposing
[00:35:17] Unknown:
the right times to the right person? So time zones: not a fan. Yeah. There's a great list of falsehoods that programmers believe about time, most of which are contradictory to the other items in the list, so I'll add that to the show notes because it's always good for a laugh and a groan. And what are some of the plans that you have going forward for improving the pipeline that you're building at Grubhub and trying to bring it to the next level of scale and resiliency?
[00:35:48] Unknown:
Yeah. Some of the things we've touched on: we talked a little bit about how we have our two data warehouses. We're going to start leveraging streaming even more. We have some streams in place, but it's gotten to a point where we really need to use them more in order to scale efficiently, so that's definitely a hot item that we're going to do. The second piece is less about the pipeline in general and more about metadata about our pipeline. Another thing that we've seen is that as all these transformations are happening and data is moving, it gets really hard to know, if I'm looking at this particular column in this particular row, how did that data get there, and what does it actually mean?
What could have gone wrong along the way? So we're really putting in more pieces around that, whether that's data lineage or data dictionaries. That's another piece where we're going to be improving our pipelines. And are there any particular
[00:36:48] Unknown:
references or resources that you have found particularly useful over your career, or anything that you recommend for anyone who's looking to build and design a new data platform or new data pipelines?
[00:37:02] Unknown:
Oh, yeah. First, I would plug my blog entry on bytes.grubhub.com. It's high level and gives you some broad ideas as to where to go, but it's good for getting in the right mindset. Personally, Airbnb has some great articles that I've looked at. I've also had some really good success participating in local data user groups and talking with people about what they use. I could probably dig up some more references after this that we could put in the show notes; I'm blanking on any particular article that I would recommend, but I can try to look for some of them and maybe we can post them. Yeah. We'll definitely include those in the show notes. And are there any other aspects
[00:37:45] Unknown:
of building data pipelines and scaling organizations and applications and technology stacks that we didn't cover yet which you think we should discuss before we close out the show? I think that was most of what we said we were going to talk through. And so, for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I think this goes back to one of the improvements that we want to do:
[00:38:21] Unknown:
particularly when you're talking about open source tools and big data tools, a tool that helps holistically tie data lineage together has been really tough to find. There are definitely some out there, but they don't integrate with everything, or they're really hard to integrate with everything. I would say that's probably the one big piece that I'd be looking for. We have all these great tools for how to measure our business, but for how to measure the measurement, I've been having trouble finding
[00:38:54] Unknown:
really great tools for that. Alright. Well, thank you very much for taking the time today to join me and discuss your experiences of building and scaling data pipelines and organizations. It's been a useful conversation for me, and I appreciate that. I hope you enjoy the rest of your day. Yeah, you too. This was great. It was fun. Thank you.
Introduction to Christian Heinzmann and Grubhub's Data Engineering
Challenges of Managing Multiple Data Warehouses
Defining and Building Data Pipelines
Scaling Data Pipelines for Growing Businesses
Minimizing Impact on Source Systems
Transitioning to Large Scale Data Systems
The Role of Data Lakes in Analytical Infrastructure
Ensuring Data Quality During Transformations
Selecting Workflow Engines and Tools
Common Pitfalls and Edge Cases in Data Pipelines
Future Plans for Grubhub's Data Pipeline
Resources and Final Thoughts