Summary
Every business needs a pipeline for its critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing your definition of a data pipeline?
- At what point in the life of a project or organization should you start thinking about building a pipeline?
- In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
- What metrics/use cases should you be optimizing for at this point?
- What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
- How do the design requirements for a data pipeline change as you reach this stage?
- What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
- What are some of the changes that are necessary as you move to a large scale data pipeline?
- At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
- In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
- When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
- Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
- What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
- How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
- What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
- What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
- What are your plans for improving your current pipeline at Grubhub?
- What are some references that you recommend for anyone who is designing a new data platform?
Contact Info
- @sirchristian on Twitter
- Blog
- sirchristian on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Scaling ETL blog post
- GrubHub
- Data Warehouse
- Redshift
- Spark
- Hive
- Amazon EMR
- Looker
- Redash
- Metabase
- A Primer on Enterprise Data Curation
- Pub/Sub (Publish-Subscribe Pattern)
- Change Data Capture
- Jenkins
- Python
- Azkaban
- Luigi
- Zendesk
- Data Lineage
- AirBnB Engineering Blog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them, so check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey, and today I'm interviewing Christian Heinzmann about how data pipelines evolve as your business grows. So, Christian, could you start by introducing yourself? Sure. I'm Christian Heinzmann. I'm the director of data
[00:01:11] Unknown:
engineering for data warehousing at Grubhub. I've been there for about two years. What I've really been working on is our ETL pipelines. We have two data warehouses: one is more traditional, and one is our big data pipeline. That's what I do day to day. I've learned a lot of things along the way, made some right decisions, and made some wrong decisions. Alright. And do you remember how you first got involved in the area of data management? Yeah. I started my career as more of a traditional software engineer, but a lot of my work was heavy data processing. There was a lot of scraping of websites, cleaning up that data, and storing it. As I was going through that, I got very interested in the business side, in what the business needs, and pivoted my career a little bit more towards startups.
And in the startup world, I started building more of our analytical data warehouses, which let me interface with all the areas of the business, which was something I was interested in. I got really interested in business and data warehousing and how you deal with all this data, started moving up through the startup world, and ended up going to Grubhub, where we have large amounts of interesting data problems. And speaking of interesting problems,
[00:02:34] Unknown:
the fact that you have two different data warehouses for, I'm assuming, slightly different purposes, I'm sure poses some unique challenges as far as how you're processing the data and making sure that everything stays in sync. So I'm wondering if you can just briefly talk about the main use cases that each storage location serves in terms of the broader business needs. They're trying to serve
[00:02:58] Unknown:
very similar use cases. In some ways it's really about how we do data warehousing at the scale that we are. We do our more traditional data warehousing in Redshift, which is really good for fast ad hoc queries as long as the server is not under too much load. But where it really had challenges was scaling the write part, the loading part. That was really what started the move to more of a big data stack. In our big data stack, a lot of our ETL is done with Spark and Hive on top of Amazon EMR. We can spin clusters up and down, which basically gives us almost infinite amounts of compute, and that really helps our write scale. That's probably the big difference with that data warehouse. And there are definitely challenges keeping them in sync. Sometimes we try to sync from one to the other, sometimes we don't. It's a work in progress, it sounds like. Yep. It's definitely one of those things. Some of the challenges I wrote about in the scaling ETL post came from the fact that we weren't able to scale with Redshift. There were a lot of other choices in how we could have done it that probably would have made this transition a little bit easier. And so given your experience
[00:04:24] Unknown:
at startups, and now moving to Grubhub and trying to scale the capabilities for data processing there, you ended up writing an interesting blog post that inspired me to reach out to you and talk a bit more about your experiences of building ETL and data processing pipelines for these different scales of organization and data volume. So I'm wondering if you can just start by sharing what your definition is of a data pipeline and how you think about it.
[00:04:55] Unknown:
Sure. So when I say data pipeline, I actually have a very simplistic definition. I would say any time you're going to move or transform data from one place to another, that's your data pipeline. It could be something very simple; you could just have a cron job that runs a SQL query, and that's your data pipeline. It could be as complex as a dedicated scheduler pulling data from streams and from multiple data sources, combining them together, with multiple steps in your transformation jobs. I would call all of those data pipelines. I think a lot of people end up thinking of a data pipeline as the latter, but I would say that any time you're moving data, you're building a data pipeline.
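To make that definition concrete, here is a minimal sketch of the simplest kind of pipeline he describes: a cron-scheduled script that runs one SQL query to move data from a transactional table into a reporting table. The connection string and table names are illustrative rather than Grubhub's actual schema, and the sketch assumes a Postgres database reachable with psycopg2.

```python
#!/usr/bin/env python3
"""Nightly cron job: summarize yesterday's orders into a reporting table."""
import psycopg2  # assumes a Postgres transactional database

# Hypothetical connection string; point this at a replica if you have one.
DSN = "dbname=app user=reporting host=db.internal"

SUMMARIZE = """
    INSERT INTO daily_order_summary (order_date, orders, revenue)
    SELECT created_at::date, count(*), sum(total)
    FROM orders
    WHERE created_at >= current_date - interval '1 day'
      AND created_at < current_date
    GROUP BY created_at::date;
"""

def main():
    # One query, one destination table: that is already a data pipeline.
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(SUMMARIZE)

if __name__ == "__main__":
    # A crontab entry like "0 5 * * * /opt/etl/daily_summary.py" schedules it.
    main()
```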
[00:05:46] Unknown:
And so given that very broad definition, any time that you need to deal with data at all, you can start thinking about it in terms of pipelining operations. In the beginning of the post that you wrote, you were describing that when you're first building an application or starting to build out an organization, your pipeline should be very simplistic and mostly manual. So I'm wondering if you can discuss some of the approaches that you take at the early stages of a business, with small scale data, and some of the design characteristics that you should be targeting for that type of pipeline? Yeah. Right. In a word: simple.
[00:06:27] Unknown:
Try to make it as simple as possible. At that stage, a lot of times you don't even know if you have product market fit, so all or most of your engineering resources should be going into actually figuring out product market fit, whether that's tweaking the product or figuring out who to talk to. That's where a lot of your time should be spent, and less time on building up any sort of scalable pipeline. And besides just not spending engineering time on it, especially at the start of a project, if you're following more of a lean or agile methodology, wherever you're pulling the data from is going to change a lot. So having something simple and lean that you can change quickly is probably the better way to go. And I wouldn't even be looking at any sort of complex metrics at that point either.
I would look at some really high level metrics, something like how many customers you have. I maybe wouldn't even worry about conversion rates. Revenue I'd probably get. And this is all stuff that you can probably pull right from whatever production or transactional system you have. So at that stage, you wouldn't even need a dedicated data warehouse. At that scale, I'm sure there is some time in the day when your database is free enough that you can issue a couple of SQL statements to get some data out of it. So: simple, in a word. Yeah. And as you're saying, just being able to run a few SQL statements and dump the results out to some CSVs
[00:07:54] Unknown:
and just do your processing in Excel or whatever spreadsheet program you use should be sufficient. That has the added benefit of saving the engineering resources that could be spent building the product, and also of not bogging down anyone else in the business who is trying to gain some insights from that data by having to train them up on whatever tool you're leveraging to create the reports, because pretty much everyone knows how to use a spreadsheet and can do their own analyses. At this stage, there isn't really enough complexity or enough different transition points that you have to worry too much about having some sort of golden master of the data, where different people get different insights from the same resources, because you're all probably going to be in the same room and can just talk it over. Right. Yeah. And you don't even have to worry about breaking any limits of Excel or Google Sheets; you won't have that much data.
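As a rough illustration of that spreadsheet-era workflow, the sketch below pulls a handful of high level metrics straight out of a hypothetical production database and writes them to a CSV that anyone can open in Excel or Google Sheets. The queries, table names, and connection string are all placeholders.

```python
"""Pull a few high level metrics from the production database into a CSV."""
import csv
import psycopg2

DSN = "dbname=app user=reporting host=db.internal"  # hypothetical

# Simple aggregate queries run during a quiet window; adjust to your schema.
QUERIES = {
    "total_customers": "SELECT count(*) FROM customers;",
    "orders_last_7_days":
        "SELECT count(*) FROM orders WHERE created_at >= now() - interval '7 days';",
    "revenue_last_7_days":
        "SELECT coalesce(sum(total), 0) FROM orders WHERE created_at >= now() - interval '7 days';",
}

with psycopg2.connect(DSN) as conn, conn.cursor() as cur, \
        open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["metric", "value"])
    for name, sql in QUERIES.items():
        cur.execute(sql)
        writer.writerow([name, cur.fetchone()[0]])
```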
And so as you start to build out the application or build out the business and gain more customers or more data, what are some of the indicators that you look for to signal that you're starting to reach that next order of magnitude in terms of scale of complexity, scale of data, and scale of the organization, where you need to redesign the requirements for your data pipeline, and what are some of your considerations in how you would approach that rearchitecture?
[00:09:21] Unknown:
Yeah. That's a good question. I think it's probably going to be hard to see if you're not looking for it, but we talked a little bit about product market fit. As you're starting to gain some traction, that is definitely the time when you should start paying a little bit more attention: alright, we're gaining some traction, we have some users, we have some idea of how the business runs, at least today. You'll start getting more insights into what your product is doing, how people are interfacing with your product, and, if you're doing anything with logistics, how your systems are operating. People will be able to ask more sophisticated questions. They'll start wanting to optimize a little bit more. They'll see inefficiencies, and that's really the time when it's good to start laying down a little bit more of a dedicated analytical system. You'll also start seeing different access patterns on your data. If you had some basic, very simple pipeline, people are going to start asking for more, and your SQL queries are going to get a little bit more complex,
start being a little bit more sophisticated. These are all signs that maybe you should be building up a more dedicated analytical system, particularly since the queries you'll be building will more than likely be aggregate in nature, and your production databases will more likely be transactional in nature. So having them separate, for separate use cases, starts to make sense around that time. And as you're starting to approach that medium scale, even if you don't have a dedicated data warehouse in place yet, one potentially beneficial next step beyond the spreadsheet approach is to start employing some sort of business intelligence
[00:11:05] Unknown:
tool, whether it's something like Metabase or Redash or Looker, so that everyone has one view of the data. They're all using the same queries instead of everybody crafting their own aggregates. That way you at least have some commonality in terms of the information that people are seeing, and you can store some of those computed aggregates within the business intelligence platform before you get to the full scale of having a
[00:11:31] Unknown:
data warehouse or a data lake. Yeah. And actually, you touched on something that's very dear to me. I listened to an episode of your podcast, I forget with who, talking about curating data, and that's one of the things that's very dear to me as well. I think that's sort of the starting point: with these common queries and access patterns, you start knowing what to look at to curate.
[00:12:01] Unknown:
So as you start to reach a more medium scale in terms of the organization and the data volumes that you're working with, what have you found to be some of the complexities and challenges that begin to present themselves as you start to build a more production grade data pipeline and run these jobs on a more frequent basis?
[00:12:20] Unknown:
Yes, there are definitely challenges. Any sort of data pipeline is going to have some amount of brittleness, and brittleness is maybe not the perfect word, but as data is moving there's a lot of potential for change. And once you're at this sort of medium scale, you probably don't want to take the time to decouple everything completely, which means things are a little bit more tightly integrated, so a change in one system may cascade into bigger failures. And then, just in terms of what to build, what is the right amount of stuff that you should be putting inside the data warehouse? I've definitely seen cases where, organizationally, people get very excited: okay, we're finally going to have more of an analytical data warehouse, and they want everything in there.
But it takes time to get everything in there, and they may not actually look at all of it. So really, in some ways it's about taking the learnings from the earlier forms of the queries and dashboards and spending your time on the really important pieces of data that people are looking at. And then, as you're building it out, you're going to have your issues with: wait, now it's a production system. This is something I've seen people skip or not go as deep on, and it's an easy part to miss: the monitoring of your pipelines tends to be something that just falls off sometimes. It's one of those things: it's working, people are analyzing the data.
People don't quite realize that it's actually a production grade system. At some point there's an interesting transition, once it goes from, okay, we're just pulling together SQL scripts, and it's important but not a production system, to: this is actually turning into a production system, and we should have production grade controls on it. Yeah. And I think one good
[00:14:20] Unknown:
metric to measure how much of a production system it is, is how often people complain when things stop working. Right. At the early stages, it'll be: oh, it's broken, nobody noticed, okay, this is fine. But as more people start to realize that there is a system in place, that it is valuable, and that they can gain some useful information from it, then they'll be more likely to let you know when things stop working. And then you start to realize: oh wait, I need to put some more quality controls and metrics and alerting in place to make sure that this stays running when I'm not looking at it actively.
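A hedged sketch of the kind of lightweight monitoring being described here: a freshness check that compares the latest load timestamp in the warehouse against a lag threshold and emails the team when it looks stale. The table, column, addresses, and threshold are invented for the example, and it assumes the load timestamp is stored as a timezone-aware value.

```python
"""A minimal freshness check: alert when the warehouse load looks stale."""
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage
import smtplib

import psycopg2  # warehouse connection; table and column names are made up

WAREHOUSE_DSN = "dbname=warehouse user=etl_monitor host=warehouse.internal"
MAX_LAG = timedelta(hours=6)  # how stale is too stale for this table

def latest_load_time():
    # Assumes loaded_at is a timestamptz column, so psycopg2 returns an
    # aware datetime that can be compared against UTC now.
    with psycopg2.connect(WAREHOUSE_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT max(loaded_at) FROM fact_orders;")
        return cur.fetchone()[0]

def alert(message):
    msg = EmailMessage()
    msg["Subject"] = "[ETL] fact_orders load is stale"
    msg["From"] = "etl-alerts@example.com"
    msg["To"] = "data-team@example.com"
    msg.set_content(message)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    last = latest_load_time()
    if last is None or datetime.now(timezone.utc) - last > MAX_LAG:
        alert(f"Last successful load was {last}; expected one within {MAX_LAG}.")
```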
[00:14:51] Unknown:
And in some ways your fixes end up being a little bit harder to deploy as well. Before, if you had just a SQL script: oh, it didn't work, let me tweak my SQL script and just put it back onto the server. Once you have a little bit more of a dedicated system, there's usually a more robust release process, which may take a little bit longer, and you'll probably have a lot more people looking at the code. So there are definitely PRs, reviews, and that sort of thing. And another problem that can start
[00:15:22] Unknown:
to make itself known even at the medium scale, but particularly as you get to a larger scale and more complicated analyses, is trying to minimize the impact on the source systems that you're pulling the data from to be able to populate these data lakes and data warehouses. So I'm wondering if you have found any particular strategies that are useful for trying to prevent any sort of production impact on the applications that are using those data sources as you're building these aggregates and doing these extractions?
[00:15:56] Unknown:
Yeah. It can be a big problem. Some of these queries can be quite intensive, and you don't want to take down your production system or even slow it down. In some ways it depends on the stack of your production system, but let me backtrack a little bit. One thing that I find is very important, and this goes back to my software engineering background, is thinking about things in terms of who has what responsibility and how you can appropriately decouple things. If you can start putting in some sort of well defined interface between your production system and your analytical system, whether that's putting well defined data onto shared storage, or some sort of streaming platform, or any sort of pub/sub architecture that the analytical system can plug into, that well defined interface between the two systems, if you put it in early, is going to help with not impacting the production system, because the production side usually has a very small amount of work to do: it just pushes once, and then it's done.
And then when you're reading it, you don't impact production at all. That said, there are other strategies you can use. Depending on the type of production or transactional system you have, if it's more of a relational database, something that works really well is just using the replication capabilities: have a replication node and point your analytics at the replica. That works really well. As you get a little bit more into NoSQL land, we've pulled stuff in from backups and other things that aren't actually talking to the production system, or another process of the production system pushes the data somewhere, since it knows its own access patterns better. But I'll repeat that if you have some sort of streaming system, that helps a lot. Yeah. And I like your point too about having a defined interface for being able to pull the data from, because that can help reduce some of the
[00:18:13] Unknown:
brittleness in sourcing the data. Because if you have a defined API, particularly if it's well versioned, then you can predictably have the same shape and structure of data each time you're running these jobs, rather than having to worry about any underlying database migrations that might occur as part of the application life cycle and then having that break your data loads and transformations because there's either an extra column you didn't account for, or a column has been renamed, or a data type has been changed. With that API, you're more likely to maintain consistency, and you're more likely to have a discussion, particularly as you start to split your teams between software engineering and data engineering, about the established interface that couples those two systems and those two organizational teams. Yep. Completely.
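One possible shape for that "push once" interface is the production application publishing versioned events onto a topic that the analytical pipeline consumes, so analytics never queries the transactional database directly. This sketch assumes a Kafka-style broker and the kafka-python client; the topic name and event schema are made up for illustration.

```python
"""Production side of a pub/sub interface between the app and analytics."""
import json
from kafka import KafkaProducer  # kafka-python client; broker is assumed

producer = KafkaProducer(
    bootstrap_servers=["broker1.internal:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def publish_order_event(order):
    # The event schema is the contract between the two teams; version it
    # explicitly so downstream consumers can detect changes.
    event = {
        "schema_version": 1,
        "order_id": order["id"],
        "total": str(order["total"]),
        "created_at": order["created_at"].isoformat(),
    }
    producer.send("orders.v1", value=event)
    producer.flush()  # small, once-per-event work on the production side
```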
And, as you mentioned, streaming systems. Another approach would be to use something like change data capture, which reintroduces some of the potential for brittleness as the structure of the database changes, but helps to reduce the overall impact on the source systems, because you're not using up computational resources at the web layer via some sort of API. It does increase the complexity and the challenge on the data engineering team of being able to reconstruct the data from those change data records. Yep. And so, as you again start to go beyond that medium scale and into another order of magnitude, into so called large scale and big data systems, and start to integrate multiple data sources together beyond just what your applications are producing, I'm wondering again if you have any sort of indicators that signal that you are starting to reach that next order of magnitude, and some of the ways that you start to consider redesigning your data pipelines, and some of the approaches that you would take to be able to build more high level and complex
[00:20:09] Unknown:
aggregations and metrics and analyses on top of those different data sources? In some ways, it's going to look similar to when you went from small to medium. Some parts of your system will get more stressed. You won't necessarily have similar queries, but you'll have people asking for the same metrics over and over again; I would say that's one big indicator that your product or organization is becoming mature. If you've done some of the medium scale pieces, I think getting to large scale becomes a lot easier. But if you haven't, you'll definitely hit some limits in terms of processing; you won't be able to keep up with the volumes of data when your data increases by millions or billions of records a day. These are all things that indicate you probably want a little bit more of a large scale system. In some ways, even just the number of different sources you want to integrate with is an indicator. As organizations grow, I've found the tools aren't necessarily standardized across teams. Your sales team could be broken up into different sorts of sales, and they may be using different CRM systems. Your marketing team may be doing different sorts of marketing, and they may be tracking those forms of marketing in different systems.
Your operations team may be looking to interface with some other tool that lets them really understand how the business is operating. And I think once you start seeing just the number of these things, that's another indication that you're getting into more of a larger scale.
[00:21:42] Unknown:
And so particularly when you reach the large scale of data and organization, but even potentially at medium to small scale, there's been a much bigger focus on using data lakes in place of data warehouses or as a supplement to them. So I'm wondering what your thoughts are on that overall approach and how you see data lakes fitting into the overall analytical infrastructure for a given organization?
[00:22:10] Unknown:
I really like data lakes, particularly when used appropriately. When I first heard the concept of data lakes, and some of my peers first heard the concept, some were really excited: they didn't have to do any work anymore, they could just dump everything in the data lake and be done. Others were afraid: well, if all the data's there, how are people going to analyze it? I think both of those reactions are sort of valid. Having a data lake really helps decouple things inside your data pipeline. The data lake can be a staging area for a lot of different data. It lets you have persistent storage sort of in the middle of what I would call a full ETL pipeline. So if a query breaks for some sort of curated data, you probably don't have to go all the way back into the transactional system; the data is already captured in the data lake. It lets you debug and develop solely in the analytical environment. But the data lake has to be managed. It can't just be a dumping ground. You still need to make sure you have organization in there, and you have to make sure you have access patterns. It's an important piece, and it helps a lot with scale. As long as it's not treated as a dumping ground, it works really well. Yeah. You don't want it to turn into a swamp.
Correct.
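A small sketch of the staging-area idea, assuming a Spark-on-EMR style stack like the one described earlier: raw extracts are landed in object storage partitioned by load date, so downstream transformations can be debugged and backfilled without touching the source systems. The bucket and paths are placeholders, not Grubhub's actual layout.

```python
"""Land raw extracts in the lake before any transformation is applied."""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("land_raw_orders").getOrCreate()

# Raw extract exactly as it came out of the source, no cleanup applied yet.
raw = spark.read.json("s3://example-data-lake/incoming/orders/2018-11-01/")

# Partition by load date so individual days can be reprocessed or backfilled
# cheaply without re-reading the transactional system.
(raw
 .withColumn("ds", F.lit("2018-11-01"))
 .write
 .mode("overwrite")
 .partitionBy("ds")
 .parquet("s3://example-data-lake/raw/orders/"))
```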
[00:23:32] Unknown:
And, as you mentioned, a lot of times they're used as sort of a staging ground for the data after it's been extracted from the transactional systems or from third party data sources, and the transformations can then be performed off of that staging ground. That can help minimize some of the potential loss of fidelity or loss of information from either bad transformations or ill considered transformations. So I'm wondering if you can talk a bit about some of your approaches on that front to try and reduce the impact of transformations on the quality and efficacy of the data that you're processing?
[00:24:15] Unknown:
Yeah. That's definitely one thing that's nice about having a data lake: it's a safety net in some ways. You're not going to lose anything, especially if you're going to do certain other transformations. Sometimes in the transformations you'll intentionally want to discard certain information. Maybe you have outliers that you want to clean up for certain workloads, or there's some form of record that you know is usually some sort of test pattern, or maybe it's somebody doing something weird with your transactional system, and this isn't the table that cares about that person doing something weird; it really wants to show how actual real users are using the system. So you'll actually throw those records away. But when people are using that table, questions can come up, and you can always go back to the more raw source data in your data lake, analyze that, and say: okay, this is why we've discarded these records, and maybe we need to tweak our logic. And then you can always run a backfill of the downstream ETL, whatever that table was, to clean it up.
In other ways, having validation frameworks when you're doing ETL and transformations also helps. That's another thing that needs balance and trade-offs. I've seen some validation checks be a little too aggressive; sometimes something weird is actually something normal, and if you're too aggressive then you'll start failing when things shouldn't actually be failing, but if you're too loose then you're going to have bad data. So I'd say validation frameworks are important, they just have to be used appropriately as well. But, again, back to the point: it's nice having that safety net of knowing my source data is there. I didn't lose anything, and I'm not going to take down production, because I have a data lake on some big storage.
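To illustrate the balance he describes between overly aggressive and overly loose checks, here is a deliberately tolerant validation rule that only fails a load when today's volume deviates far from a trailing baseline and merely warns on smaller swings, so a Thanksgiving-style dip warns rather than fails. The thresholds and numbers are illustrative.

```python
"""A loose row-count validation: fail only on large deviations from baseline."""

def validate_row_count(today_count, trailing_counts, max_deviation=0.5):
    """Return "fail" only if today's volume is more than max_deviation (50%)
    away from the trailing average; return "warn" on smaller swings."""
    if not trailing_counts:
        return "ok"  # nothing to compare against yet
    baseline = sum(trailing_counts) / len(trailing_counts)
    deviation = abs(today_count - baseline) / baseline
    if deviation > max_deviation:
        return "fail"
    if deviation > max_deviation / 2:
        return "warn"
    return "ok"

# A holiday dip of ~30% below the trailing average warns instead of failing.
print(validate_row_count(70_000, [100_000, 98_000, 103_000]))  # -> "warn"
```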
[00:26:15] Unknown:
And in terms of the actual workflow engine and the tools that you're using for performing these different stages of the pipeline, I'm wondering what you have found to be useful selection criteria in terms of the technology that's being used and the way that it fits into the organization and the team that's leveraging it.
[00:26:36] Unknown:
Yeah. I'd say when researching the technologies and the engines that we'll use, I really want a balance of ease of use, the features, and how flexible it is. I really want to make sure that it's something that fits inside of our environment. At Grubhub, we have certain tools that we standardize on, so whatever we pick should be able to integrate with Jenkins. Our data team is a Python shop, so people should be able to interact with it in Python. Making sure it fits within the skill set and the tooling of the organization is probably one of the more important things I would look at. Then there are other features I look at, like dependency management: anything that helps with managing dependencies between jobs. The dependency chain can end up being quite complicated, so having a tool that handles that is really nice. Having a tool that lets me debug stages in the pipeline is nice, so you have some UI that shows where things failed.
Being able to start and stop different steps and jobs is really nice. And making sure that it's not too hard for developers to get up and running with it, so it's something that people will find to be a value add. And so what are the tools that you're using now and have used previously
[00:27:55] Unknown:
that you have been most satisfied with?
[00:27:59] Unknown:
Yeah. Right now we're using Azkaban. It has some really nice things: the UI is pretty nice, and it lets us deploy things out pretty easily. We've built some custom things on top of it that let us incorporate continuous integration into it. In the past, other tools that I've used include Luigi. Luigi had similar things. I don't remember being able to quite as easily start and stop different stages in a job, but I really liked its visualization layer, and I found I was able to write more customized things with Luigi. Those are probably the two big ones that I've used. And, I mean, I've used crontab in the past too, but there's not much nice about that.
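For readers who have not used one of these schedulers, a minimal Luigi sketch of the dependency management discussed above: the aggregate task declares the extract task as a requirement, and the scheduler works out the run order and skips tasks whose output already exists. The file paths and logic are placeholders, not a description of Grubhub's actual jobs.

```python
"""Two Luigi tasks wired together through requires()/output()."""
import luigi

class ExtractOrders(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/orders_{self.date}.csv")

    def run(self):
        # A real job would pull from the source system; this just writes a stub.
        with self.output().open("w") as f:
            f.write("order_id,total\n")

class DailyRevenue(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # The dependency chain lives here; Luigi runs ExtractOrders first.
        return ExtractOrders(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/marts/revenue_{self.date}.csv")

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            rows = raw.readlines()[1:]
            total = sum(float(line.split(",")[1]) for line in rows)
            out.write(f"{self.date},{total}\n")

# Run with: luigi --module this_module DailyRevenue --date 2018-11-01 --local-scheduler
```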
[00:28:49] Unknown:
I don't really have anything to say on that front. It's great when it works; often it doesn't. And in terms of your preference of build versus buy for the tools that you're using, both for the workflow engine and the different storage and processing layers, how has that changed over the course of your career and at different scales of operation?
[00:29:15] Unknown:
I would say my opinion on build versus buy is that whatever is critical to your business, whatever is your core business, you should always build. Nobody's going to know the business as well as you do. Anything that's ancillary to that, you should buy. And I don't think that stance really changes at different scales, but what becomes critical to your business can change at scale. I would definitely say any sort of logic for pulling data in from source systems: the source systems have some criticality inside your organization or your business, and those have usually been customized. So if you're support heavy and you're using something like Zendesk to manage some sort of ticket flows, and you have custom things built inside of there, you might want to build the API extraction from Zendesk into your data warehouse yourself, just because you might need to know what some of those custom data points mean. But something like a workflow scheduler doesn't really affect the business, or even just cluster management; we use EMR at Grubhub, and cluster management is not necessarily core, so we can buy that. And the definition of buy in this context has even become fuzzy with the proliferation of different open source tools and frameworks. So I'm wondering what your thoughts are on
[00:30:41] Unknown:
what qualifies as buying versus building these days, because that line is getting pretty blurry. Yeah, it is. Sometimes
[00:30:49] Unknown:
open source is great, especially in the big data world, for having a lot of the tools that are necessary already built. But I would say sometimes it doesn't necessarily have all the polish of what you would get if you buy something. So it really depends on how critical this is to your business and where the failing points are. A good example would be something like Redash, which I think you mentioned earlier. It's open source, which is really good for getting up and running; you can give people access to the data using Redash without putting too much effort into any sort of sales cycle or evaluation period, but it's missing a lot of features that a lot of business users, and I myself, would actually like. So sometimes buying may be beneficial there. And it doesn't have to be an all or nothing solution either. You can mix and match.
[00:31:48] Unknown:
And in your current role in particular, but also in your past experiences, what are some of the types of dead ends or edge cases that you've had to deal with in terms of building and managing and growing these different data pipelines? So I'd say
[00:32:04] Unknown:
one sort of mistake that I've seen is that people can end up being siloed. I talked a little bit about how you can decouple your data pipeline, and you have your transactional system, your data lake, and your curated data assets; people can end up being siloed into those and not necessarily thinking about the data pipeline as a whole. It's very natural. You have a software engineer working on a transactional system, and if you don't have any sort of well defined interface between that transactional system and the analytics system, it's very easy for the software engineer to just not think about the analytical system, because it's not something they need to. And similarly on the analytical side, it's very easy to say: well, I'm loading data into the data lake, and that's all I really need to look at, and then forget that people are going to need to pull data out of it for different use cases.
Even when you're building curated assets, it's important to note why you're building them. Sometimes it can be easy to overlook, and my view is that the reason we're building this at all, one of the most pressing use cases, is really to make sure that you can analyze and track your business. You can start adding value back into the business too, once you start talking about data science models and being able to do feedback loops into the transactional system. But you have to make sure that people understand that at every stage you're building towards a holistic pipeline. If you lose sight of that, sometimes you'll do duplicate work or extra work, or things can be brittle or break. So I would say that's one thing I'd be wary about. And
[00:33:51] Unknown:
what have you found to be some of the common
[00:33:54] Unknown:
edge cases that you have run up against or overlooked aspects of building these data pipelines? I'd say some edge cases are just in terms of scale; I can give some examples of things that have happened. This wasn't directly on my team, and I forget some of the specifics, but there was definitely an ETL process where, with how it was processing things, we ended up failing because we hit the max int number. That was definitely something we didn't account for. Then we have other sorts of edge cases in terms of the business. I think I was talking about validation frameworks before; in the Grubhub business, on Thanksgiving,
our volume drops off quite a bit. We have had, in the past, all sorts of validation warning bells going off, and it was normal, because people were just eating dinner at home. And the other thing, which I wouldn't say is an edge case particular to Grubhub, is just an airing of my grievances with time zones in general: I'm not a fan. They cause problems everywhere, particularly for these kinds of use cases. I mean, it's easy to store UTC, but we deal with times everywhere. When does your job kick off? If you have a time based scheduler, when does it kick off? That changes mid-year. But then, if you're storing everything in UTC, not everybody is going to want to look at it in UTC, and how do you make sure that you're exposing
[00:35:17] Unknown:
the right times to the right person? So time zones: not a fan. Yeah. There's a great list of falsehoods that programmers believe about time, most of which are contradictory to the other items in the list, so I'll add that to the show notes because it's always good for a laugh and a groan. And what are some of the plans that you have going forward for improving the pipeline that you're building at Grubhub and trying to bring it to the next level of scale and resiliency?
[00:35:48] Unknown:
Yeah. Some of the things we've touched on: we talked a little bit about how we have our two data warehouses. We're going to start leveraging streaming even more. We have some streams in place, but it's gotten to a point where we really need to use them more in order to scale efficiently, so that's definitely a hot item that we're going to do. The second piece is less about the pipeline in general and more about metadata about our pipeline. Another thing that we've seen is that as all these transformations are happening and data is moving, it gets really hard to know, if I'm looking at this particular column in this particular row, how did that data get there, and what does it actually mean?
What could have gone wrong along the way? So we're really putting in more pieces around that, whether that's data lineage or data dictionaries. That's another piece where we're going to be improving our pipelines. And are there any particular
[00:36:48] Unknown:
references or resources that you have found particularly useful over your career, or anything that you recommend for anyone who's looking to build and design a new data platform or new data pipelines?
[00:37:02] Unknown:
Oh, yeah. First, I would plug my blog entry on bytes.grubhub.com. It's high level and gives you some broad ideas as to where to go, but it's good for getting in the right mindset. Personally, Airbnb has some great articles that I've looked at. I've also had some really good success participating in local data user groups and talking with people about what they use. I could probably dig up some more references after this that we could put in the show notes; I'm blanking on any particular article that I would recommend, but I can try to look for some of them and maybe we can post them. Yeah. We'll definitely include those in the show notes. And are there any other aspects
[00:37:45] Unknown:
of building data pipelines and scaling organizations and applications and technology stacks that we didn't cover yet which you think we should discuss before we close out the show? I think that was most of what we said we were going to talk through. And so, for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I think this goes back to one of the improvements that we want to do:
[00:38:21] Unknown:
particularly when you're talking about open source tools and big data tools, a tool that helps holistically tie data lineage together has been really tough to find. There are definitely some out there, but they don't integrate with everything, or they're really hard to integrate with everything. I would say that's probably the one big piece that I'd be looking for. We have all these great tools for how to measure our business, but for how to measure the measurement, I've been having trouble finding
[00:38:54] Unknown:
really great tools for that. Alright. Well, thank you very much for taking the time today to join me and discuss your experiences of building and scaling data pipelines and organizations. It's been a useful conversation for me, and I appreciate that. I hope you enjoy the rest of your day. Yeah, you too. This was great. It was fun. Thank you.
Introduction to Christian Heinzmann and Grubhub's Data Engineering
Challenges of Managing Multiple Data Warehouses
Defining and Building Data Pipelines
Scaling Data Pipelines for Growing Businesses
Minimizing Impact on Source Systems
Transitioning to Large Scale Data Systems
The Role of Data Lakes in Analytical Infrastructure
Ensuring Data Quality During Transformations
Selecting Workflow Engines and Tools
Common Pitfalls and Edge Cases in Data Pipelines
Future Plans for Grubhub's Data Pipeline
Resources and Final Thoughts