Summary
Yelp needs to be able to consume and process all of the user interactions that happen on their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he were starting over today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Justin Cunningham about Yelp’s data pipeline
Interview with Justin Cunningham
- Introduction
- How did you get involved in the area of data engineering?
- Can you start by giving an overview of your pipeline and the type of workload that you are optimizing for?
- What are some of the dead ends that you experienced while designing and implementing your pipeline?
- As you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house?
- What are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them?
- What are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy?
- While you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced?
- Did you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed? What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public?
- What advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project?
Keep in touch
Links
- Kafka
- Redshift
- ETL
- Business Intelligence
- Change Data Capture
- LinkedIn Data Bus
- Apache Storm
- Apache Flink
- Confluent
- Apache Avro
- Game Days
- Chaos Monkey
- Simian Army
- PaaSta
- Apache Mesos
- Marathon
- SignalFx
- Sensu
- Thrift
- Protocol Buffers
- JSON Schema
- Debezium
- Kafka Connect
- Apache Beam
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at www.dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. And to help support the show, you can check out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey. And today, I'm interviewing Justin Cunningham about Yelp's data pipeline. So Justin, could you please introduce yourself?
[00:00:59] Unknown:
Sure. I'm Justin Cunningham. I lead the data pipeline team at Yelp. I'm a software engineer here. I've been at Yelp for about three and a half years, and I started the data pipeline project and have seen it through to where it is today.
[00:01:13] Unknown:
And how did you first get interested or involved in the area of data engineering or data management?
[00:01:19] Unknown:
So I first got involved in data engineering and data management primarily driven by application needs. I started off effectively working on experimentation infrastructure, and data engineering and experimentation are really interrelated, because to run experiments effectively you need access to rich and accurate data. In addition to that, at Yelp I started working on data warehousing and also on our metrics system, and we had a lot of duplicative infrastructure between those two projects. So I was able to get started by taking a look at what infrastructure we had and trying to rationalize it into a more robust, more centralized piece of infrastructure.
[00:02:02] Unknown:
And can you start off by giving a brief overview of the overall pipeline and the kinds of data and workload that you're optimizing for?
[00:02:12] Unknown:
Sure. Our overall pipeline is centered around Apache Kafka and a schema store. The schema store has information on which topics or streams we're publishing to, and what schemas are used to encode the data that goes into those streams. Outside of that core, we have a variety of other applications that are accessing the core, writing data into it and reading data out of it. You could roughly divide those into consumers, producers, and transformers. Consumers are applications that are consuming data from Kafka, from the data pipeline, and doing something with it. A good example of that would be our Salesforce connector that reads data from Kafka and writes it into salesforce.com through their APIs. We also have a connector that reads data from Kafka and writes it into Amazon Redshift, and another that reads data and writes it to Amazon S3. On the producer side, we have applications that are producing data into Kafka. Concretely, a couple of examples there would be a producer that reads data from the MySQL change log and produces it into Kafka so that we're able to access that data and see what changes have happened, so it's effectively a change data capture system. Outside of that, on the transform side, we have systems that read data from Kafka and write it back to Kafka, doing some sort of intermediate transformation. We use this to decorate data, to make it more legible externally if it's coming from an internal system with some kind of specialty encoding, and generally just to clean up data or make it more accessible.
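To make the consumer/producer/transformer split concrete, here is a minimal, hypothetical transformer sketch using the confluent-kafka Python client: it reads from one topic, lightly decorates each record, and writes the result to another topic. The topic names and the decorate() helper are illustrative assumptions, not Yelp's actual code.

```python
# Hypothetical sketch of a "transformer": read from one Kafka topic,
# apply a light transformation, and write the result to another topic.
# Topic names and the decorate() helper are illustrative assumptions.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "business-decorator",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["raw.business_updates"])

def decorate(record: dict) -> dict:
    # Example transformation: turn an internal encoding into something
    # more legible for downstream consumers.
    record["is_open"] = bool(record.get("open_flag", 0))
    return record

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        producer.produce(
            "decorated.business_updates",
            value=json.dumps(decorate(record)).encode("utf-8"),
        )
        producer.poll(0)  # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```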
[00:03:49] Unknown:
From reading through the blog posts that you've written about the overall process of coming to this final architecture, I saw that you originally had a more monolithic and ad hoc method of managing the data throughout your system. So I'm wondering if you can describe some of the impetus for tackling this overall project and the rearchitecture of your stack and the way that data flows through it.
[00:04:11] Unknown:
Absolutely. So early on, the team that I was on originally was called the BAM team. It stands for Business Analytics and Metrics, and that team maintained a few different systems. One of those systems was Yelp's data warehouse, which was effectively ETLing data out of the primary Yelp database and then putting it into a Redshift database so that we could access it more efficiently for analytics queries. Another piece of infrastructure that the BAM team maintained was something called the Salesforce field fillers, an infrastructure system designed to move data from the Yelp primary databases into Salesforce. And the third system that the BAM team maintained was a metrics system, which was calculating some aggregate values on top of the Yelp primary database and actually reinserting them into the primary database for display.
It was kind of a business intelligence system, effectively computing metrics on top of the Yelp databases to try to figure out how the business was performing. The rationalization step there was effectively looking at what system overlaps there were. In principle, both the Salesforce connector that we had and the data warehouse ETL system were very, very similar: they were reading data out of the primary databases, doing light transformation on it, and then putting it somewhere else. While they were very similar in concept, they weren't sharing any code or any infrastructure. So we were maintaining two sets of infrastructure to read and transform data even though they were doing basically the same thing, and each one had a different set of limitations. So doing something about that started to make a lot of sense. The other thing that we were looking at was the metrics system, which was, as I said, just reading from the primary databases and calculating aggregates.
Well, it's not really efficient to calculate aggregates and do analytics queries against MySQL, and we already had a data warehouse with Redshift. So it made sense to start migrating our analytics queries on top of Redshift, where we could get better query performance, do better exploratory queries, and answer our questions a little more quickly and correctly than we could on top of MySQL. So in effect, we were able to go from having three disjoint systems to having one unified system where we're doing change data capture and putting that change data into output systems, where the output systems would be Redshift and Salesforce, to replace our existing systems, and then moving the metrics problem that we had before from running directly on top of MySQL to running on top of the data warehouse, where it can run a lot more efficiently.
[00:07:02] Unknown:
And while you were going through the process of picking all the different pieces of the pipeline and figuring out how they were going to hang together, I've got a couple of questions. One of them being: what are some of the dead ends or false starts that you experienced while going through that overall process and trying to come up with a workable design?
[00:07:21] Unknown:
So for the dead ends and false starts that we experienced, I think we actually did a really good job of minimizing the number of those, and I think we were able to do that pretty effectively by doing a lot of pre-planning. One of the things that I did before we started the project was read a lot of academic papers, mostly coming out of LinkedIn, around their data infrastructure and how they'd solved a lot of similar problems. In particular, LinkedIn had a system called Databus that was capturing the stream of changes from databases and then making them available.
And after that, they'd written Kafka and done a lot of log processing with Kafka. It seemed like what we wanted was somewhere in between Databus and Kafka. In exploring that space, it became clear that a lot of the new Kafka features were actually targeted at being able to do the kinds of things that you could do with Databus, in particular log compaction. So I think in general our overall system architecture hasn't changed substantially since it was laid out in the initial design docs. We actually didn't have very many false starts on this.
Maybe one of the bigger ones that comes to mind is that we were planning on using Apache Storm for most of the real-time processing. We ended up moving off of Apache Storm because we couldn't quite get enough performance out of it in Python, and we ended up doing a little bit of custom stuff. At this point, we're looking at using Flink to work around that. So I think we got not very far down the road with Storm, but certainly a little bit further than I'd like with something that we ultimately ended up not using. But otherwise, I think we actually did a remarkably good job of laying out a fairly comprehensive plan early on and then not deviating from it very much.
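As a side note on the log compaction feature mentioned here, the sketch below shows one hypothetical way to create a compacted Kafka topic from Python with the confluent-kafka admin client; the broker address, topic name, and sizing are assumptions for illustration.

```python
# Hypothetical sketch: create a log-compacted Kafka topic, the feature that
# lets Kafka cover many Databus-style change-capture use cases by keeping
# only the latest value per key. Broker, topic name, and sizing are assumptions.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "business_updates.compacted",
    num_partitions=8,
    replication_factor=3,  # assumes a cluster with at least three brokers
    config={"cleanup.policy": "compact"},
)

# create_topics() returns a dict of topic name -> future
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()
        print(f"created {name}")
    except Exception as exc:
        print(f"failed to create {name}: {exc}")
```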
[00:09:18] Unknown:
And while you were going through the process of coming up with that plan, did you go through a lot of iteration cycles of getting buy-in from the team and reviewing the plan before you started doing any implementation steps, to try and see if people could come up with weaknesses? And then also, I guess, how far outside of your immediate team did you go to try and do requirements gathering from the final consumers of the data across the business?
[00:09:44] Unknown:
Yeah. So overall with this system, there was a very, very extensive planning process, and I think this is true at Yelp in general. On our engineering team, we do something called SEPs. A SEP is a code enhancement proposal; it comes from Python enhancement proposals, because we're primarily a Python shop. The basic idea is that if you want to build some kind of system, you do a relatively short write-up. I think the data pipeline SEP was less than 10 pages on what the system will be, what the requirements are, what problems it's trying to solve, and how it will solve them. It's not really a technical document necessarily, or at least it doesn't go into lots and lots of technical detail; it's more about the requirements and how the problem will be solved. For the data pipeline, there were quite a few teams involved in the planning process and in the feedback process. I would estimate probably 15 different teams were involved at some point or another in either requirements gathering or signing off. One of the other things that we did on that project, which I think was a little unusual at the time for Yelp at least, was reaching out to other companies in the space and seeing what other people were doing. We invited some folks into Yelp to talk a little bit about what had gone well for them and things that they'd change. And I think in doing that, we were able to expose a lot of the caveats, and I think that's actually why the project didn't have that many real speed bumps: we were able to expose so many things that other people had tried, and what they succeeded at really well or didn't succeed at as well, very early on in the process. So we were able to make lots and lots of design-time adjustments.
And because there was so much feedback across the organization, I think we were able to get a lot of different perspectives on what the different problems were and how we would solve them. Actually, this project started immediately as a cross-team effort. Our spam team was as interested as the BAM team in solving some of the challenges around moving bulk data around, and I think we on the BAM team concretely had really good use cases for it. So we had a wide swath of things that we'd want to be able to accomplish with the system, but there were definite considerations from other teams involved as to what they would need in the final product for it to be useful.
[00:11:58] Unknown:
And as you were going through the design process, figuring out what the different elements were in the overall pipeline and how the data was going to flow through it, and trying to figure out which components were going to fill the various requirements, what were some of your criteria and your decision-making process around the overall build versus buy? And which are the pieces that you ultimately decided to build in-house because nothing available specifically met your requirements?
[00:12:31] Unknown:
Build versus buy for this project is kind of an interesting discussion, because at the time that we started the project there wasn't much of a decision for almost all of the components. What I mean by that is that when we first went down this path, there wasn't much of a space around it. There weren't a lot of products that you could buy. Confluent, the company that does Kafka, didn't exist yet. So there wasn't really much of an ecosystem at all that we could rely on, and we ended up building a lot of the core infrastructure. There were some areas where we were able to use things that we didn't really buy per se, but that were otherwise off the shelf: we relied on a lot of open source Python tools and packages.
I think that went pretty well, effectively just leveraging some of the community's work on Kafka compatibility, and we use Apache Avro, all of that kind of stuff. But for the core system components, just because there was so little in the ecosystem at that point, there wasn't a whole lot that we could use that was really off the shelf and ready to go. That said, if we were to repeat this project today, there's a lot more development in the ecosystem, and we would almost certainly end up using a lot more of the ecosystem components, building on top of them, extending them, or otherwise adding support. I think one of the themes on this project today is effectively looking at the ecosystem and trying to figure out what pieces we can incorporate now, or, for some of our components, whether we can even replace them with things that have developed in open source or commercially in the interim that would make the system more robust or faster or more reliable. And I think the space itself is getting a lot more interesting from the perspective of having both really, really good open source projects, like Apache Flink, for example, and also having more commercial support, like with Confluent and a lot of the tools that they've built, like their schema registry, Kafka Connect, etcetera.
[00:14:27] Unknown:
And how long did the overall process take from the point where you decided that you needed to do an overhaul of your data management infrastructure to the point where you said, okay, this project is essentially complete, for whatever definition of complete you decide to subscribe to?
[00:14:51] Unknown:
That's a great question. So our project, I think, is probably still not what I would consider complete, at least from an entire-system standpoint. That said, I think the rough timeline looks like this: overall, we were able to get something running in production within about three months of starting. We took a little bit of a different development path with this product than we have with other things, and we started by building a kind of hackathon version of the project as a whole that we called the tracer bullet. It was effectively supposed to just identify what the unknowns were in the system and give us a baseline to work with and figure out how to work around them. What we did is we effectively set aside a quarter, and the whole team worked on building out end-to-end support for one application: streaming the Yelp business table, and updates to the Yelp business table, into Amazon Redshift. We built every component that you would need to do that in at least a quasi-realistic way, in the sense that it was generalized, but we only built enough generalized features to work with that specific table, going into Redshift, with only the schema support that we would need for that data. And we actually learned a lot from that process about what we needed to do to support schema transformations, exactly how hard it would be to support different kinds of schema registration, and what the schema store would look like. One of the things that we ended up doing at the end of that project was really discarding all of the software that we built and then rebuilding it in a much more reliable way. But the things that we learned from having hacked it together were really, really invaluable in figuring out how the whole project and whole ecosystem should really work together.
So getting through the tracer bullet stage and getting that hackathon version of the project built took about three months, and we were able to stream, at that point, one table from the database into Redshift. Getting from the tracer bullet to an actual production application running on arbitrary data took about another nine months. The first application that we ended up really building was one that was streaming arbitrary database changes into Salesforce so that our sales teams would have accurate business information when calling out to businesses, which is pretty important at Yelp since we have a relatively large sales team, and we want to make sure that the information they have is as accurate and as real-time as possible, so that when they're calling businesses, they really have the information that they need to be able to sell them and present Yelp well. So I still wouldn't really consider that a complete system, and that's after a year of development, because we were only really able to support one use case end to end. It took about another six months from that point to have a second real use case online, which was our Redshift connector.
And by combining the Salesforce connector with the Redshift connector, all of a sudden all of the data that we were capturing from the databases was able to stream into Salesforce; it was also able to stream into Redshift and start replacing a lot of our data warehousing needs. I think simultaneously we were able to start looking at that and saying, wow, this system is now replacing these legacy systems that we built up. And we were also starting to see some network effects from having these two disparate connectors: if you write data in this format, you get access to both of them. From there, we've just been building out the ecosystem more and more, building lots and lots of auditing tools and correctness tools around it, and lots of bug fixing, scalability, and performance improvements, that kind of thing. So I guess you could say, to some extent, I would consider it to have been a viable platform and ecosystem at about the 18-month mark, but still not really done.
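To give a rough sense of what a connector like the Redshift one involves, here is a small, hypothetical sketch of a common pattern: batching Kafka messages to S3 and issuing a COPY into Redshift. The bucket, table, topic, and credentials are placeholder assumptions, and this is not Yelp's actual connector.

```python
# Hypothetical sketch of a Redshift "connector" consumer: batch Kafka messages
# into a file on S3, then COPY it into Redshift. Bucket, table, credentials,
# and topic names are illustrative assumptions, not Yelp's implementation.
import json
import boto3
import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "redshift-loader",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["decorated.business_updates"])

def flush_batch(rows):
    """Write a batch to S3 as newline-delimited JSON and COPY it into Redshift."""
    if not rows:
        return
    body = "\n".join(json.dumps(r) for r in rows)
    key = "batches/business_updates.json"
    boto3.client("s3").put_object(Bucket="example-pipeline-bucket", Key=key, Body=body)

    conn = psycopg2.connect(
        "dbname=warehouse host=example-redshift user=loader password=example"
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            "COPY business_updates FROM %s "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
            "FORMAT AS JSON 'auto'",
            ("s3://example-pipeline-bucket/" + key,),
        )
    conn.close()

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    batch.append(json.loads(msg.value()))
    if len(batch) >= 10000:
        flush_batch(batch)
        consumer.commit()  # only commit offsets once the batch is safely loaded
        batch = []
```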
[00:18:44] Unknown:
Yeah. Software is never done until you shut it off and walk away and move on to something else.
[00:18:50] Unknown:
Yeah. Exactly.
[00:18:51] Unknown:
And so you're saying that a decent part of the recent work has been identifying and fixing bugs. So what are some of the different failure modes that you have experienced in the various pieces of the pipeline that you've built up? And what are some of the ways that you, one, identified those failures, and two, engineered around them?
[00:19:10] Unknown:
Failure bugs with this system are really interesting, and I think it's worthwhile to think about the numbers a little here. We do billions of messages a day throughout the whole ecosystem. So if you're talking about seeing a one-in-a-million event, we see one-in-a-million events several thousand times a day. We have lots and lots of opportunities for really, really rare edge cases to get hit. One of the things that we did early on, to help identify performance issues and other kinds of bugs and regressions, was build a pretty robust canary system, where we actually maintain two copies of most of our production infrastructure. We actually roll out more things with the canaries than we do with the actual production instances, because it gives us more feedback on what's working and what's not. What I mean by that concretely is that we actually run more data through our canary than we run through prod. On prod, we're only looking at the things that we're actively using. In the canary, we try to push everything through it, and that gives us early feedback on what kind of data we can support, what kind we can't support, and what the scalability will look like when we add more stuff to prod. We also introduce some probabilistic failure into our canaries, so that as they're running, they're forced to fail a little more frequently, and that gives us some pretty nice feedback on what the failure and recovery mechanisms look like in practice. For example, if we're introducing bugs in recovery, then by introducing a little bit of probabilistic failure into our canaries, we're likely to see the recovery problem within a few days. Our actual production systems don't fail that often, probably only every couple of weeks, perhaps.
Our canary is designed to fail every 20 to 30 minutes or so, just probabilistically, and that gives us some decent feedback as to how recovery happens, whether any issues surface in there, and what that looks like. I would say almost all of the actual production issues we see are really around either losing data or duplicating it excessively, and that really boils down to state saving, recovery, and error handling. I think those are really, really challenging problems to solve with distributed systems in general. A lot of our focus as a team has been on building up knowledge around how to resolve those issues, and building up tooling so that going from some kind of known problem, or at least a known symptom, back to what the actual problem is becomes something we can do pretty easily; and otherwise, getting enough introspection into the system that we can understand whether it's mostly working or not. So I think there's still a lot that we could do there, but I think we've gone pretty far down the path of doing things kind of like what Netflix was doing with Chaos Monkey, and that's been fairly helpful in identifying problems. Otherwise, just keeping systems online, running lots and lots of systems, and running lots of events through them all helps expose issues early and often.
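To illustrate the probabilistic failure injection described here, below is a minimal, hypothetical sketch of a canary worker that randomly crashes itself so that restart and recovery paths get exercised far more often than they would in production. The target failure rate and the process_message() placeholder are assumptions for illustration.

```python
# Hypothetical canary failure injection: randomly crash the worker so that
# restart/recovery code paths are exercised far more often than in production.
# The target mean time between failures and process_message() are assumptions.
import random
import sys
import time

MEAN_SECONDS_BETWEEN_FAILURES = 25 * 60  # aim for a crash roughly every 20-30 minutes
POLL_INTERVAL_SECONDS = 1.0

def maybe_fail():
    # Each iteration fails with a small probability chosen so that failures
    # arrive on average once per MEAN_SECONDS_BETWEEN_FAILURES.
    failure_probability = POLL_INTERVAL_SECONDS / MEAN_SECONDS_BETWEEN_FAILURES
    if random.random() < failure_probability:
        print("canary: injecting failure, exiting uncleanly", file=sys.stderr)
        sys.exit(1)  # the scheduler (e.g. Marathon/Mesos) restarts the task

def process_message():
    # Placeholder for real consumer work (poll Kafka, transform, produce, ...).
    time.sleep(POLL_INTERVAL_SECONDS)

while True:
    maybe_fail()
    process_message()
```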
[00:22:18] Unknown:
And what are some of the tools and processes that you're using to automate the deployment and maintenance of the various pieces of the overall pipeline? And what are some of the techniques and tooling that you're using for monitoring the overall availability and accuracy of the systems and the data that flows through them?
[00:22:37] Unknown:
So for deployment and uptime, the solution that we've come to for availability as a company has been building out a platform as a service. We actually open sourced that a while ago; it's called PaaSta. Effectively, what it does is take Docker containers and deploy them with Marathon and Mesos, and almost every service at Yelp today, and actually I think it may be every service at this point, is deployed on top of that platform. What that allows us to do is treat any piece of code we have as a container, with all of the benefits that that entails. During our build process, we actually spin up a bunch of containers that simulate, or in many cases are, the real thing. So we spin up real Kafka, real ZooKeeper, and we run integration tests with the actual containers that we're going to deploy to production, and only if those containers pass the integration tests do we end up deploying them. We do that for almost every service at Yelp, so it's not something that's very specific, but I think it works reasonably well at catching these kinds of problems: you're running the actual code that you're going to deploy, where there are no ecosystem or environment differences; you're just taking the same container and sending the same bytes out to the machines that are going to run them. That's worked pretty well. On the application side, for monitoring, we use something called SignalFx to get graphs, and we generate lots and lots of time series charts with all kinds of different stats about the system. That could be things like how far behind we are, meaning the difference between the current time and the event time that the system is processing, and additionally things like throughput inside of our consumers and producers: how many messages we're doing per second, and what schemas are associated with those messages. So we can say, for example, that we're reading in 5,000 messages a second with this schema ID, and we're outputting 50 messages a second for this schema ID from this worker, and then we get some nice time series graphs showing, in processing time, how these systems are actually operating. We also do some things with a technology called Sensu. That's what we use for our alerting, and it effectively does active alerting based on different things that we trigger from application code. For example, one thing that we trigger on is how far behind event time we are. Inside of our producer, it looks at the event time that's set inside of our messages and the current time, and if those drift too much, we can configurably send an alert and page the team that owns the infrastructure.
I think at this point, one of the things that we're starting to get into more and more is looking at additional auditing that we could be doing. What we're probably going to start doing is writing auditing messages from our consumer and producer implementations. They would log, for every event time window, how many messages we're consuming, how many we're producing, and how that maps to upstream and downstream systems. And I think with that information, we'll be able to get a more holistic view of how messages are flowing through the system, where we're missing data or where we're duplicating it, and how effectively an upstream and a downstream map together and what that looks like.
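As a rough sketch of the event-time lag alerting and the per-window audit counts described above, here is a hypothetical example; the metric and alert helpers, thresholds, and window size are assumptions rather than Yelp's actual monitoring code.

```python
# Hypothetical sketch: track event-time lag and per-window message counts
# inside a consumer. Metric/alert helpers and the window size are assumptions.
import time
from collections import Counter

EVENT_TIME_LAG_ALERT_SECONDS = 15 * 60   # page if we fall this far behind
WINDOW_SECONDS = 60                      # audit counts per one-minute event-time window

consumed_per_window = Counter()

def emit_metric(name, value, tags=None):
    # Placeholder for a real metrics client (e.g. a SignalFx/statsd emitter).
    print(f"metric {name}={value} tags={tags}")

def send_alert(message):
    # Placeholder for a real alerting hook (e.g. a Sensu check result).
    print(f"ALERT: {message}")

def record(event_time_epoch_seconds, schema_id):
    lag = time.time() - event_time_epoch_seconds
    emit_metric("pipeline.event_time_lag_seconds", lag, tags={"schema_id": schema_id})
    if lag > EVENT_TIME_LAG_ALERT_SECONDS:
        send_alert(f"consumer is {lag:.0f}s behind event time for schema {schema_id}")

    window_start = int(event_time_epoch_seconds // WINDOW_SECONDS) * WINDOW_SECONDS
    consumed_per_window[(window_start, schema_id)] += 1

def flush_audit_messages(produce_fn):
    # Periodically publish per-window counts so upstream and downstream
    # totals can be reconciled by a separate auditing job.
    for (window_start, schema_id), count in consumed_per_window.items():
        produce_fn({"window_start": window_start,
                    "schema_id": schema_id,
                    "messages_consumed": count})
    consumed_per_window.clear()
```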
[00:25:58] Unknown:
And while you were going through the overall process of switching out some of the production components as you were building out these new pieces, what were some of the ways that you made sure you weren't introducing new failures or new error cases into the system, and how did you verify that the data you were getting out was the same as before, or better?
[00:26:21] Unknown:
Fortunately, we already had a number of systems in place to audit what I would call data reliability across systems. Because we were already doing some light ETL, we already had systems that could look at, say, an input and an output and at least say at a high level whether they were pretty close to each other. So, initially, we were able to leverage a lot of that. What we've gotten more and more into as time has gone by, and this is extremely helpful for debugging, is auditing batches and auditing processes that can look at a specific system's inputs and outputs and do some static verification on those. An example of that would be something like a transformer that we're running from a Kafka topic to a Kafka topic. What we might do is write a batch that looks at the input Kafka topic and the output Kafka topic, tries to match up the messages between them, and checks that there aren't things that are drifting. Overall, what we'd like to do with that kind of thing is use it for debugging, so that in the event we find some missing data, we can trace back and see exactly where in the process the data went missing.
We'd also like to start scheduling some of those things and running them, probably probabilistically, on top of the overall data. So we could say: let's take the 30 or 40 most important downstreams, trace back through their dependencies, and, with some fairly low probability, check those dependencies every day and see that whatever we got in, we actually got out, just doing some sampling to see if we're actually maintaining the invariants that we've set for ourselves. So it's kind of a double-duty approach for a lot of the tooling that we've been running for auditing: one, to make sure that we're not losing data as the systems are running; and two, if we do have some kind of data quality problem, to be able to quickly debug it. And once we identify where the problem is, to ideally work backwards from "we're missing this data, which came in this batch" to "here's where in the operational logs this data went missing, and here's the concrete reason why that happened," so that we can further solidify the system.
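Here is a minimal, hypothetical sketch of the kind of topic-to-topic audit batch described above: it reads a bounded slice of an input topic and the corresponding output topic and reports keys that appear on one side but not the other. The topic names, key field, and read limits are illustrative assumptions.

```python
# Hypothetical audit batch: compare message keys between an input topic and
# the transformer's output topic over a bounded read, and report drift.
# Topic names, group id, key field, and the read limit are assumptions.
import json
from confluent_kafka import Consumer

def read_keys(topic, max_messages=100_000):
    """Read up to max_messages from a topic and return the set of record keys."""
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": f"audit-{topic}",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
    consumer.subscribe([topic])
    keys, idle_polls = set(), 0
    while len(keys) < max_messages and idle_polls < 10:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            idle_polls += 1
            continue
        idle_polls = 0
        record = json.loads(msg.value())
        keys.add(record["business_id"])  # assumed primary key field
    consumer.close()
    return keys

input_keys = read_keys("raw.business_updates")
output_keys = read_keys("decorated.business_updates")

missing_downstream = input_keys - output_keys
unexpected_downstream = output_keys - input_keys
print(f"{len(missing_downstream)} keys missing downstream, "
      f"{len(unexpected_downstream)} unexpected keys downstream")
```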
[00:28:38] Unknown:
And one piece that I've been wanting to dig into a bit more is the overall schema management and data serialization. I noticed that you ended up settling on Avro as the transmission format, so I'm wondering if you can dig a bit into what some of the other serialization formats were that you considered for the project, and what made Avro the right choice for you.
[00:29:02] Unknown:
For serialization formats, we looked at a number of things. I think Avro was an early leader, primarily because of the amount of ecosystem support, especially around Kafka itself, and in particular because of LinkedIn's experience with Avro. So I personally was pretty set on using Avro fairly early on because of the experience that LinkedIn had had with it. That said, we had a pretty extensive project looking at different schema formats, and we considered Protocol Buffers and Thrift, and also JSON and a variety of things that use or provide schemas for JSON files. Ultimately, there were a couple of things that we really liked about Avro, which is why we ended up going with it. The first, and maybe one of the most significant, is that with Avro the schema is an artifact, and it's really a piece of data in itself. An Avro schema is literally just JSON that describes what the data looks like. With it being just JSON, we're able to store it and transmit it as JSON from an HTTP service, which fits really nicely with Yelp's general SOA and RESTful style of services. And, really, that's all our schema store is: a RESTful service that returns JSON, where the JSON is an Avro schema.
And that fits really well with the pattern that we have here. As we looked into some of the other serialization formats, occasionally what we'd want to do, or maybe even in some cases be required to do, is provide something more like a git repo with a bunch of compiled schemas in code, where you would have serialization and deserialization code. If you wanted to use a schema, you'd have to include that package, pull in the schema code, and use that for encoding and decoding. But one of the things that we really wanted to accomplish with the system overall was allowing a consumer to decode a message with a schema that didn't exist when the consumer started. So in order to support that, the consumer really needs to be able to fetch the schema, or fetch code that could decode the message, after it starts running.
And I think it's kind of undesirable to fetch actual code and exec it, or something like that, compared to fetching an artifact like a JSON doc that represents a schema. So that's how we landed on that. The other thing that I would point out about Avro, which I think we were really keen on, is the different kinds of casting support. One of the things that you can do with Avro is give it a different reader schema and writer schema. The writer schema is the schema that was used to originally encode the data, and the reader schema can be any compatible schema. What happens is that Avro internally figures out how to map however the data was written into however you're trying to read it. And we use that to provide some invariants on our Kafka topics.
What we're able to really do is say that if you start reading from a Kafka topic with a reader schema, any schema that ever writes into that topic will be compatible with all of the reader schemas that have ever been used with that topic. So you'll never get into a situation where you have data in that topic that you can't continue reading. That lets us do a lot of really powerful things around compatibility. One of the biggest issues that I think we faced previously at Yelp, with data processing in general and log processing with JSON in particular, was schemas changing without the downstream consumers knowing. And this more or less completely resolves that problem, because you're always guaranteed to be able to continue reading that data.
We do have some workarounds for dealing with incompatible schema changes, but those are very explicit, compared to making an implicit change to a JSON file and then eventually having a consumer break in production. This is the kind of thing that we can actively alert on instead of, for example, waiting until it's a real production issue.
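To make the reader/writer schema resolution concrete, here is a small, hypothetical example using the fastavro library: a record encoded with an older writer schema is decoded with a newer, compatible reader schema that adds a defaulted field. The schemas and field names are illustrative, not Yelp's actual schemas.

```python
# Hypothetical example of Avro schema resolution: data encoded with an old
# writer schema is decoded with a newer, compatible reader schema.
# Schemas and field names are illustrative assumptions.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

writer_schema = parse_schema({
    "type": "record", "name": "Business",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# A later, compatible version: adds a field with a default value.
reader_schema = parse_schema({
    "type": "record", "name": "Business",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "is_open", "type": "boolean", "default": True},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"id": 42, "name": "Example Cafe"})
buf.seek(0)

# Avro resolves the writer schema into the reader schema: the missing
# is_open field is filled in from the reader schema's default.
record = schemaless_reader(buf, writer_schema, reader_schema)
print(record)  # {'id': 42, 'name': 'Example Cafe', 'is_open': True}
```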
[00:33:10] Unknown:
And throughout the course of performing all of this design work and implementation, was the original plan to open source the end result, or was that a decision made as you came closer to a more cohesive system and realized that there were pieces that could be more generally useful outside of Yelp's particular use cases? And then once you did come to the conclusion to open source what you had, what were some of the challenges associated with making sure that the pieces you were releasing were properly structured so that they could be used by people outside of Yelp without necessarily having to replicate the entire ecosystem and infrastructure that Yelp is using?
[00:33:48] Unknown:
For this project in particular, we did not actually intend to open source it, and in retrospect, there are a lot of things that we would have done differently had we intended to open source it from the beginning. Concretely, we ended up building on a lot of internal tools and packages and really fitting it in well with the Yelp ecosystem. The flip side of that, though, is that it works less well externally because of that relatively tight coupling with the way that Yelp does software. Actually, the reason that we ended up open sourcing the system was really to give some examples of how we solved some of the problems, more for the benefit of people that are developing other things in open source, for example the Debezium project. I think there's probably some crossover between what they're doing and what we've done in many cases, and I think we've solved a lot of the same issues. So the hope in open sourcing it was more that some of the work we've done would be useful to other people implementing things in the open source ecosystem, rather than something that you'd necessarily want to run directly, because of that tight coupling with the Yelp ecosystem and infrastructure.
If we were to do this again, especially given how limited the development was in open source, and this goes back to the build versus buy question at the start of the project, what we'd likely do today is build it as an open source-first project. You know, hindsight being 20/20, it's easy to say, oh yeah, we definitely should have done that to begin with. But at the time, it was kind of unclear whether other companies had the same sorts of problems around streaming, and also how developed the streaming ecosystem would become in a relatively short period of time. So I actually think that what we've done is less than ideal as far as the open sourcing goes, and had we done it today, we probably would have taken a pretty different approach.
But in any case, I still hope that it will be useful, at least some of the concepts, for people implementing other things like that. And, again, I really think Debezium could be a great example of that, especially with some of the work that we've done around being able to do consistent snapshots of databases while they're running, without blocking or anything like that. I think that work, in particular, is something that could be extended to other places.
[00:36:11] Unknown:
So for anybody who's facing a project that's similar in scope or intent, whether it's a greenfield project where they're just trying to figure out what pieces of infrastructure they're going to need and how they're going to be manipulating or interacting with their data, or somebody who has an existing system and is trying to retrofit a data pipeline and data management architecture onto it, what are some of the pieces of advice that you would have for anybody undergoing that general process?
[00:36:41] Unknown:
So the general advice that I would have on retrofitting a data pipeline, or starting a data pipeline kind of project, would probably be to really take a hard look at the infrastructure that's been developed in open source in the last couple of years. I think in particular a lot of the stuff that Confluent has developed is of really, really high quality: Kafka Connect and their schema registry, and a bunch of the other open source infrastructure like Flink, Apache Beam, etcetera. I think all of that is something that can be built on really easily today compared to the situation that we were in about two and a half years ago when this project started. So if I were looking at this today, I would definitely start with a pretty thorough overview of what exists in stream processing. The other thing that I would point out is that, just in general, distributed systems are pretty hard to get right. It's really easy to look at these systems and say, oh, this looks like it's going to be relatively simple. But for a lot of the stream processing stuff in particular, it ends up being a lot harder and taking a lot longer than anybody initially thinks. So I would definitely caution that on the build versus buy side, you should almost certainly look really, really hard at the buy side today. It's much more developed than it has been in the past, and I think it's getting better all the time. And contributing to that, I think, is a great way of reducing costs, getting faster time to market, and also reducing some of the ongoing maintenance costs. I'd point out that software maintenance in general is probably 67% of the total cost, something that we talk about at Yelp pretty regularly, especially when making a build versus buy decision: if you build it, you then have to maintain it, and that's actually much more expensive than just the part of the project where you're building it. So if you're not accounting for that, it can skew things pretty significantly.
I would definitely caution against building lots of infrastructure here, and really try to use what the community has already built, and extend it if possible.
[00:38:50] Unknown:
So are there any pieces of the overall project or infrastructure that we didn't touch on that you think we should cover, or any additional subjects that you think we should talk about before we start to close out the show?
[00:39:01] Unknown:
So in general, where we're going with our infrastructure at this point is really into advanced stream processing. What I mean by that is using tools like Apache Flink, Spark Streaming, and Apache Beam to provide a richer set of primitives on top of stream processing, so that we can do really advanced real-time computations, start doing real-time analytics, and ideally start doing some real-time SQL. I think there's a lot of opportunity with real-time SQL to really change the way that a lot of modern data infrastructure is built. One of the things that came up at the Kafka Summit that was going on in New York pretty recently was the Confluent folks effectively saying that streaming could be as big as databases, and maybe bigger than databases. And I think that's actually true, especially when the primitives get a little bit better. The way that I see that going is that as we start developing some of the streaming SQL stuff, we'll be able to convert a lot of the things that were batch oriented previously, and relatively difficult to write and kind of expensive to run, into very simple streaming SQL implementations that run in real time and spit out results as they come in. And I think that's going to enable a ton of new applications, and also potentially reduce some costs and complexity. I think it's much easier to think about something like a streaming SQL statement than something like a MapReduce job.
So we're looking actively at that, and I think, overall, there's been a ton of progress even in the last six months in Apache Flink on streaming SQL. I think that's one of the most exciting areas, and that's where we're going to be spending a lot of time in the next year.
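For a rough flavor of what streaming SQL looks like in practice, here is a small, hypothetical example using PyFlink's Table API (which matured well after this conversation); the generated source table, fields, and query are illustrative assumptions.

```python
# Hypothetical streaming SQL sketch with PyFlink's Table API: a windowed count
# over a generated stream of review events. Table names, fields, and the
# datagen source are illustrative assumptions, not Yelp's setup.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A throwaway source that generates fake review events with an event timestamp.
t_env.execute_sql("""
    CREATE TABLE reviews (
        business_id BIGINT,
        rating INT,
        event_time AS LOCALTIMESTAMP,
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '10'
    )
""")

# A one-minute tumbling-window aggregate expressed as plain SQL rather than a
# hand-written batch job: count and average rating per business per window.
result = t_env.execute_sql("""
    SELECT
        business_id,
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS review_count,
        AVG(CAST(rating AS DOUBLE)) AS avg_rating
    FROM reviews
    GROUP BY business_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
result.print()  # streams results to stdout as windows close
```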
[00:40:55] Unknown:
Okay. So for anybody who wants to follow the work that you're up to and get in touch, what are some of the best ways for them to do that?
[00:41:05] Unknown:
The Yelp engineering blog is pretty much updated once a week or so, and that's got a lot of useful information about what's going on at Yelp. We try to write up all of the major infrastructure projects that we're doing on there, so that's a good way to stay up to date on what we're doing. I'm personally always happy to answer questions or talk to other people that are working on this stuff. My email is justinc@yelp.com, so if anybody has questions, I'd be happy to talk.
[00:41:31] Unknown:
Great. Well, I really appreciate you taking the time out of your day to share some of the lessons that you've learned and the work that you've done in this overall process. It's definitely a very interesting subject area, and one where it's often easy to find bits and pieces of the overall experience in various blog posts from different companies, but a holistic view of this kind of project from start to finish is usually fairly difficult to come by. So it's definitely great that you took the time to share it with us, and I'm sure a lot of people will be interested to hear about some of the deeper details that don't necessarily get covered in a blog post. Yeah. Absolutely. Thanks for having me. I'm definitely looking forward to hearing about all kinds of data infrastructure things now and in the future. Great. Well, thank you again for taking the time, and I hope you enjoy the rest of your evening. Thanks. You too. Bye.
Introduction and Guest Introduction
Justin Cunningham's Background and Role at Yelp
Overview of Yelp's Data Pipeline
Initial Challenges and Rearchitecture
Design and Implementation of the Data Pipeline
Planning and Team Collaboration
Build vs. Buy Decisions
Project Timeline and Milestones
Failure Modes and Debugging
Deployment and Monitoring Tools
Ensuring Data Reliability
Schema Management and Serialization Formats
Open Sourcing the Project
Advice for Similar Projects
Future Directions and Advanced Stream Processing
Closing Remarks and Contact Information