Summary
As a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData they rely on a variety of alternative data sources to inform investment decisions by hedge funds and businesses. In this episode Andrew Gross, Bobby Muldoon, and Anup Segu describe the self-service data platform that they have built to allow data analysts to own the end-to-end delivery of data projects and how that has allowed them to scale their output. They share the journey that they went through to build a scalable and maintainable system for web scraping, how to make it reliable and resilient to errors, and the lessons that they learned in the process. This was a great conversation about real world experiences in building a successful data-oriented business.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- Your host is Tobias Macey and today I’m interviewing Andrew Gross, Bobby Muldoon, and Anup Segu about how they are building pipelines at YipitData
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what YipitData does?
- What kinds of data sources and data assets are you working with?
- What is the composition of your data teams and how are they structured?
- Given the use of your data products in the financial sector how do you handle monitoring and alerting around data quality?
- For web scraping in particular, given how fragile it can be, what have you done to make it a reliable and repeatable part of the data pipeline?
- Can you describe how your data platform is implemented?
- How has the design of your platform and its goals evolved or changed?
- What is your guiding principle for providing an approachable interface to analysts?
- How much knowledge do your analysts require about the guarantees offered, and edge cases to be aware of in the underlying data and its processing?
- What are some examples of specific tools that you have built to empower your analysts to own the full lifecycle of the data that they are working with?
- Can you characterize or quantify the benefits that you have seen from training the analysts to work with the engineering tool chain?
- What have been some of the most interesting, unexpected, or surprising outcomes of how you are approaching the different responsibilities and levels of ownership in your data organization?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned from building out the platform, tooling, and organizational structure for creating data products at Yipit?
- What advice or recommendations do you have for other leaders of data teams about how to think about the organizational and technical aspects of managing the lifecycle of data projects?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Yipit Data
- Redshift
- MySQL
- Airflow
- Databricks
- Groupon
- Living Social
- Web Scraping
- Readypipe
- Graphite
- AWS Kinesis Firehose
- Parquet
- Papermill
- Fivetran
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes. And their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
That's I-M-M-U-T-A.
[00:01:45] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Andrew Gross, Bobby Muldoon, and Anup Segu about how they are building pipelines at YipitData. So, Andrew, can you start by introducing yourself? Sure. My name is Andrew Gross. I am the head of data engineering at YipitData. I've been with the company for about 9 years during quite a long transition from a daily deal company all the way through to an alternative data research firm.
[00:02:08] Unknown:
And, Bobby, how about yourself?
[00:02:10] Unknown:
Hi, Tobias. Thanks for having us. So I started off my career in investment banking. I was an analyst in equity capital markets at Credit Suisse for about 6 years. I say that because it's a little bit of a nontraditional course for data engineers. Since then, I've been working at YipitData for 5 years in many different roles. Right now, I'm working on Andrew's team, data engineering.
[00:02:31] Unknown:
And Anoop, how about you?
[00:02:33] Unknown:
Yeah. Hey, Tobias. My name is Anup. I have a very similar background to Bobby. I kind of started off my career in investment banking, being an analyst at Citi, and eventually, you know, about 2 years in, decided to make the switch to go into engineering. And so I started at YipitData. I've been there for about 4 years, and now I'm working on the data engineering team.
[00:02:53] Unknown:
Going back to you, Andrew, do you remember how you first got involved in the area of data management?
[00:02:58] Unknown:
Oh, boy. So this was probably about 4 years ago at this point. We'd been using a lot of Redshift, which we had gotten to from MySQL as we accumulated ever more amounts of data. While I, you know, helped a little bit with setting stuff up on Redshift, I wasn't super focused on it. But we knew that Redshift wasn't really working well for what our company needed and kind of how we were structured, which we can get a bit into later. And so the CTO kind of sat me down one day and was like, alright. I need you to basically go into a room, figure out where our data infrastructure is gonna be in a few years, and, you know, make it happen, which is a very kind of open-ended task, which is pretty great to be able to get. And so I kind of sat down and started doing research, spending a lot of quality time with videos from Spark conferences and things like videos from Netflix's data team, stuff like that, and really plotting out kind of where the future was gonna go from there. And from there, you know, started building out some of the initial pipelines, understanding a lot more about the greater ecosystem of tools, options, languages, all that sort of thing.
[00:03:52] Unknown:
And, Bobby, how about you?
[00:03:54] Unknown:
Yeah. So I joined YipitData around 2016. As I mentioned, I was sort of transitioning from a role where the most technical task you do is build Excel models. And so my first role with YipitData was as a data product engineer. I think Andrew will touch on this a little bit when he talks about sort of the history of the company. But at that time, we had engineers sort of working side by side with our product analysts to create data-driven research. I kind of owned some of the technical components of that. Since then, I've sort of gradually moved closer towards the infrastructure side of things, you know, working closely with Andrew on our data lake and sort of setting up the data platform that we use today. These days, I spend a lot of my time sort of working with third parties to set up third party data pipelines, as well as working with our finance team on sort of internal reporting and building out sort of internal reporting pipelines, cost allocations, and things like that. So that's sort of my area of focus.
[00:04:46] Unknown:
And, Anup, what was your introduction to data management?
[00:04:50] Unknown:
Yeah. So I have followed a really similar path. You know, I started at YipitData as a data product engineer. And somewhere along the lines, made the switch into working on internal tools. And one of the core tools I started to work on was Airflow and how we use Airflow to do all of our data transformations. I think we'll probably touch on this later, but the gist of it is I was very exposed to kind of Redshift as well as Airflow. And somewhere along the line, Andrew was telling us like, hey, let's migrate our stack to Databricks. And I started getting more involved in the data lake, the data architecture, the data warehouse, all that kind of stuff. And I'm still working on Airflow today and definitely still one of the main developers and managers of that system.
[00:05:29] Unknown:
Digging more into YipitData itself, can you give a bit of an overview about what it is that you're building there and maybe some of the history of the company?
[00:05:37] Unknown:
So like I said before, I've been at YipitData for about 9 years. The company has existed for probably 10, 10 and a half, maybe 11 years total. In the beginning, you might remember yipit.com, which was a daily deal aggregator. So around the time that Groupon was going to IPO, we had written a blog post basically saying, I can't remember if it was for Groupon when everyone was against Groupon or against Groupon when everyone was for Groupon, but it was basically a very counterintuitive post. And it was based off of the data we had from scraping because although we were only offering daily deals to our customers in the US, we were actually scraping all their international businesses as well. And so we had very good data to basically say, hey, Groupon tells us every day the number of deals they sell and at what price point. So you add these numbers up and get a really good idea of their revenue and how it's doing. And so we had written a blog post on this, and a bunch of investors, like hedge funds, things like that, had reached out saying, Hey. Can I get the data behind this so that I can, you know, make some investing decisions, put some position on the stock, something like that? And of course we were, you know, happy to sell it to them. In the beginning it was more of a small side business. It had, you know, thousands, maybe tens of thousands of dollars a month in revenue. But our main business was still kind of the consumer focused, you know, here's your daily deals for the day business. And over time, as we saw the daily deal business stagnating, we realized that, well, we're actually pretty good at scraping these websites and getting this data. And a lot of people want this information on public companies so they can help make investing decisions.
Around that time, one of Jim or Vince's friends, or both of their friends, basically said, hey. We're looking for some info. I think it was on maybe eBay, something like that. Can you get this information? And so we started building an eBay product, going and scraping basically every auction every day, seeing what price it sold for, if it's buy it now, all that kind of information. By the time we were done building the product, they had actually moved out of the stock, so they didn't want it. But now we were kind of sitting on 2 completed products. We had Groupon. We had eBay. And it kind of dawned on us that this is something that we can actually build a business around. Like, we're good at scraping websites. We have the technology to do this, and there's a customer base who is willing to pay for this information because they can't get it themselves. And this kind of started our journey into what is now called alternative data. You know, originally, we were much more focused on just getting the data and having a little bit of a report, a little bit of a breakdown on it. The company today is much more focused on being kind of your almost your outsourced research firm. We can get a little bit into the landscape of hedge funds in a bit, but the general structure of the company is that we have, you know, analysts who are empowered to be able to collect the data, whether it's from a website or we're getting it from third party sources, analyze it, and then produce a report that can be given to investors. Our main takeaway is not, you know, here's a data file or a data dump, but instead a report with context around the data. And then, you know, if you need to plug this into some models, we can provide, you know, a Google Sheet or Excel style file. And for the more technologically advanced investors, we can also give them, you know, here's a dump of Parquet or CSV files that you can ingest into whatever your platform is. So the company has really kind of evolved quite a bit technologically over time, kind of going from our initial, we're scraping a few sites, you know, thousands of deals per day, to now, you know, scraping millions and millions of pages, several billion data points per month, just from the web scraping side alone. So it's been a pretty fun evolution on the data engineering side of how do you capture all this data, how do you analyze it effectively, and how do you kind of, you know, make your business able to grow and not be constrained by these problems?
[00:08:57] Unknown:
And with web scraping as the primary data source, what are some of the ways that you're dealing with actually being able to scale that? Because I know that there are sites that will do rate limiting, and you then have to go down the road of going through proxies, or figuring out if there are APIs that exist that aren't necessarily public so that maybe you can just hook into the actual back-end API that they're using to feed their front-end single page app. And then once you gather the data, what have you found to be some of the most useful ways to store that effectively to then be usable for downstream systems?
[00:09:33] Unknown:
And I'll let Bobby and Anup chime in here in a bit because they also have a lot more experience doing this. I was only there in the beginning. But to start off with your question kind of around, how do you scale some of this? So we're kind of lucky that we had this existing team in place as we were doing the transition from kind of a consumer-facing website to a data company. So we already had a lot of infrastructure in place with a system we built internally called YAWS, which is kind of a little PaaS on top of AWS. It was very easy to say, I'm gonna create a package in Python. I'm going to push it to GitHub, and it's gonna get automatically deployed to however many servers I specified and automatically updated. I don't have to worry about any deployment stuff. So we could easily kind of isolate individual products and individual steps in products. And we built an internal scheduling tool called pipe app, which is basically, you know, I wanna run this task at this time. Here's the arguments to it, and then it would basically go find, like, a little function that we had defined somewhere, run it at the associated time, and then you would say, okay. After this one finishes, run these next 2 pieces, similar stuff like that. I'm sure there's very similar tools out now. This is just something we happened to build in house, and it was very helpful. In terms of scaling kind of the scraping part, you know, in Python we're using a lot of tools like Juniper and stuff like that, where it kind of handles the background concurrency for you even though Python is a bit more single-threaded.
It was easy to spin up enough AWS machines to have the raw horsepower, but, of course, like you said, you're gonna get rate limited if all your requests are coming from an AWS IP. A lot of websites won't even allow you to view any pages. They just automatically block all the ranges AWS publishes. So, of course, it means you need to start using proxies. So one of the first things I worked on for the data side was building basically a centralized proxying system where, to the developer, you would basically say, I'm going to generate a client, and the client follows the requests API. It basically is just wrapping it, and I'm just going to say what my product name is. So say my product name is eBay. And then on the back end, the proxying system is basically saying, okay. Here's all the proxies you defined for this system. You know, we've scheduled them in order so you don't reuse the same one multiple times. You have ways to say mark this one bad, remove it. You can filter on region or the provider of the proxies, things like that. In terms of actually, like, getting the proxies, it was just going around the Internet, finding people who are doing virtual hosting, at least initially doing virtual hosting, other cloud providers, things like that. It's evolved quite a bit since then, but I've kind of moved away from that part, so I don't wanna go too deep on it. But there are certainly providers out there that can help you with this that have a network that gives you a really wide reach and a lot of fine-grained control over what you're able to represent yourself as, because that's very helpful especially for some sites where, you know, say you want to look at a travel site and you want to know what does it look like in the US versus Europe? Do I get different prices? Are there different deals? Are they showing me different things? Stuff like that. So on the proxying side, it was a lot of infrastructure work to make it very easy to automatically proxy requests without a user needing to sit down and fill out the proxying dictionary for the Python requests library or something like that every time, because that becomes very painful and very, you know, it's an easy task, but if you force every engineer, every analyst to do it, you're just asking for them to get frustrated. I don't know if Bobby and Anup, you wanna jump in here a little bit more on the scaling tools there.
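To make the proxying setup Andrew describes a bit more concrete, here is a minimal sketch of a requests-wrapping client that rotates through a proxy pool per product. The class, its methods, and the proxy URLs are all hypothetical illustrations, not YipitData's actual internal API.

```python
# A rough sketch of the kind of requests-wrapping proxy client described above.
# Class, method, and proxy names here are illustrative, not the real system.
import itertools
import requests


class ProxyClient:
    """Wraps the requests API and rotates through a pool of proxies for one product."""

    def __init__(self, product, proxies):
        self.product = product                 # e.g. "ebay", used for tagging and metrics
        self.proxies = list(proxies)           # e.g. ["http://user:pass@host1:8080", ...]
        self._pool = itertools.cycle(self.proxies)
        self.session = requests.Session()

    def mark_bad(self, proxy):
        # Drop a proxy that is blocked or failing and rebuild the rotation.
        self.proxies = [p for p in self.proxies if p != proxy]
        self._pool = itertools.cycle(self.proxies)

    def get(self, url, **kwargs):
        # Same call shape as requests.get, but each request goes out through the next proxy.
        proxy = next(self._pool)
        try:
            return self.session.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=30, **kwargs
            )
        except requests.RequestException:
            self.mark_bad(proxy)
            raise


# Usage: the analyst only names the product; proxy selection happens behind the scenes.
client = ProxyClient("ebay", ["http://proxy-a:8080", "http://proxy-b:8080"])
resp = client.get("https://www.ebay.com/sch/i.html?_nkw=widgets")
```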
[00:12:32] Unknown:
Yeah. Actually, one of the things I wanted to kind of touch on and give some context to is not so technical in nature. But I do think it was important for us, especially in the early days as we were trying to scale the data business overall. And I think the history is important because we started out, as Andrew mentioned, sort of collecting data from Groupon for our ecommerce business. This data wasn't exactly structured in a way that kind of lent itself towards data analysis. It was sort of structured for another purpose altogether. And so as we're building out this side business, we're sort of working with data that's not in, like, an optimal structure at all. And we sort of just have an analyst working with the data and trying to create a report, basically, out of it to predict, okay, can I take all this raw Groupon data and figure out what their GMV is, which is, you know, gross merchandise volume?
And so I think one of the early innovations that we kind of came upon, because this was a terrible process, was that we needed to be very clear about separating the raw observed data that we're observing on the Internet from the actual analysis that's being done. And it sounds like a pretty simple thing, but in the early days, that was something we were not doing. I think one thing to emphasize is that what has helped us is creating that very clear separation. Okay. These are the facts that I'm observing. So if I go to the eBay website every day, this is what I saw on the page. I'm not going to make any judgments. I'm just gonna record the data. It's gonna be append only. And then my analyst can kind of pick it up from there and apply any judgments he or she wants, and they can, you know, make changes to their assumptions. They can, you know, rerun their analysis, and they don't have to worry about any judgments that might have been made at the time that the website was scraped. So I think that's, like, not a technical thing that you might think about, but it was very important for us to create that clear separation.
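As a rough illustration of the separation Bobby describes, a minimal PySpark sketch might look like the following: the observed data is written append-only with no judgments applied, and every analyst assumption lives in a downstream derived table that can be rerun at any time. The table and column names are invented for the example.

```python
# Illustrative only: a minimal PySpark sketch of the raw-vs-derived separation.
# Table, column names, and the GMV math are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Observed data: record exactly what the scraper saw, append-only, no judgments.
raw = (
    spark.read.json("s3://example-bucket/raw/groupon_deals/2020-10-01/")
    .withColumn("date_added", F.lit("2020-10-01"))   # when we observed it, nothing more
)
raw.write.mode("append").saveAsTable("raw.groupon_deals")

# 2. Analysis: all assumptions (filters, price math) live downstream, so they can be
#    changed and rerun without ever touching the observed data.
gmv = (
    spark.table("raw.groupon_deals")
    .where(F.col("deals_sold") > 0)                            # analyst judgment
    .withColumn("gmv", F.col("deals_sold") * F.col("price"))   # analyst judgment
    .groupBy("date_added")
    .agg(F.sum("gmv").alias("daily_gmv"))
)
gmv.write.mode("overwrite").saveAsTable("derived.groupon_daily_gmv")
```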
A few other kind of innovations happened over the last, you know, few years. Originally, we had all of these product teams sort of working separately. We were able to migrate from MySQL to Redshift, which at the time was, like, a really big jump for us. That allowed our analysts to query the data a lot faster, but it still sort of siloed off each individual dataset. And I think as the company has matured, it became more and more important that we were able to compare datasets between, you know, different companies. And in the early days, that was really painful to do with a lot of Redshift clusters. I don't wanna go into the details yet, but Andrew can talk about it a little later. It sort of led us to kind of build out a more modern data lake. And that really opened the door to us starting to analyze data beyond, you know, web-scraped data, where now we can incorporate credit card data, email data, app data, you know, you name it. And our analysts sort of have just an endless pool of data sources. And we view the website or the Internet as just yet another data source that they can access to the extent there's valuable data that they can extract from that.
[00:15:20] Unknown:
And with being able to load all your data into the data lake now, and you have this robust infrastructure for being able to handle web scraping and using that as a data source and for data gathering, what are some of the other sources that you're looking to for building up these datasets of alternative data that you're then able to sell to your consumers?
[00:15:40] Unknown:
Yeah. So let me just continue a little bit. I mentioned credit card data, email data. These are datasets that have been, you know, sort of worked on and used by the investment community for a while now. But I think one thing that we have discovered is that these datasets are challenging to work with. And most of our clients just have a couple of engineers at most; we're working with hedge fund clients that have very few resources to do this type of work. And so we found that even for datasets that are sort of already being, you know, absorbed by the market, there's a lot of value in just getting that data in a clean format, analyzing it, and producing a report. Literally, our product is essentially a PDF report that we send to some of the biggest hedge funds in the world. And so the value is not necessarily at the raw data level. It's also more at the cleaned data level, where we've already done some analysis, and we're gonna give you our insights rather than just drown you in raw data and, you know, have you figure it out from there.
[00:16:34] Unknown:
With this combination of needing to have robust infrastructure for being able to collect the raw data, but also having the team of analysts able to interpret and digest and present the data to your end users, can you give a bit of an overview about the way that your data teams are composed and the structure that you use to ensure an effective communication pattern and to be able to maintain agility and velocity, given the number of different data sources that you're working with and the rate of change for the websites in particular?
[00:17:07] Unknown:
I think this is actually probably one of the biggest, I mean, innovation makes it sound like we're the only ones who came up with this, but it is one of the biggest factors, I think, in the success of the company: kind of the approach we had to our analysts. So at a lot of companies you may hear about, okay, there's the data team, there's the analyst team, there's the research team. There's a hop between each of them, and you need to, like, pass things off and ask them, you know, maybe write up a spec, or maybe you've outsourced the web scraping or something like that. And each of those hops adds a lot of friction and a lot of loss of context between them, where you can spend a lot of time just kind of ironing out the fine details even without actually, like, getting any useful information. And what we've focused on at YipitData is training the analysts and empowering them as much as possible. So we mentioned before that Bobby and Anup started as data product engineers.
Our initial team structure was something along the lines of, you know, maybe we have an eBay team. There's an analyst or 2 on it, and maybe they're part of a larger team, and there might be an engineer assigned to that larger team. And so they're gonna be the ones in charge of helping scrape the website. They're gonna be sitting next to the analyst. The analyst might say, hey. You know, I see that there's a new button on here. Can we record where we're seeing that? Or maybe, you know, can you see what happens if we get this in a different language? Can we check the mobile app? That kind of stuff. And the engineers could do that. But, again, there's a conflict between the goals. So the analyst, of course, wants to get the data. They want to do the analysis. They want to find out what's going on and pass that along to our clients.
The engineer wants to scrape a website. They wanna, you know, write some good clean code, maybe get some tests going, like, build something that is satisfying to an engineering mindset. And those can sometimes be in conflict, especially, like you say, with web scraping. Web scraping is very difficult to do in a, like, very formalized manner because you're basically building on sandy, shaky foundations. The website you might be testing against can be changing at any time. There may be errors introduced by your code. There may be errors introduced by the other end. And it's hard to set assumptions that can't be invalidated because you're not really in control of the whole system. And so what we realized is that what's more important than having, you know, the best tool for scraping a website is instead having the most agility around scraping the website. And that pushed us from having the engineers on the team doing the scraping, maybe having, you know, an Asana board or something like that, some project management, some queue of tasks to go through, to having the analysts be able to do the web scraping themselves.
And that's a really big ask, to just say, oh, yeah. By the way, go scrape this website. It is something that we focus a lot of time and energy on, to build tools and train the analysts so they're able to scrape websites more effectively, and not just websites, but kind of do all their work. So analysts at YipitData start and come in with a very intense, like, month, month and a half long training process where everyone learns Python, everyone learns SQL, everyone gets a walkthrough of scraping websites, building scrapers, and analyzing the data using tools like Airflow. And so they're able to go very much from scratch of, I wanna scrape this website, I think there's interesting data, all the way through to here are the clean tables and then, you know, some output from this that I either wanna send to a client or send off to a research analyst or have someone else collaborate with me.
We built some really incredible tools for doing this. One of them is a tool called ReadyPipe. So previously a lot of our web scraping was done, you know, there might be a GitHub project for each site you're scraping. We had kind of a framework, like I mentioned before, called pipe app where you could kind of say this is the function I want you to run with these inputs. You chain those together. You say, okay. I'm gonna go to the site map and then I'm gonna queue up an item for each item in the site map and go to each one of those, or go to each category or something like that. However, it was all very much like a local project. So you had to have a local development environment, then you had to push upstream, then you might have to figure out, okay, where's the, you know, I have maybe some Graphite graphs monitoring the success rate or failure rate of some of these things. I have to go find those somewhere else. And, you know, what happens if my code is out of date, my environment's out of date, I can't load this? So we realized that if we wanted the analysts to be able to focus on their core competency of gathering the data, we need to build a platform that supported them, which is what ReadyPipe is for. Originally, the goal was to build it as a SaaS product that we'd kind of sell to other people. We've kind of backed off from that and mostly used it internally at this point, but it has been very successful. It's basically a JupyterLab interface for your programming, but the back end is very much, like, queued in through SQS tasks or our own queuing system in pipe app, kind of providing an all in one experience where you can write the code. You can immediately run it, get quick feedback, and then you can say, alright. I want you to run this across, you know, 10 machines with a million tasks starting every day on Thursday, and then put the data into this table or these tables, things like that. Along those same lines, thinking of things like, we don't want an analyst to have to worry about, oh, is this field an int or a string? Do I have to, like, set the schema for this? Instead, we want them to be like, alright. Here's the fields I got. Send them along, and I wanna start scraping the next thing. And so we built a system that'll automatically take the data they're passing. All they have to specify is the table name and the database, and it'll automatically figure out the types for them. We use Kinesis Firehose quite a bit. So it'll take kind of all the small outputs, JSON data, and automatically convert them to Parquet, do the kind of compaction you need to get really efficient longer term storage.
And, you know, if there's issues with the schema, it'll mark it and flag it so that they can recover it later, things like that. So they are focused on their core competency, which is I want to be able to scrape and get this data and analyze it, as opposed to, my pip install failed for the 15th time because I have, you know, Python 2.7.8 instead of Python 2.7.13, or they fixed some SSL issues or something like that. Because, like, those are problems that engineers can solve, but we don't want our analysts wasting time on them. Same thing with database migrations like I mentioned before. So ReadyPipe is very much a platform designed to emphasize what you need to do for scraping in terms of having monitoring right in front of you, logging, you know, scaling of your stuff, and trying to hide away all the annoying parts of, you know, an analyst doesn't care about AWS auto scaling groups or whether they're running on Spot, on an m4 or an m5, or how we're maybe Tetris-ing Docker containers together or something like that. They just care about, I wanna go faster or slower on this process, and whether my data is or isn't working.
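As an illustration of the "no schemas to declare" idea, here is a small, hypothetical sketch of inferring SQL column types from a batch of scraped records. In the real system this kind of inference happens as part of the Firehose-to-Parquet pipeline rather than in a single helper function like this one.

```python
# Illustrative sketch: analysts just emit records, and the platform picks a column type
# per field. Function name and type mapping are assumptions for the example.
from datetime import datetime

PYTHON_TO_SQL = {int: "BIGINT", float: "DOUBLE", bool: "BOOLEAN",
                 str: "STRING", datetime: "TIMESTAMP"}


def infer_schema(records):
    """Walk a batch of scraped records and pick a SQL type per field."""
    schema = {}
    for record in records:
        for field, value in record.items():
            if value is None:
                continue
            sql_type = PYTHON_TO_SQL.get(type(value), "STRING")
            # If two records disagree (e.g. int in one, string in another), fall back to STRING.
            if schema.get(field, sql_type) != sql_type:
                schema[field] = "STRING"
            else:
                schema[field] = sql_type
    return schema


records = [
    {"listing_id": 123, "price": 19.99, "title": "widget", "buy_it_now": True},
    {"listing_id": 124, "price": 5.00, "title": "gadget", "buy_it_now": False},
]
print(infer_schema(records))
# {'listing_id': 'BIGINT', 'price': 'DOUBLE', 'title': 'STRING', 'buy_it_now': 'BOOLEAN'}
```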
[00:23:04] Unknown:
Yeah. Just to add some additional context to what Andrew is describing, I think one of the benefits of really thinking more about, do we need engineers to actually work on these product teams, or can the analysts just do it themselves? One thing that we discovered is that as we built better tools and sort of established processes for building web scraping systems, we realized that the product engineer role was becoming a smaller and smaller piece of the puzzle. It just wasn't worth it. Like, the friction and the information that's lost in translation between an analyst, you know, creating a spec of, hey. This is the website I wanna scrape. These are the data points I need. Now go build it. That works. You can do that. But the problem is each engineer would sort of build a very bespoke web scraping system. They didn't really care about the data because they weren't the ones that had to analyze it after the fact. And it was sort of like, okay. Let me just do what I have to do to keep this working, but, you know, the analyst will figure it out from here. I think one of the benefits is that when you remove the engineer from that equation and you create more of a platform where the analysts can just define what they wanna scrape on their own, they have a lot more ownership of that data. And if you're analyzing data scraped from the Internet, it's really important that you understand where it came from, what assumptions might have been made, all of the different edge cases that you will care about as an analyst when you're studying such a sloppy dataset. It's just so empowering when they can just go to the website themselves. They can build, you know, parallel processes if they wanna scrape, you know, competitor sites, if they want to scrape something slightly different. They don't have to, you know, beg an engineer to help them. They can just do it themselves. And I think that has been so powerful, and it has allowed our analysts to really thrive working with such a messy dataset. And it's something, frankly, that our competitors have had a lot of trouble with, because a lot of hedge funds, if they didn't need to work with us, they wouldn't. They would just scrape this data themselves and go on their merry way. But we've just realized that there's a lot of kind of challenges in working with this data. And the fact that we've been able to train our analysts to kind of write Python code themselves, you know, figure out how to build a scraper themselves, it's really taken them to the next level where they really understand the data at a very deep and intimate level. And that allows them to produce better research from that data.
[00:25:18] Unknown:
Particularly with the fact that the source datasets are so subject to change unexpectedly and at short notice, how do you handle issues around data quality and alerting if there are fields that are no longer available? And then how does that propagate to the ways that you define and manage the schemas and decide what is the least common denominator of attributes that you want to be able to pull from these sites for being able to perform the downstream analysis?
[00:25:49] Unknown:
One thing I wanna touch on, which you just asked about, is what do you do if, you know, obviously, you're scraping a website, what happens if a certain field that you're relying on is no longer available? Well, that was a common problem when we had engineers sort of owning the system, because an analyst would say, hey. What's going on? How come this is null everywhere? And then they would have to kind of find the engineer who built it. Maybe that engineer is not even working here anymore. And then you'd have to go through and figure out, like, okay. Why is this column null, this and that? Whereas now, as the analyst just owns the entire system, they're able to pay much more attention to kind of these things that can go wrong with websites. And they're the ones that are spending hours looking at the site, really understanding every little piece of information that's possibly valuable from there. And so I think just having, like, the engineer removed from that equation is one of the first steps of being on top of this and making sure that, as the website changes, you can change the web scraping system to kind of reflect that. In terms of how do you manage schema mismatches, I think that goes back to what I said earlier on about really treating the observed data as separate from the downstream analysis. So if, for instance, there's a column that is, you know, no longer populated, that column will never disappear from our data, but you might have to add a new column to capture, like, a different piece of information that you're gonna have to rely on going forward now that the original column's no longer there for you. And then downstream, you can just kind of union the 2 datasets from, you know, before the website changed and after the website changed. And it's in the derived tables where you'd kind of, you know, figure out which columns you need to rely on and make whatever adjustments you need to make. And this is also very common if, say, there's an outage where you might have missed part of the month of scraping, you know, eBay's website. And so you'll handle all those kinds of adjustments downstream.
And the observed data is just gonna be observed. It's what I was able to capture at the time I attempted to capture it, and it's nothing more, nothing less than that.
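A hypothetical Spark SQL sketch of the downstream pattern Bobby describes, where the observed table keeps both the old and new columns forever and a derived table unions the two periods, might look like this. The table names, column names, and cutover date are made up for the example.

```python
# Illustrative sketch of the "union the 2 datasets" pattern done in a derived table.
# The observed table never loses columns; the derived query decides which one to trust
# for each period. All names and dates below are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE OR REPLACE TABLE derived.ebay_prices AS

    -- Before the site change, the price lived in `list_price`.
    SELECT date_added, listing_id, list_price AS price
    FROM raw.ebay_listings
    WHERE date_added < '2019-11-30'

    UNION ALL

    -- After the change, `list_price` went null and `display_price` took over.
    SELECT date_added, listing_id, display_price AS price
    FROM raw.ebay_listings
    WHERE date_added >= '2019-11-30'
""")
```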
[00:27:46] Unknown:
The one thing I would add to that is that, yes, the analysts are kind of responsible for handling all of these kinds of data quality issues, but I think the engineers have also done a lot to make that very easy for them. One of the things that we kind of invested in, in our scraping system altogether, is that we automatically generate kind of like an audit log of every single action our scrapers take: what item they pulled off the queue, what request they made, what was the status of that request, was there an exception raised? All of that is kinda streamed into tables, observed tables that our analysts can then do an archaeological dig through and see what exactly went wrong. I think it's an example of, you know, because we've kind of removed the engineers from the equation of managing these systems, they can take a step back and deliver something that's like a wide-scale improvement on what they can do, and just make the analyst's life a little bit easier so that, you know, they can automatically detect when columns are null or records aren't being added, because we have these tables and we can build, you know, dashboards or daily email reports kind of summarizing this information for them to digest. And it's all standardized and consistent and easily accessible, whatever product you're kind of building.
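To illustrate the kind of per-request audit logging described here, the sketch below wraps each scraper request and streams an audit record to a delivery stream. The record fields, the Firehose stream name, and the wrapper itself are assumptions for the example, not the actual implementation.

```python
# Illustrative sketch of streaming one audit record per scraper request.
# Field names and the "scraper-audit-log" stream are hypothetical.
import json
import time
import uuid

import boto3
import requests

firehose = boto3.client("firehose")


def audited_get(url, run_id, queue_item):
    """Make a request and stream an audit record describing exactly what happened."""
    record = {
        "run_id": run_id,            # ties this request back to the job that spawned it
        "queue_item": queue_item,    # what was pulled off the queue
        "url": url,
        "requested_at": time.time(),
        "status_code": None,
        "exception": None,
    }
    try:
        resp = requests.get(url, timeout=30)
        record["status_code"] = resp.status_code
        return resp
    except requests.RequestException as exc:
        record["exception"] = repr(exc)
        raise
    finally:
        firehose.put_record(
            DeliveryStreamName="scraper-audit-log",   # hypothetical stream name
            Record={"Data": (json.dumps(record) + "\n").encode()},
        )


# Usage: every scrape goes through the wrapper, so the audit table gets one row per attempt.
response = audited_get("https://www.ebay.com/itm/123",
                       run_id=str(uuid.uuid4()), queue_item="item:123")
```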
[00:28:53] Unknown:
The mention of audit logs also brings up the subject of lineage tracking and managing the data catalog and being able to annotate the different pieces of information and understand how that can be reflected throughout the analysis. So I'm wondering what you're using for being able to manage the overall lifecycle and perspective of this particular record that I'm working with in this Jupyter notebook that's maybe 3 steps removed from the actual source data. How did that actually propagate from the original website, and what was the timestamp that it was generated, so that I can understand, you know, that this field isn't populated anymore, but it's because this was collected 5 days ago before the website deployed a new version, and just being able to track the overall metadata and lineage of all that information that you're working with.
[00:29:43] Unknown:
I guess there's kind of 2 parts to this. One is how do we kind of bring the data into the system with this information, and the other is how do we kind of structure our analysis of the data we have, keeping track of it. So the first part definitely gets back a bit more to those metadata tables that Anup and Bobby had mentioned. We call them pipe app tables because that was the system we'd used for a lot of this in the first place. But, basically, every record that comes in gets tagged with basically what we call, like, a run ID, or basically some idea of, like, what kicked this off. It comes in with what we call a date added, which is just a timestamp of this is when we recorded this record or basically pushed this record upstream, from the very moment we got it. And then the other metadata tables, the audit logs we had, basically have that same ID propagated through there, so you're very able to trace:
I started with this command, it spawned these 2,000 other commands, and each one of these started failing on this date. You can reconstruct all that from those audit logs, which, again, is very helpful for what Bobby said of, like, the archaeological dig style. It's like, okay. I need to go back and figure out what happened and when, because we do record all that information, which has been very helpful. In terms of how this is kind of managed for, like, the greater product, it is more up to each team. So we kind of keep all of this data available. We don't truncate any of this information.
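As a hypothetical example of that kind of trace, an analyst's "archaeological dig" could be a query over the audit tables keyed on the run ID; the table and column names below are invented for illustration.

```python
# Illustrative only: reconstruct what one run did from the audit/metadata tables
# by following its run ID. Table and column names are not the actual pipe app schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

run_id = "example-run-id"  # the ID stamped on every record and audit row for one kickoff

spark.sql(f"""
    SELECT date_added,
           COUNT(*)                                               AS requests,
           SUM(CASE WHEN exception IS NOT NULL THEN 1 ELSE 0 END) AS errors,
           MIN(requested_at)                                      AS first_request,
           MAX(requested_at)                                      AS last_request
    FROM audit.scraper_requests
    WHERE run_id = '{run_id}'
    GROUP BY date_added
    ORDER BY date_added
""").show()
```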
If a team is saying, okay, on November 31st, 2019, and, you know, that's not actually the date, but whatever, this website changed and we no longer started getting this, so we were able to use a backup endpoint. We will often be keeping multiple things together, and that'll be reflected in a couple of places. It might be reflected in some of the ETL queries that they have, where there's gonna be a comment saying, you know, case when. And there's a note in there saying, like, we lost access on this date, which is why we're splitting here. It might be that we say to our clients, hey. We lost access to an endpoint.
Here is our new analysis based off this other one, and we're just not gonna be using this other one. For those kind of higher level questions of, you know, when did this piece of data change, it is, again, up to the analyst to be able to monitor that and be able to kind of understand and have, like, basically a runbook for it, which is part of the reason why we like moving to Databricks.
[00:31:50] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt. Your mention of storing the history of all these elements also brings into question the lifecycle management of the data, and what the longevity of it is and how long it's useful for, particularly given that you're working with financial institutions that might want to backtest theories about their investments from historical data?
[00:32:45] Unknown:
So for the web data that we pull in, we pretty much keep that forever. Maybe there's gonna be a discussion about 10 year long datasets that, you know, maybe aren't useful anymore, especially if the website's changed quite a bit. But it's really more on a case by case basis. Like, all the data that comes in, like I said before, immediately gets converted to Parquet, compressed, and put into S3 in a longer term storage format. You can use intelligent tiering to decrease your costs for kind of storing a lot of that data. But, you know, it is not uncommon, like we had mentioned before, that, you know, maybe you have to change the methodology. Maybe you need to reassess the balance of, say, 2 categories or something like that. And you wanna actually be able to say, even to ourselves, how does this backtest over our historical data over 2, 4, 6, however many years, depending on the product? And so while we're not as often delivering that old information directly to the client in terms of the raw data, the information about, you know, essentially, this is the table of the information used to produce the report in this month in 2018 or something like that, we still keep all of those around, basically just tagged with the date, because you're always going to want to be able to understand what went into your process and what information you were working with, so that when you make changes, you can kind of see how it would have affected your previous scorecard, things like that.
[00:34:00] Unknown:
Digging deeper into the overall platform that you've built, can you describe the overall architecture of what you're working with and maybe some of the ways that it's evolved and changed since you first began working with it and the goals have shifted?
[00:34:14] Unknown:
Sure. I'll kick this off, and then I'm gonna probably hand it off to Anup for some of the more interesting Airflow stuff, because he's done a lot of really cool work there. Like we had mentioned before, we were on Redshift. This is back before we were even using Kinesis Firehose. We were just doing nightly dumps from MySQL to Redshift at first, just to get some faster processing for some of these large datasets, which at the time were hitting, you know, a few billion rows, which is not huge by any means now, but for us it was at the time, and certainly large for MySQL. We realized, okay, this isn't going to work. As we moved to using tools like Kinesis to do append only, we had these Redshift clusters, which, of course, you can expand; with a Redshift cluster, you can add more nodes.
However, it wasn't necessarily the perfect platform for us. It was a fantastically fast tool when it had the right care and feeding, when you set the dist and sort keys on your tables properly, when you had, you know, aligned your various tables and how you were inserting into them and when you were running your queries and what the concurrency on your platform was. However, we had the problem of we have 10, 20, 30 teams that we don't want to interfere with each other. So each one of them is getting their own cluster, and it might only be 2 or 3 nodes with a couple terabytes of storage. And, you know, that's a lot of operational overhead to see that everything is tuned correctly. You know, sometimes queries get stuck and you have to unlock tables. And for a while, I was the one running around rebooting nodes or going in as an admin and unlocking those things, which is just a really unpleasant experience because, again, you're taking the power away from the analysts when they have to go basically wait on someone else. The other problem with it was Redshift at the time, and I know this is not necessarily the case now, but at the time, there was one way to scale. You scaled by adding nodes. So if you need more CPU power, add a node. If you need more storage, add a node, which meant that our costs were also pretty out of whack depending on what each product was doing, because sometimes they might need more processing, sometimes they need more storage, and sometimes they weren't sure how much they were going to need in the future, and planning that out started becoming an exercise in itself. And Redshift, at least when we were using it, adding and removing nodes was a kind of big deal in the sense of, alright, we need to stop our data flow coming in or, like, queue it up because, like, the cluster is gonna be down. It's not gonna be accepting reads or writes, which is a very painful process to go through. And so we were kind of looking for what solutions do we have from here. So we realized, what are the parts we like? We like the append only thing. We like kind of streaming it in. Our analysts like being able to say, okay. I can make a change. I can start gathering data from this website, and within, you know, 5 minutes, 10 minutes maybe, I can see, you know, what data is coming through. Am I gathering this? Is this working? Being able to minimize that kind of read eval print loop or OODA loop, however you wanna phrase it, so the analyst can make decisions quickly and kind of focus on the product and iterating. But we also realized that we didn't really like Redshift because of the management of it, and how we were using it and scaling it just wasn't a good fit for how it was designed. And so that's when we started looking at tools like Spark and Parquet. Like, we were really excited about the idea of being able to store data in S3 because we had had problems in Redshift of, we have to increase this cluster because it has a lot of historical data, even if much of it isn't being accessed most of the time. The idea of trying to manage archiving tables and bringing them back in as needed was kind of a pain. Also, looking towards a future where it's like scaling a cluster based on how much performance you need, as opposed to, you know, how big the cluster can be at some point in the future if you need to add nodes, was very attractive. And that led us to Spark, specifically Databricks. As we were kind of setting up and working with Amazon EMR, there were definitely a lot of rough edges there.
And kind of when we stumbled across the Databricks platform, we realized that this was going to kind of allow us to accelerate the platform that we had for our analysts by several years, instead of building it ourselves potentially on top of EMR, kind of going it alone. And that's when we kind of realized, alright. We're gonna keep the kind of streaming data stuff. We basically wrote our own data pipeline that's going to take this JSON data, convert it to Parquet, update the information in the Glue metastore, keep it partitioned by day, and have all this information up to date. Now we just need a tool that's going to be able to read that information in. That's where we started using Databricks.
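A simplified sketch of the kind of job described here, reading a day's JSON output, compacting it into day-partitioned Parquet, and registering it as a table in the Glue-backed metastore, might look like the following. The bucket paths, table names, and compaction factor are placeholders, not the actual pipeline.

```python
# Illustrative sketch of the JSON-to-Parquet step: not the real pipeline, just the shape of it.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

day = "2020-10-01"

(
    spark.read.json(f"s3://example-firehose-bucket/ebay_listings/{day}/")
    .withColumn("date_added", F.lit(day))
    .coalesce(8)                                        # compact many small Firehose files
    .write.mode("append")
    .partitionBy("date_added")                          # keep it partitioned by day
    .option("path", "s3://example-data-lake/ebay_listings/")
    .saveAsTable("raw.ebay_listings")                   # registered in the Glue-backed metastore
)
```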
The point we got to was, you know, we set up Databricks where we have clusters. You know, you can access this data. You have, you know, this nice notebook interface that was a lot better than kind of the SQL IDE style with people kind of keeping random notes on their computer for what queries they were using. However, we needed to really think about, now that we have this platform that is kind of wide open to us, with almost no constraints, you know, in my mind, Databricks is built for people who already had opinions about what they wanted to do. So if you wanted to read in CSV data or Parquet or ORC data, you could do that. If you wanted to store it as a table, that's fine. If you didn't want to, you know, you could just work with the S3 location. You can do whatever you want, which wasn't really the case for our analysts. Like, we don't want them thinking about sizing their cluster or, you know, do they want SSD, or do they want m4, m5, or Spot? We just want them to know that I want a bigger or faster or smaller cluster, and that's about it. And so we kind of really spent some time thinking through, what is the platform we can build on top of Databricks that's going to make our analysts really effective?
[00:38:57] Unknown:
Yeah. I think that's a really interesting segue into kind of the tools that we've built on top of Databricks. So just slightly backing up, one of my entry points into Databricks and kind of working with Andrew on this was I was managing Airflow at this time. And in our Redshift world, we had our Airflow, you know, system that was essentially running a bunch of queries on our various clusters. How our analysts kinda interfaced with this tool was that they would write a giant JSON blob of various SQL statements and somewhat describe a dependency chain of, hey, run this query after this query, and, like, each query is essentially a CREATE TABLE AS SQL statement.
And that worked, and that kind of ran into several of the problems Andrew was describing, where, okay, all of a sudden I'm running this really expensive query on this Redshift cluster, and it's fighting for resources, that would interrupt kind of our streaming solution, or the cluster would be, like, unusable for analysts to do kind of some of their EDA-type work. And so once we kind of figured out with Databricks, hey, here are, you know, somewhat preset clusters, basically just a small, medium, or large kind of T-shirt size of a Spark cluster, that was very easy for analysts to use. We started to think about, okay, how do we move Airflow to this system, so that Airflow instead would use Databricks and Spark instead of Redshift? And it was an interesting time because we were starting to see some really cool use cases for a more complex system than the one we had provided.
One of the things that was coming up during this time was that we started to use email receipt data. And so having access to that data across all kinds of various products was really important, but very difficult to do in Redshift because we had no way to propagate it. But now that we kind of have this shared centralized data lake in S3 with all of our data, it was very easy in Databricks to start using these datasets across a variety of systems. What we realized is what analysts loved about Databricks was that kind of fluid notebook interface where they can jump in and describe, using SQL or even some Python code, the transformations. And they still wanted this way to use Airflow and essentially sequence their transformations in some order. So what we kind of built is a way to essentially, you know, let the analysts write their notebooks and the transformations that they care about inside those notebooks.
And then for them to simply describe, hey, run this notebook after this other notebook, or after these 2 notebooks, and kind of have them essentially describe that dependency chain without thinking too much about the internals. And behind the scenes, you know, we kind of married the Databricks API with some custom Python code and Airflow plugins to essentially generate Airflow DAGs dynamically. And so for our analysts, their entire kind of workflow to ingest or analyze the raw data coming in from ReadyPipe or third party vendors into that clean derived data is all done in notebooks. And we've given them kind of shortcuts to essentially deploy or manage their DAG just within other kind of shared notebooks that they can use. And it's been really cool to see, because now we have a much more unified experience for analysts. Before, you know, a lot of the logic and business logic analysts were working on was very much siloed. Sometimes it was just in a random kind of text file on their device or on their computer. Or sometimes, you know, it's documented in a random place that maybe other analysts on different teams wouldn't even know about. But now that that's all in notebooks, we can easily share this content, this logic, across all these teams, and they can all mutually benefit. And so we've seen, I think, huge advancements in our ability to analyze data and essentially propagate best practices through the system. And I think the other, I guess, understated benefit in all of this is that our Redshift problem of resource contention kind of went away, because with Airflow now, when it interacts with Databricks, it essentially spins up a dedicated Spark cluster for the duration of that notebook job. And so no longer were analysts having to fight for resources or be concerned that the query they write wouldn't run.
Certainly there are still a lot of optimization questions in terms of how they can better tune their queries, and over time they get better at that. But by and large it was a step-level improvement over the system we had before, because with Databricks they simply describe their transformations, specify the dependency chain and what kind of cluster they want, and that's it. They have a full ETL workflow without any engineering support, and a fairly reliable one at that.
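To make the notebook-sequencing idea more concrete, here is a minimal sketch of how a platform might turn an analyst's declared dependency chain into an Airflow DAG of Databricks notebook runs. This is not YipitData's actual implementation: the notebook paths, cluster presets, and spec format are hypothetical, and it assumes the apache-airflow-providers-databricks package and a configured Databricks connection.

```python
# Hypothetical sketch: generate an Airflow DAG from an analyst-declared list of
# notebook tasks and their upstream dependencies. Assumes the Databricks
# provider is installed and a "databricks_default" connection exists.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

# T-shirt-sized cluster presets (illustrative values only).
CLUSTER_PRESETS = {
    "small": {"spark_version": "7.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2},
    "large": {"spark_version": "7.3.x-scala2.12", "node_type_id": "i3.2xlarge", "num_workers": 8},
}

# What an analyst might declare: notebook path, cluster size, upstream tasks.
NOTEBOOK_SPECS = [
    {"name": "ingest_receipts", "path": "/Analysts/receipts/ingest", "cluster": "small", "upstream": []},
    {"name": "clean_receipts", "path": "/Analysts/receipts/clean", "cluster": "small", "upstream": ["ingest_receipts"]},
    {"name": "merchant_trends", "path": "/Analysts/receipts/trends", "cluster": "large", "upstream": ["clean_receipts"]},
]

with DAG(
    dag_id="receipts_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = {}
    for spec in NOTEBOOK_SPECS:
        # Each task gets its own ephemeral cluster for the duration of the notebook run.
        tasks[spec["name"]] = DatabricksSubmitRunOperator(
            task_id=spec["name"],
            databricks_conn_id="databricks_default",
            new_cluster=CLUSTER_PRESETS[spec["cluster"]],
            notebook_task={"notebook_path": spec["path"]},
        )
    # Wire up the dependency chain the analyst described.
    for spec in NOTEBOOK_SPECS:
        for upstream_name in spec["upstream"]:
            tasks[upstream_name] >> tasks[spec["name"]]
```

Because each task submits its own ephemeral cluster, two analysts' DAGs never compete for the same warehouse, which is the resource-contention fix described above.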
[00:43:07] Unknown:
For being able to use the notebooks as the unit of compute within the Airflow DAG, have you also looked at using something like Papermill to parameterize them, and maybe having the outputs of one notebook flow into the parameters for the next notebook?
[00:43:23] Unknown:
We have not used that too much. One thing that we've leaned on, and this is somewhat of a cultural thing, is that we treat each step in the Airflow DAG, each task, as very independent of the others. This is a legacy from how we were managing Redshift, where your Airflow task is simply running one SQL query, and that's it, and the next step is running another SQL query. It's up to the analyst to figure out: I know that this parent or upstream task created this table, so I can use those columns. It's very much independent in that way. I don't think we've ventured too much into how to propagate parameters across tasks; Airflow itself has lent itself to having tasks be very independent of one another.
But, yeah, it's certainly an interesting idea. I think for us, what we've seen is that the notebook environment is very flexible, and analysts can describe what kind of data they want to propagate. If there's something shared, they can put it in a shared notebook and use it across several of the other steps in their Airflow DAG. But we've really left it up to them to figure out what suits their use case, because we're talking about a data catalog of something like 6,000 databases and almost 60,000 to 70,000 tables. So we really try to let the analysts figure out what's best for their use case, because for us it's really about managing a variety of datasets and different types of systems at scale.
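For listeners unfamiliar with the Papermill pattern raised in the question, here is a minimal sketch of parameterizing a notebook run and feeding one run's output into the next. The notebook names, parameters, and the use of the companion scrapbook library are illustrative assumptions, not anything YipitData uses.

```python
# Hypothetical Papermill flow: execute a notebook with injected parameters,
# then pass a value it produced into the next notebook.
import papermill as pm
import scrapbook as sb  # companion library often used to record notebook outputs

# Run the first notebook with parameters injected into its "parameters" cell.
pm.execute_notebook(
    "extract_orders.ipynb",
    "output/extract_orders_2021-01-01.ipynb",
    parameters={"run_date": "2021-01-01", "source_table": "raw.orders"},
)

# Read a value the first notebook recorded (e.g. via sb.glue("row_count", n))
# and feed it to the next notebook as a parameter.
nb = sb.read_notebook("output/extract_orders_2021-01-01.ipynb")
row_count = nb.scraps["row_count"].data

pm.execute_notebook(
    "validate_orders.ipynb",
    "output/validate_orders_2021-01-01.ipynb",
    parameters={"expected_rows": row_count},
)
```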
[00:44:44] Unknown:
With the analysts being the people who are driving the majority of these different data pipelines, what is your overall guiding principle for building the systems in a way that is approachable to analysts and easy to onboard newcomers to the company? And how much of the underlying capacity of the system do you feel needs to be exposed, or ends up getting exposed, in terms of the guarantees offered for the processing capabilities, or the edge cases they might need to be aware of and where things might go wrong with their workflows, in order to produce the analyses that they're driving?
[00:45:23] Unknown:
For us, it really came down to what is most important for the analysts to be able to do. Because they were using Airflow in a pre-Databricks world, they were already used to some basic types of inputs they knew they had to supply: the transformation, the destination of the derived data, meaning what table it goes to, and the dependency graph. So porting that over to Databricks made it very easy to keep that world, and for the most part those inputs haven't really changed. I don't think analysts want more complexity. There have certainly been features that we've added over time.
But really it's been about what's important to get their job done. If a decision needs to be in front of the analysts, then we make it apparent in the platform. But we really do try to be thoughtful about what we expose, because we don't want to give too much. Airflow specifically is an example of a really flexible system that can be used in a lot of different contexts, and for us, we want a very high-level, abstracted interface that lets them focus on the transformations they need. Because it is notebook-based, they can run any kind of arbitrary Python and SQL inside those notebooks, so they have quite a degree of customization and flexibility. The one thing we watch for is repeating patterns: if we see some Python code replicated across several notebooks and several DAGs, then we start to think about whether we can abstract it away and put it in a more templated form so they don't have to write it every time; a sketch of that kind of shared helper follows below. But that's typically been the iteration that we've done in Airflow.
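As an illustration of the "repeated pattern gets abstracted" point, here is a hypothetical example of the kind of shared helper a platform team might factor out once the same boilerplate shows up in several analyst notebooks. The function name, table-naming convention, and write options are invented for the sketch, which assumes a PySpark notebook environment with an active SparkSession.

```python
# Hypothetical shared helper, of the kind that might live in a common notebook
# or package once several analyst notebooks repeat the same
# "write my derived table" boilerplate. Names and conventions are invented.
from pyspark.sql import DataFrame, SparkSession


def publish_table(df: DataFrame, database: str, table: str, partition_by=None) -> str:
    """Write a derived DataFrame to the shared data lake with consistent defaults."""
    full_name = f"{database}.{table}"
    writer = df.write.mode("overwrite").format("parquet")
    if partition_by:
        writer = writer.partitionBy(partition_by)
    # saveAsTable registers the result in the shared metastore so other
    # analysts (and downstream DAG steps) can query it by name.
    writer.saveAsTable(full_name)
    return full_name


# Inside an analyst notebook, the transformation stays front and center:
spark = SparkSession.builder.getOrCreate()
orders = spark.sql(
    "SELECT merchant, SUM(amount) AS revenue FROM receipts.orders GROUP BY merchant"
)
publish_table(orders, database="derived_receipts", table="merchant_revenue")
```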
[00:46:49] Unknown:
I just want to add something a little bit out of left field, but I think it's relevant. One of the things we have focused on, and I think Andrew covered this a little bit, is a really strong culture of ownership. For our analysts, and we have 60-plus at this point, we really want them to feel they have full control. They are responsible for the revenue they bring in with their products, so they decide how they want to focus their time and which products they want to build. In terms of the transformations they're defining or the web scraping systems they're creating, they're going to own any value they create out of collecting and analyzing that data. We're also going to fully allocate the costs they generate based on the projects they create. So we try to educate them on the platform we've built and the tools we have: here's how you use it. But then it's up to them. We really don't want to build too many templates and force them to work a certain way. We just want to provide a straightforward platform that they understand how to use. They know how it's going to generate cost, they understand whether a process they created is valuable or not, and then it's up to them to generate profit from that. It's worked really well. And what's been great is that we hire so many different types of data specialists, data analysts, and research analysts.
It's been great to watch them grow their different skills, and they can decide to spend their time the way they want to without being forced down a particular path.
[00:48:14] Unknown:
In your experience of building out this diverse set of tooling to empower analysts to be the driving force for the actual data life cycles, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building out that platform and tooling and growing the organizational structure that has gotten you to where you are today?
[00:48:36] Unknown:
There's a lifetime of learning, at least 9 or 10 years of it so far. A lot of it is, like you said before, that ownership is very important, which also means you're going to have to give some of it up as an engineer. There are times when I'm helping someone debug a particular piece of SQL, and maybe the SQL looks really bad to me, and I think, oh, you shouldn't really do it this way. I will help them refactor it if they need help; I'll help them optimize queries if they need help. But ownership goes both ways. It allows them to build these systems without your interference, but it also means you shouldn't be interfering with them. You're giving them the tools to understand the whole framework, the whole world. So, as Bobby touched on previously, they get to see how their revenue comes in from the products they've sold. But right next to that, they also see the costs allocated to them for their storage, their compute time, their web scraping time, and all of that. So they can see: my product is making, say, $2,000,000 a year, but I'm incurring $1,600,000 in tech costs; is there a way I can optimize this? So if you opened up our codebase looking for the most beautiful, optimized SQL written by the best DBA, you're probably not going to find it, because, one, the stuff changes very often, and, two, it is much more effective for our business for the analyst to be able to make changes quickly and understand the changes they're making without needing to interact with an engineer, than it is to have the most optimal, fast-running, low-cost queries. Obviously there's work we can do in the background: we can negotiate better costs, optimize spot usage, and things like that that we don't need them to see.
But a lot of it is really understanding what the analyst's core goal is and what information they need to be successful, and then giving them the tools or the information so they can make those decisions themselves.
[00:50:19] Unknown:
Being able to surface the cost information, I think, is very interesting. I'm curious what your approach has been to building out that cost allocation capability, building out metering to surface that information to the analysts, and showing which stages of the pipeline are contributing how much to the overall cost.
[00:50:40] Unknown:
Yeah, this has been a passion project of mine. I care deeply about cost allocation for exactly the reason I was giving around ownership: the best way to optimize tech cost is to find out what you actually need and what you don't. So rather than hire an army of data engineers to run around chasing the analysts and optimizing queries they wrote in the past, I'd rather tell the analysts how much it costs, and then they can decide: do I need to run that query? Can I run it once a week instead of once a day? That's been the end goal, to expose that information. In terms of how we've exposed it, we have a few different sources of tech costs. AWS is the core cloud platform that we use, and they have fantastic cost and usage reporting you can set up, so we have hourly data dumps of billions and billions of rows in Parquet files that tell you exactly which services you've used and how much cost they've generated. You already brought up metering; for our ReadyPipe platform, that's exactly what we do. We figure out how many gigabyte-hours and vCPU-hours you're responsible for, and then we calculate an internal metric called RPUs, ReadyPipe units. And then we allocate all of the related costs pro rata to all of the different projects that have been created on that platform.
Then we do a similar thing on the Databricks side, where we generate a lot of cost just for overall data analysis and compute, so we go through a similar process of figuring out who's responsible for those costs and allocating them accordingly. But it really comes down to data collection as the first step: you have to get all the granular cost and usage data. Luckily, platforms like AWS and Databricks do a good job of giving you that data. A lot of companies probably don't spend much time on it; they might have an accounting team looking at it here and there. What we really tried to do is build a data pipeline. Taking all of the lessons we've learned from building data products, we built an internal data product: take all of this raw cost and usage data and build up a series of transformations to allocate it across our organization.
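To make the pro-rata idea concrete, here is a small sketch of the kind of allocation step such a pipeline might contain: meter usage per project, compute each project's share, and spread the platform's total bill across projects accordingly. The input tables, column names, and RPU weighting are hypothetical, not YipitData's actual definitions.

```python
# Hypothetical pro-rata cost allocation step in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Metered usage per project: gigabyte-hours and vCPU-hours from the platform.
usage = spark.table("metering.project_usage")  # columns: billing_date, project, gb_hours, vcpu_hours

# Combine the raw meters into a single internal unit (weights are made up).
usage = usage.withColumn("rpus", F.col("vcpu_hours") * 1.0 + F.col("gb_hours") * 0.1)

# Total platform cost per day, e.g. derived from the AWS cost and usage
# report filtered to the platform's resources.
platform_cost = spark.table("billing.platform_daily_cost")  # columns: billing_date, total_cost

# Each project's share of the day's RPUs times the day's total cost.
daily_totals = usage.groupBy("billing_date").agg(F.sum("rpus").alias("total_rpus"))

allocated = (
    usage.join(daily_totals, "billing_date")
    .join(platform_cost, "billing_date")
    .withColumn("allocated_cost", F.col("rpus") / F.col("total_rpus") * F.col("total_cost"))
    .select("billing_date", "project", "rpus", "allocated_cost")
)

# Publish the allocation so dashboards and finance can query it by project.
allocated.write.mode("overwrite").saveAsTable("billing.project_allocated_cost")
```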
And then we expose that information through dashboards and through tables that teams can use. We've even trained our internal finance team to use Databricks, and they're doing a lot of traditional business intelligence work there, writing SQL and even PySpark in some cases. That's another really cool thing we've seen happen: we want to take the lessons learned on the product side and move them to other operational teams, so you don't need to work only in Excel and Google Sheets. You can train people who work in finance or sales operations to use Databricks and do real data analysis.
And it's just as impactful as it is on the product side, which has been really kind of amazing to see over the last year.
[00:53:34] Unknown:
For anybody who is responsible for managing or leading a single group of engineers or an overall data team, what are some of the pieces of advice that you have about how to think about the organizational and technical aspects of managing the life cycle of data projects, based on the experience that you've gained from working at YipitData?
[00:53:54] Unknown:
So I think, of course, we keep coming back to the same theme: it's all about understanding the goals the person is trying to accomplish and the ownership you can give them, and then trying to take the pieces that don't fit with that and get them out of the way. If you're running a team of data engineers and data scientists, your data scientists should never be worrying about getting the right version of Python or Java so they can run the platform they're on. You should be focusing on things like: I know they need a tool to monitor which experiments they're running and the differences between them; if they don't have one, you're going to help them get one set up or implement one. The most important thing for the business is getting to the results; it's not necessarily about using the best technology.
A decent technology now that allows people to work unimpeded is way better than the best technology three months or a year from now when you finally get it implemented. We certainly can't say that every decision we've made is perfect in a vacuum, or that we would make the exact same decision now, but it is pushing along those lines of making sure that people can do the work they need to, and that the person who cares about the result is the same person who's actually able to do the work. I think for data engineering teams at more traditional companies, it's a fine line between data engineering being a team you hand off tickets to, saying, oh, we want to transfer this thing, versus having your data engineering team be the one helping empower and build out these systems, helping the people who are actually looking at the data further down do that analysis themselves, while data engineering shapes the meta-tooling around it. It's always going to be more effective to train an analyst with enough engineering to be effective than to get an engineer to be really interested in analysis, because those pull in two different directions.
And we give a little too much weight to saying, oh, engineering is too hard for the analyst. To be honest, it is just a matter of getting them over the hump of learning how to program and how to automate, and it really opens up a lot of opportunities for them. The joy you'll see when you show someone how to schedule a job automatically, so they don't have to wake up at 5 in the morning for it to be done by 8 AM, is just the tip of the iceberg in getting people to feel empowered by these systems. It's much more about thinking through how to make them more effective and get the painful parts out of the way, so they can do even greater things on top of that, as opposed to spending all their time focused on one particular tool but being less efficient overall.
[00:56:22] Unknown:
Are there any other aspects of the work that you're doing at YipitData, or the platforms that you're building, or the organizational structure and the benefits that you've seen from it, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:56:34] Unknown:
I guess a question for you. Having us talk about our experience at Yipit, we're a smaller company, and one of the nice things about the data we work with is that when you're working primarily with web-scraped data, you don't have to worry as much about everything being so precise. A lot of the work our analysts are doing is trying to get insights from data, seeing inflection points and general trends; it doesn't always have to be perfect. Because of that, we've been able to give a lot of power to the analysts and trust that they're going to do what they can to make their data products as good as possible, and we don't worry so much about losing some precision here and there. So I'm curious, with your experience talking to so many other companies, whether there are any lessons learned for larger companies. Because if you talk to a data engineer working at a huge organization, it doesn't seem so easy to just cede that control.
There's more of a need for a pure data engineering team that writes all this Scala code, and then another data science team that takes that output and adds another layer on top. So it's a little harder to remove that friction. I'm just curious to hear your thoughts, as someone who talks to a lot of different data engineering organizations.
[00:57:52] Unknown:
Yeah. From my experience of talking to people, and the overall trends I've seen both in my own work and from running the podcast, the key point is really understanding what the actual interface is between the engineering team and the data scientists or data analysts. In your case, the interface is the platform layer that lets the analysts perform the web scraping and own it all the way through to delivery. For another team that is larger, has more complex data needs, and is trying to build sophisticated machine learning models, that interface might be the feature store, a piece of tooling that's becoming more prevalent now. There, the data engineers are responsible for integrating the data sources, providing streaming and batch interfaces and access to the data lake, and exposing that to the feature store, where the machine learning engineers and data scientists can create features from those sources, share them, and own their workflow from there forward, including building the models and bringing them into production. So there's always a point where there's that one interface that works well between the data engineering team and the data analysts or data scientists. The really key piece is figuring out, for your own team and your own use case, what that interface is and how you can optimize it so that it reduces the overall friction between the teams.
[00:59:21] Unknown:
That makes a lot of sense. Yeah.
[00:59:23] Unknown:
Is there anything else that we didn't discuss that anyone thinks that we should cover before we close out?
[00:59:28] Unknown:
Nothing is coming to mind right now. I'd like to thank you for having us on. This was very enjoyable, and we're certainly always happy to chat with more people about this stuff, because we have a particular perspective based on how our company grew up. So we're always looking to hear different ideas and different ways of doing things. It's very easy to understand how Netflix is doing something because you see a lot of their presentations, but it's harder to get that same information from companies that are more your size or in different positions.
[00:59:53] Unknown:
Well, for anybody who does want to get in touch and follow along with the work that you are doing or share some of their experiences of building out similar products, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. If you wanna start, Andrew.
[01:00:15] Unknown:
I think it gets back to your question about being able to trace the lineage of data: both to split it and experiment with it, and also to see where it's coming from. With these medium and large datasets, it's very easy to add a column saying, hey, this is where I started from. But especially as you start building more and more complicated ETL pipelines, it's a lot harder to say at the end of it that the reason this data point looks the way it does is because of this initial one, pulling back the curtain and seeing where it came from. That's helpful for understanding the value of the datasets you're pulling in, or even where mistakes entered the process. And right now, at least for us, we have to look at it with a relatively poor set of tooling, because we're just looking at rows in tables and there's no greater construct around that. It's something we're certainly interested in figuring out a better way to do, and it seems to be a very popular topic right now in the data engineering world: how do I understand where errors were introduced and how this data got the way it is, while still having the freedom to define free-form transformations with essentially the power of SQL or PySpark-style transformation languages, which don't necessarily have a great way of expressing this?
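As a concrete version of the "add a column saying where I started from" approach and its limits, here is a small sketch of tagging rows with their source at ingestion time. The table names and the provenance column are hypothetical; the point is that the tag survives row-level transformations but gets lost once rows are aggregated or joined across sources, which is exactly the gap described above.

```python
# Hypothetical row-level provenance tagging in PySpark. Table and column
# names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tag each raw dataset with where it came from at ingestion time.
scraped = spark.table("raw.scraped_orders").withColumn("source_dataset", F.lit("readypipe_scrape"))
vendor = spark.table("raw.vendor_orders").withColumn("source_dataset", F.lit("vendor_feed"))

orders = scraped.unionByName(vendor)

# The tag survives row-level transformations...
cleaned = orders.filter(F.col("amount") > 0)

# ...but an aggregation collapses many source rows into one output row, and
# per-row provenance disappears unless you carry it along explicitly.
trends = cleaned.groupBy("merchant").agg(
    F.sum("amount").alias("revenue"),
    F.collect_set("source_dataset").alias("contributing_sources"),  # one partial workaround
)
```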
[01:01:32] Unknown:
Anup, what are the gaps that you see?
[01:01:34] Unknown:
Yeah. I think a big gap that I see is really around the problem of distributing data, especially distributing data externally. We've definitely experienced this when working with various third-party data vendors, as well as with our recent push into delivering data to clients more directly. There isn't really an excellent way to distribute data. There are all kinds of considerations: the size of the data, the format of the data, how the client's interface for ingesting that data is set up, the permissioning. And on top of the things Andrew touched on, if there's an error in a dataset that's being delivered externally, how do you propagate that fix across your various clients? We've definitely seen a lot of different patterns across vendors, but it's always been a struggle. I don't think there is one clear-cut way to do this reliably, securely, and efficiently, so I think there's a lot more to be done on how you actually propagate data across different organizations.
[01:02:35] Unknown:
And, Bobby, how about you?
[01:02:37] Unknown:
I spend a lot of my time working with third-party data, working with different companies and figuring out how we're going to get data from your data lake to my data lake. There's no good way to do it right now; everyone has their own preferred way to distribute that data. With every provider I work with, it seems like we end up at a place where there's a data engineer sitting somewhere writing a script that dumps Parquet files into an S3 bucket, and when that's your setup, things inevitably go wrong. We've seen companies like Fivetran pop up that have done a really good job of making it easier to set up data ingestion for things like SaaS platforms. But when you're dealing with other data vendors, there's just not that same level of uniform convention you can rely on to agree on how we're going to get the data from you to us. That's something we're still struggling with, I would say.
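For readers who haven't lived this, here is a hypothetical version of the ad-hoc delivery script pattern being described: exporting a table as Parquet and copying it into a client's bucket on a schedule. The bucket names, prefixes, and table are invented; it assumes Spark and boto3 with credentials that can read the staging bucket and write to the client's.

```python
# Hypothetical ad-hoc vendor-to-client delivery script: dump Parquet files to
# an S3 bucket in whatever layout this particular client requires.
import datetime

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
run_date = datetime.date.today().isoformat()

# Write the export as Parquet into a staging prefix in our own bucket first.
export = spark.table("derived_receipts.merchant_revenue")
export.write.mode("overwrite").parquet(f"s3://our-staging-bucket/exports/merchant_revenue/{run_date}/")

# Then copy the files into the client's bucket under their expected prefix.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="our-staging-bucket", Prefix=f"exports/merchant_revenue/{run_date}/"):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket="client-delivery-bucket",
            Key=f"deliveries/{run_date}/{obj['Key'].split('/')[-1]}",
            CopySource={"Bucket": "our-staging-bucket", "Key": obj["Key"]},
        )
```

Every client ends up with a slightly different version of this script, which is the lack of uniform convention the answer is pointing at.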
[01:03:31] Unknown:
Well, thank you all for taking the time today to join me and discuss the work that you've been doing at YipitData. It's definitely a very interesting approach to managing the lifecycle of data products, with a lot of interesting tools and challenges that you've been facing. So thank you for all the time and effort that you've put into that and for taking the time to share it with all of us, and I hope you each enjoy the rest of your day. Thanks. You too. Thanks very much, Tobias. Yeah. Thank you very much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introductions
Andrew Gross on Data Management at Yipit Data
Bobby Muldoon's Journey to Data Engineering
Anup Segu's Introduction to Data Management
Overview of Yipit Data's Evolution
Team Structure and Analyst Empowerment
Handling Data Quality and Schema Management
Data Lineage and Metadata Management
Platform Architecture and Evolution
Guiding Principles for Building Analyst-Friendly Systems
Lessons Learned and Challenges
Final Thoughts and Questions
Biggest Gaps in Data Management Tooling