Summary
AirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide the tools and methods to a broader audience. In this episode several alumni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Lindsay Pettingill, Chetan Sharma, Swaroop Jagadish, Maxime Beauchemin, and Nick Handel about the lessons that they learned in their time at AirBnB and how they are carrying that forward to their respective companies
Interview
- Introduction
- How did you get involved in the area of data management?
- You all worked at AirBnB in similar time frames and then went on to found data-focused companies that are finding success in their respective categories. Do you consider it an outgrowth of the specific company culture/work involved or a curiosity of the moment in time for the data industry that led you each in that direction?
- What are the elements of AirBnB’s data culture that you feel were done right?
- What do you see as the critical decisions/inflection points in the company’s growth that led you down that path?
- Every journey has its detours and dead-ends. What are the mistakes that were made (individual and collective) that were most instructive for you?
- What about that experience and other experiences led you each to go your respective directions with data startups?
- Was your motivation to start a company addressing the work that you did at AirBnB due to the desire to build on existing success, or the need to fix a nagging frustration?
- What are the critical lessons for data teams that you are focused on teaching to engineers inside and outside your company?
- What are your predictions for the next 5 years of data?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on translating your experiences at AirBnB into successful products?
Contact Info
- Lindsay
- @lpettingill on Twitter
- Chetan
- @chesharma87 on Twitter
- Maxime
- mistercrunch on GitHub
- @mistercrunch on Twitter
- Swaroop
- swaroopjagadish on GitHub
- @arudis on Twitter
- Nick
- @NicholasHandel on Twitter
- nhandel on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Iggy
- Eppo
- Acryl
- DataHub
- Preset
- Superset
- Airflow
- Transform
- Deutsche Bank
- Ubisoft
- BlackRock
- Kafka
- Pinot
- Stata
- R
- Knowledge-Repo
- AirBnB Almond Flour Cookie Recipe
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health status. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet.
Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's
[00:01:47] Unknown:
S-I-F-F-L-E-T. Your host is Tobias Macey. And today, I'm interviewing Lindsay Pettingill, Chetan Sharma, Swaroop Jagadish, Maxime Beauchemin, and Nick Handel about the lessons that they learned in their time at Airbnb and how they're carrying that forward to their respective companies. So, Lindsay, can you start by introducing yourself?
[00:02:05] Unknown:
Sure. My name is Lindsay Pettingill, CEO and founder of a company called Iggy. I was at Airbnb for 5 years as a data scientist and started Iggy a little over 2 years ago, where we focus on helping companies build great products with location and geospatial data.
[00:02:24] Unknown:
And, Che, how about yourself?
[00:02:27] Unknown:
Yeah. I'm Che Sharma. I'm the CEO of Eppo. We're an experimentation framework that's built for the modern data stack. And Swaroop?
[00:02:35] Unknown:
Hey. I'm cofounder and CTO at Acryl Data. We are driving the open source DataHub project and, you know, the larger ecosystem around data governance and data observability and so on. Before this, I used to lead data platform and search infrastructure at Airbnb.
[00:02:51] Unknown:
And Max?
[00:02:52] Unknown:
Hey. I'm Max. I'm CEO and founder at a company called Preset. We're commercializing Apache Superset, which is an open source project that I started while I was at Airbnb in 2015, I believe. I worked with these fine folks here on the call today back then and also started Apache Airflow back then. So that was back in 2014. And it's Nick.
[00:03:18] Unknown:
Hey. I'm Nick. Sorry. Bit of a cold. But I'm CEO, cofounder. Spent about 4 years at Airbnb, 2014 to 2018, so overlapped with this fine group as a data scientist and then a product manager on data tools. And Transform is a metric store that enables teams to define metrics and consume them consistently across a variety of data tools.
[00:03:40] Unknown:
And going back to the beginning, Lindsay, do you remember how you first got introduced to working in data?
[00:03:45] Unknown:
Oh, man. For sure. Yeah. I was just reflecting on this. Back in 2003, I did an internship at Deutsche Bank in Frankfurt, and they had bought a number of, or merged with, some Italian banks. And I was on a team that was responsible for essentially merging in a bunch of employee level data. And there was a big learning curve. Let's just say that. The only tool I had at my disposal was Excel. But that's really where my journey started. As soon as I started working on that project, I was just, you know, fascinated by data and picked up a lot of skills thereafter. But that was where I got my first taste.
[00:04:25] Unknown:
And, Che, do you remember how you got started in data?
[00:04:28] Unknown:
Yeah. Absolutely. I studied engineering and stats undergrad and masters, and then I worked in health care policy for a little bit of time, but most of my formative career is at Airbnb with these folks. So that was where I got to do things like machine learning and setting up a bunch of analytics infra and, you know, building all sorts of data tools.
[00:04:44] Unknown:
Swaroop, how did you get started in data?
[00:04:47] Unknown:
It was a while ago, but I was involved in writing data pipelines for handling all the search advertising data at Yahoo back in the day when Yahoo was still good at search. You know, joining large amounts of search and click events and enabling fraud detection and those types of use cases on top of all these massive amounts of data, back when Hadoop didn't exist, and we built a system that was a precursor to Hadoop. Then I went on to build a source-of-truth database at LinkedIn, then the foundation for, you know, data at Airbnb back when data culture was still being formed. That's been my journey.
[00:05:20] Unknown:
So I didn't know you wrote data pipelines at any point. I think you've been good at hiding that from all of us. Yeah. Yeah. If you let anyone know that you've built data pipelines before, they're like, there you go. Now go build more. Go fix the pipelines. Go be a plumber. But, yeah, I also was at Yahoo at some point, but I go further back than that. So I started at a company called Ubisoft, a little gaming company, and they were building their first data warehouse in about, like, 2000, 2001.
And that made me a data warehouse architect and business intelligence engineer. The stack was the Microsoft, like, BI stack at the time. So that was Microsoft SQL Server 2000 and a little bit of Hyperion Essbase, Analysis Services. So kinda first gen BI. And then, you know, I joined Yahoo, which was the birthplace of Hadoop, a little bit later, maybe after Swaroop. 2007, 2008 was when, like, the Hadoop story was starting to come together. The pre-Hadoop world was even worse than Hadoop, if you can imagine that. You know? There was something called Mina and then these really bad distributed, like, data lake databases. Mina is what I worked on. Yeah. It was, like, practically unqueryable, and we were using Perl at the time. And then I went to Facebook, where I think, like, for me, that's where my career really started. Like, Facebook in 2012 was really a golden era. I think a really fun place to be, and where a lot of what we see in the modern data stack today, kinda like, some of these ideas were already kinda bubbling internally there at Facebook back then.
[00:07:00] Unknown:
Yeah. Distributed Perl just sounds like a foot gun that points both ways. And, Nick, how did you get started in data?
[00:07:04] Unknown:
Didn't really know exactly what I wanted to do to start my career, but I studied math. And so I went into macroeconomic research, joined BlackRock, and spent a few years there before I moved over to Airbnb. It was kind of interesting because I feel like I got, like, 3 generations of data stacks in kind of very rapid succession, going from BlackRock to Airbnb and then kind of now working in this modern data stack world. But, yeah, Airbnb was really when my career kind of formed. And originally, I was a data scientist, then moved over to product, working on data tools.
[00:07:43] Unknown:
And so as you've each mentioned, you all had some overlap in your time at Airbnb and have then gone on to found your own respective companies. And I'm wondering whether you think the fact that you went from Airbnb into launching, you know, new businesses and focusing on your respective areas of the data ecosystem, is that an outgrowth of the fact that there was a good amount of kind of data oriented culture and work environment and curiosity that you wanted to explore further, or was it more an outgrowth of the fact that you felt a particular pain in your time at Airbnb and you wanted to be able to solve it for other people?
[00:08:24] Unknown:
My story, John, is, like, very, very simple. Right? I started 2 open source projects, and the VCs were like, oh, these projects are really taking off. And then commercial open source companies are successful, so they kinda summed it all up and were like, dude, would you like to start a company? I was like, I don't think so. But then, you know, I spoke with a set of founders at the time and got convinced that it was the right thing to do. So really, for me, it was the open source traction that was the kicker.
[00:08:50] Unknown:
I think in general, there are just a lot more data tools companies out there. The ecosystem has really blossomed and, you know, there's a bunch of secular trends that have led to it. So I think part of it is the timing, which is just, you know, we all left Airbnb sometime in 2017, 2018, and the moment was ripe. I think there was also something to Airbnb itself, which is, you know, we really, you know, prized entrepreneurialism. I think Brian Chesky, for a while, only hired ex founders as PMs, to, you know, kinda talk about how far that went. And the other thing is that I think from a data standpoint, Airbnb just had a lot in common with a lot of the companies that were gonna kind of end up being big data consumers, in that we had different types of data. So we had event streams, which is the type of thing that the Facebooks were really built on. And then we also had transactional tables that were the root of a lot of DAGs.
And so we had purchases as the main backbone of metrics, which is the way a lot of other companies are operating. And so I think the kind of entrepreneurial culture, the nature of the data, and just the timing kind of led to a lot of people going into entrepreneurship.
[00:09:56] Unknown:
But I'm curious if you all agree with that same take. Yeah. There are, I think, a couple of factors. 1 is seeing how quickly things got done at Airbnb for the most important problems and delivering that business value, and seeing how it can be replicated outside of Airbnb. But I was also kind of the last person standing at Airbnb after all these folks left. So I saw a bunch of things that didn't necessarily go well at Airbnb too, specifically when they tried going public and after COVID hit. So in terms of the weak spots, I essentially saw that data practitioners were having to deal with all kinds of questions related to SOX compliance, you know, IPO readiness, worrying about cloud cost efficiency, and these types of things. I saw a clear opportunity where practitioners should not be dealing with these types of questions. And in spite of all the great progress they've made, there's still a lot to be done in this space. That's the reason I got started. We reacted to that in different ways. For me, I was like, oh, all the IPO stuff and the compliance stuff is coming. I gotta get out of here. Yeah. Then,
[00:11:05] Unknown:
part of it was like, oh, there's probably a business idea in all this SOC 2 and compliance stuff.
[00:11:09] Unknown:
Yeah. I think there are 90-something founders that came out of Airbnb. And that's just, like, the number from, you know, somebody who aggregated a list, like, a year ago. So it's probably more at this point. So I think at the core, it was a very entrepreneurial company. I think the thing for me was that I went to a startup after Airbnb. And I think 1 of the things I saw was that Airbnb's data stack was actually very similar to the way that the kind of modern data stack works today. But, obviously, like, a whole different set of tools, a ton of stuff built in house. But just the way that the company was working, I think, was actually pretty similar to the way that the modern data stack works today. And so in some ways, I think it felt, you know, a little more natural for some of us as we transitioned to trying to start companies in kind of the broader ecosystem.
[00:11:59] Unknown:
Yeah. Another piece, like, I'm curious if you've seen the same things. But in retrospect, Airbnb was pretty successful at centralizing a lot of data workflows. When I now look at these much larger companies, even if they have investments in things like experimentation, it's like every team's got their own thing. You know, we had a little bit of that with machine learning, which is pretty hard to centralize at Airbnb. But in terms of analytics workflows, like, I think we did a pretty good job of getting people on a centralized, vetted, kinda governed set of systems. And, you know, ultimately, that's what you need to do a data tools company: generalizability.
[00:12:33] Unknown:
I think it's interesting that you call out the separation of the analytics and the ML workflows, because another company that has spawned a lot of different companies and businesses out of it is Uber, and a lot of those have been focused more on the ML side of the house, which is obviously a very kind of core competency that they invested a lot of time and energy in. And I'm wondering how you view some of the kind of outgrowth of, you know, some of these large tech organizations and how they impact the broader ecosystem around them, both because of the work that they do internally and share through various sort of blogs and conferences, but also through the people that learn and then leave those companies and start new ventures such as yourselves. And, you know, a lot of you are very kind of analytics oriented. Just how that kind of core company culture then creates those kind of ripples throughout the ecosystem.
[00:13:28] Unknown:
This is something I've definitely noticed: each set of alumni kinda has their set of, you know, values and baggage. Like, in particular, Uber is an interesting case because, you know, for the experimentation thing, they weren't able to really centralize metric definitions for a long time, which was a core part of how the Airbnb system was built. And so, you know, they ended up with a bring your own metric system in experimentation, where it's kinda embraced that everyone's gonna define things slightly differently, which is like anathema to how kinda Airbnb operates. And, you know, from Facebook, I've noticed everyone kinda, like, people love Scuba. You know, that system seems like it really was, and they're like, why can't I have Scuba everywhere? And I don't know if other alumni have as much appreciation for that type of tool.
[00:14:03] Unknown:
I think, like, people get the opportunity, at these fast growing companies, to, like, rebuild something from scratch that they might have seen at a previous company. Right? So maybe you come from LinkedIn and you saw, like, some little project that was done kind of well, but not perfectly, and you're like, okay, now we're gonna need something like this here and I get to rebuild it. So that was a little bit part of my story. Something interesting about, like, the Airbnb and Uber relationship: I remember, like, many of us went to share with Uber and they would never come to us because, I think, of, like, NDAs or, like, I don't know, there was like some legal restriction where it was generally the Airbnb folks that would go to Uber to, like, you know, have, like, a sharing session, and it was, like, really often, like, unidirectional. And we'd come back and be like, yeah, that's weird. Like, they asked me a bunch of questions and took notes and spun me around and sent me back home. I don't know. That was my impression. Like, very, very different cultures between the 2 companies, but I guess in both cases, there's a lot of similarities too. Right? Like, in terms of, like, building a lot of stuff from scratch and engineering driven culture, strong at data.
I think fundamentally, too, Uber was just more real time, so much more real time than Airbnb, and much more, like, inference in real time too. So a lot of, like, ML systems that need to be real time. So that maybe defined where they invested and the kind of companies that came out of there.
[00:15:30] Unknown:
In my case, I've seen this happen at 2 companies so far. Like LinkedIn, obviously: Kafka and Confluent, and Apache Pinot and StarTree. There's a rich history there. Still, it happened at Airbnb. There's a, you know, a certain amount of fearlessness that you need when you take on big hairy problems. And there seems to be funding and fearlessness for these types of things in these companies because it's so kind of tied to business outcomes. Sometimes you can take it too far and start to say this is how the entire industry needs to solve all problems.
That may or may not be the case every time. So we just have to balance our, you know, kind of fearlessness in tackling some of these problems with what the mainstream needs sometimes. I would say that's the only counterbalance, maybe.
[00:16:18] Unknown:
Yeah. I think that that's definitely something worth building on, the idea that there are sort of 2 things there, where 1 is because you're working at these large organizations where you're running into problems that other people aren't liable to encounter because of the scale, and the fact that because it's a large company, you have the available engineering expertise and the resources to invest in doing these research projects and figuring out how do we solve these problems using engineering. That's a way to actually acquire the necessary lessons to then go on and be able to say, okay. Now I'm going to start over because I already know all the things I need to know that I didn't know when I started the other project, so I can do it differently now. But then also the potential for maybe overengineering things where you say, that's how I had to do it at LinkedIn, at Airbnb, at Uber, whatever it is. And so now I'm going to go and build that for everybody, but I'm building it with the experience that I had at that large organization. So I don't necessarily know how well that maps to, you know, these small start ups or companies that aren't necessarily purely software and tech oriented. I'm just curious what some of the experiences that you've all had as founders going into that space have been of, I learned this lesson the hard way. Now I'm going to, you know, save everybody else that pain, but then realizing that they're not encountering the pain the same way that you did because of their different circumstances.
[00:17:39] Unknown:
Yeah. I think 1 thing you can tell from this crowd is a lot of us did multiple rodeos here. And so, you know, you saw the kinda end of the book, and then you also went to other start ups and started from the beginning again. And I think that kinda breeds a certain empathy for people at different stages of maturity. For me, it was very telling to go from Airbnb to these kind of series A, series B startups with, like, 1 or 2 data people trying to bootstrap, like, very complex workflows off of off-the-shelf tools and limited headcount.
[00:18:09] Unknown:
Yeah. I would say that there's definitely a big challenge around kind of generalizing. Even if you've gone to a few companies, like, trying to understand more generally what the problems are that the market faces, it's a huge challenge. Right? It's like you can't go to enough companies, you can't work in enough companies to get enough perspective. And so I think that there's the generalization piece, but there's also just the kind of shift of building tools inside of a company and building tools in a kind of competitive market. Right, where there's a lot of dynamics that you have to be able to respond to.
[00:18:43] Unknown:
That's probably some of the biggest lessons that I've learned. There's a huge difference for me. Yeah. Between, like, building an internal tool. Right? Like, then you just, like, talk to your neighbors and it's like people are very accessible and, like, you see what they're doing on the product you're building live, or they're sitting next to you and the feedback loops are super tight. And then there's, like, the generalization issue and the quality issues of when you, you know, actually sell your product to a broad set of people and it needs to be tight and solid and high quality and tested, and it needs to work for everyone.
It's a lot harder to build in that context. It's a lot slower. It's not necessarily harder. It takes more resources to make a step forward when you're carrying, you know, 100 customers on your shoulders than if it's just, like, your buddies that are sitting next to you and you're approving each other's pull requests.
[00:19:33] Unknown:
I was having some technical challenges to share in that. I think 1 of the differences, like, between me and some other folks here is, you know, I built something that didn't exist at Airbnb. Right? Like, that hadn't been built. You know, I mean, my team are still trying to will it into being. And I think 1 of the challenges that is both associated with that and associated with being at a company like Airbnb where, you know, all of us worked on and built data tools, but the ends of them were very product focused, you know, for the most part. And 1 thing that I have, you know, been routinely shocked by, and it's no longer shocking, but, you know, it's kinda slow to accept some of the lessons that you encounter as a founder, is that the way that I thought about data, the way that our data science team, our data analytics team thought about data, thought about leveraging data, thought about leveraging tools for data, is not that common, not, like, normally distributed amongst companies. And so, like, it was a blessing, you know, an incredible blessing to, like, be a part of this team, you know, and take those lessons to build a company. However, it's, like, so challenging when you realize that, like, we joke at Iggy that, like, sometimes our customers aren't as creative as we expect them to be because we sometimes take for granted, you know, what it's like to be in an environment like Airbnb where, you know, again, leveraging data and tools to improve the product was, like, such a part of our being. So
[00:20:56] Unknown:
In that vein of kind of the culture being an important element of the ways that you're able to take advantage of data, I'm wondering what were the elements of the data culture at Airbnb that were done right and that were invested in appropriately. And what were some of the critical decisions and inflection points in the company's growth and their journey of how they collected and took advantage of data that led you down that path.
[00:21:25] Unknown:
A few things I'd like to talk about that are, like, kinda unique to Airbnb. Well, not unique, but, like, there was a huge investment in data. Like, they built a data team from scratch, and it went from, like, a handful of people to, like, more than a 100 people very quickly. So they invested in data infrastructure and data science and across the board, like, data engineering. So very heavy investment and a lot of freedom around, like, building custom things. Not every company can afford that too. So, you know, place that we were in that place of super high growth. But 1 thing I think that was interesting coming from Facebook and Yahoo and places like that is that the data science team was largely, like, brand new. I guess there was no, like, a 100 data scientists we could hire at the time, so the strategy was to make him. Right? So we're like, we're gonna make our own data people, and we hired probably out of the 100 data scientists that were there around that time, many were like PhD fresh out of school, super smart, but also like not classically trained in data, which brought both, like, a new set of eyes and new set of approaches, kind of fresh take on things.
And then which is really positive thing, but so maybe that that allowed us to to challenge and the the establishment and do things in a new way. But I think 1 thing we did really well that I think I haven't seen done as well in many places is the education component and the data university. So there's a blog post on it, but there was, like, definitely investment around teaching the data structures, the internal, like, datasets and the data tools to everyone internally. Like, teaching SQL to anyone who's interested to learn SQL, putting tools in front of them, data 101. I forgot what all the classes were, but they were, like, SQL 101, Data 101. I remember teaching Airflow and Superset classes, you know, weekly. There was just kind of this conveyor belt of, like, training everyone, pushing data literacy up in the organization, which was, like, super great. The company had very clear
[00:23:20] Unknown:
goals in terms of business goals and how it was growing. And even though the leadership was famously design led, the next layer was actually very, very data literate. I remember folks like, I don't know, Joe Bot, and there's just a few names that only the Airbnb folks will recognize. But the next layer of leadership was very data literate. And, you know, there were some strong product managers. There were some strong, you know, data scientists. The lead for all of data science was really influential in terms of, you know, influencing the executive team. So all that played a big role. And then there were a few key data products like search and pricing.
Those were inherently the magnets for, you know, strong experimentation, strong metrics-driven analysis, and bringing in the best machine learning folks. Some centers of excellence started to emerge, and very quickly, because of, you know, people coming from mature organizations, there was this focus on creating core data, which is, like, you know, meant to power all of these different outcomes. So I didn't see as many mistakes being made at Airbnb, compared to what I've seen in other organizations, when laying the foundations, because a bunch of us had seen it before, and we didn't allow for the same types of mistakes to happen. But it doesn't mean that we didn't get, you know, some things wrong. I'll get to that in a minute. But I think we kind of benefited from a lot of good things happening in a very compressed time period at a time when growth was, like, skyrocketing at Airbnb.
[00:24:54] Unknown:
Yeah. Completely. I feel like there's a huge org strategy element here, and I can go a little bit further back to before we made the big investment, because that wasn't always the case. And, you know, with the first 5 to 10 data scientists, back when there wasn't as much leadership investment in data, everyone was very good at communicating, like, very good at, you know, taking complex topics with really unsound foundations and still driving impact with them. And that really led to a lot of buy-in and seeing data teams as, like, a stakeholder that deserves to be, like, early in the strategic process rather than this sort of, like, lagging, get-me-a-dashboard type of function.
Once we started seeing a lot of success, such as on the search ranking team, you know, running these experiments, or, you know, kind of early supply growth liquidity stuff, that kind of opened the floodgates to say, like, we need data. You know, it's no longer just a triad of, like, engineering, design, product. Now it's gonna be a quartet at every level of the org, you know, and we're now gonna have leaders like a head of growth data, a head of marketplace data. That kind of let every team build it into their DNA that they're gonna have this very data-centric philosophy. And then the other thing we did, which I also don't see as much, is that despite being very, very embedded, we still were a force of centralization of these capabilities. You know, again, like you mentioned, core data. We all decided we'd have this common metric definition.
We're gonna have a common experiment infra and reporting method. You know, we're gonna have a general set of governance for data pipeline building and management that, you know, eventually turns into something like a metric layer. I usually see a lot of teams either, like, not embed at all and end up with just kind of, like, an R&D, ML focus,
[00:26:36] Unknown:
or they get fully fragmented and never make these centralized investments. I agree on that. Like, the good balance between the 2, right, the balance between, like, you know, the vertical pull and the horizontal pull, the sharing. Like, one thing too that was unique, I thought, from other companies I was at is, like, the culture of consensus that drove me crazy at the time, because, like, everyone's feelings had to be in check all the time and everyone had to agree. Like, there's, like, this, like, deep consensus culture, but I think it did create a place where everyone communicated, everyone was on board. The cost of that was a lot of time in meetings, like, hashing out details.
[00:27:09] Unknown:
You know, I was one of those data scientists that Max mentioned that was hired, first real job out of grad school. So I did a PhD in political science. I used Stata and a little bit of R. You know, no Python experience. I had barely written a SQL query. And, you know, one response to that could be, like, well, how could you possibly be hired as a data scientist? Right? The other response is, well, these are just tools that you can learn. Right? And I mention that because I remember when I joined, Airbnb had this CEO dashboard. And it wasn't just weekly metrics, it was queries behind the metrics. And I learned SQL by, you know, looking at those queries and trying to recreate them. And, you know, one thing that hasn't been mentioned is data repo. Right? Or knowledge repo, excuse me, at Airbnb was always one of my favorite tools. And one of the reasons why is that it just, like, exemplified to me this, like, curiosity- and sharing-driven approach to, like, our product and data, whereby, you know, in knowledge repo, you would, you know, commit code and sort of tell a narrative, back to Che's point about communication. And it was an incredible way to, you know, share what you were working on, but in, like, a very lightweight way, and, like, get feedback from folks. And I don't know. It just really exemplifies to me this attitude we had around the dissemination and sharing of knowledge. I don't know if they're still using it. But, you know, as someone who was new to tech and new to sort of data at the scale that we had at Airbnb, it was, like, an indispensable tool that really, like, welcomed me and so many other people to,
[00:28:50] Unknown:
you know, different ways of thinking and doing data work. I think one of the things that the data team did really well to get buy-in from the rest of the organization was build tools that got them more involved. And I think a lot of it happened really in 2014. And, Che, I'm actually curious, because you saw the before of this and I kind of came in right in the middle. But, like, February of 2014-ish was core data, which was, hey, we've got tons of messy datasets everywhere. Let's make a core set of datasets that, like, anyone who can learn SQL can then go query. And that kinda led to this SQL education program called Sqool, S-Q-O-O-L, which I always thought was clever. And then I think it was around, like, July or August was ERF. People were building experiments, but this was this experiment reporting framework that allowed you to define metrics and kind of automated the data pipelines.
And that led to Minerva and a bunch of the stuff that Airbnb talks about now, which I think really kind of lends itself well to data accessibility and just getting lots and lots of people involved in the data. And then I think the next big one was Airflow. And, Max, you know the date better than I do, but, like, I don't know, October, November of 2014-ish. Right? '14. Yeah. Right around that time. Yeah. And, like, that took what was, I think, mostly a data engineering job at that point of kind of going and building a bunch of tables. And data scientists got involved every once in a while. Yeah. Like, Lindsay, we would write cron jobs or something before that, but it was such a mess and they would... Chronos. That's right. And they would fail all the time and nothing would ever work. And then Airflow basically brought this entire group of people into that kind of data construction process. And so I think between those, you know, you had data scientists going and building better, cleaner tables for consumption.
And then you had good, clean metrics and an experimentation interface that, I mean, everyone on the product team... Like, when the product team is 1,000 people, like, everyone would go and look at that interface. And so I think it was just really, really getting, you know, the kind of product managers,
[00:31:03] Unknown:
the designers, the engineers, like, you know, the ops teams, like, getting everyone involved in the data that kind of justified more and more buy-in that led to more and better tools. Experimentation is super, super key there. Like, I saw the transformation from not being experimentation driven. I think when I first joined, there was, like, maybe 60 things in ERF that didn't work. Basically, the experimentation framework, where we did A/B tests. And then we kinda unplugged that drain, and over that period of 2015, we became, like, super experimentation driven, which completely shifts how you approach product development, or if not all product development, then, like, product iteration.
And then that became really central to, like, really core to a lot of the data work and, like, justifying all this investment in data, because all of a sudden we could, like, prove, you know, how we're impacting product
[00:31:59] Unknown:
and engagement and growth. So we talked about a few of the trends that kinda came together there. You know, what's kinda interesting, bringing it up to today, is that, like, now people have Airflow, and they've got dbt, and they have these, like, super elastic warehouses. So a bunch of those things that held us back from, you know, laying solid foundations are now just completely solved. But one of the things that came out of these investments in experimentation, which is kind of part of why I was excited to start a company, is that experimentation starts to get you down this path of good database modeling.
And I think a lot of what core data did was say, let's take a lot of these concepts and make them atomic facts that can be aggregated in various ways. Let's have some kind of governance model around dimensions and say, like, what are we gonna split by? And when you look at a lot of these companies out there, like, I don't think they put as much effort into kind of extensible, reusable database modeling in the way that the core data effort did, which ultimately led to all sorts of things blossoming.
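The core data idea described above, atomic facts plus a governed list of dimensions to split by, can be sketched in a few lines of Python. This is only an illustrative sketch, not Airbnb's implementation: the `ALLOWED_DIMENSIONS` set, the `aggregate` function, and the sample booking rows are all hypothetical names made up for the example.

```python
from collections import defaultdict

# Hypothetical governed list of dimensions that metrics may be split by.
ALLOWED_DIMENSIONS = {"country", "platform"}


def aggregate(facts, measure, dimensions):
    """Sum one measure over atomic fact rows, grouped by governed dimensions."""
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if unknown:
        # Governance step: reject splits nobody has agreed on.
        raise ValueError(f"ungoverned dimensions: {sorted(unknown)}")
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# Hypothetical atomic facts: one row per booking event.
facts = [
    {"country": "US", "platform": "web", "bookings": 1},
    {"country": "US", "platform": "ios", "bookings": 1},
    {"country": "FR", "platform": "web", "bookings": 1},
]

by_country = aggregate(facts, "bookings", ["country"])
by_country_platform = aggregate(facts, "bookings", ["country", "platform"])
```

Because the facts stay atomic, the same rows can be rolled up by country, by platform, or by both, and the dimension check is where a governance model would plug in.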
[00:32:58] Unknown:
I see some of it. Yeah. I think sometimes. Yeah. And maybe less and less. This is my point about how Airbnb's data ecosystem is not that different. Yeah. Like, one of the main differences was that we just built a bunch of tools internally, but they actually closely resemble, or in the case of Airflow, like, are the tools that everyone else uses. Right? And so I think that that's part of the reason why, you know, when we all went out and started looking at starting companies, we could just look around and see the ecosystem. It was familiar. Right? It wasn't like we were in a totally different world. And I think that a lot of big companies exist in a totally different world than the rest of the ecosystem.
[00:33:38] Unknown:
Yeah. I think that that's definitely worth highlighting where, to your point, a lot of folks who I talk to who come from Google, they say, oh, at Google, everything was amazing. And then I came out of Google, and now I have to rebuild the whole stack that I was used to at Google before I could get anything else done. Similar stories, I think, in places like Facebook, where after a certain point, the world that you're in is so far divorced from the world that everybody else has to deal with that you can't work in the same way. So when you do go on to a different company or try to start a new business, you're trying to solve problems in a way that nobody else can recognize, and so it adds extra friction. So for all of you coming from Airbnb, to your point, it already resembled the way that everybody else was progressing as far as how they address data. So there wasn't that massive culture shock or the massive kind of relearning of how to do things that you had to undergo before you could start building on top of it.
[00:34:32] Unknown:
One idea I'm trying to push recently is, or not my idea, but there's, like, this idea of data native companies. Right? So nowadays, you start a company... So, like, the companies we all started here, you know, the day of our launch, or even before that when we launched our beta product, we already had, you know, really good analytics. At least, like, in our case, we had a data warehouse, we had good metrics and dimensions, we were logging everything. Short of, like, experimentation, we're doing a lot of the things that Airbnb did at the time, which is kind of fantastic. Right? Airbnb was not a data native company. It had to, like, use, like, blood, sweat, and tears to, like, you know, build the infrastructure that we needed to be data driven. But now that stuff is so accessible.
Right? It's cheap. I can sign up for BigQuery, Fivetran, dbt Cloud, Preset, and all of each other's, like, lovely companies we have here, and get started and have, like, a stack that is comparable in terms of what you can accomplish, you know, to Airbnb's data stack, which is amazing. That's huge progress. And now what do people do? I don't know. They need to go and embed other things or, like, be really good at using these tools as opposed to, like, making the tools.
[00:35:43] Unknown:
Yeah. Now everybody spends their time integrating all those different tools in the layers and trying to turn the modern data stack back into a vertical solution.
[00:35:51] Unknown:
Complain about what's wrong with Airflow, you know, and why it is not perfect.
[00:35:55] Unknown:
Plenty of that going around. That's, like, mostly what I feel like the data team at Airbnb ended up doing anyways. Right? Like, we had all these different tools and were figuring out how to stitch them all together.
[00:36:07] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
[00:37:11] Unknown:
And one of the most informative lessons that anybody can learn is when they start making mistakes. And there are all these things that have gone right at Airbnb that have, you know, given you each the lessons that you're building on top of and the tools that you're building in your own companies. But what were some of the mistakes or dead ends that you each made, either individually or collectively, that were most instructive for you and that you carry forward into the work that you're doing today?
[00:37:42] Unknown:
I wanna kind of call out maybe 2 or 3 things in terms of what I saw were not necessarily great things at Airbnb in terms of how we did data. One was, as the company scaled and we went from a simple business to multiple business units and so on, our foundations didn't really hold up as well. And, you know, people talk about decentralization and business units owning their data. We started to experience some of that early, and I don't think we navigated that really well. We talked about core data earlier; that kind of started creaking at the edges once business units started forming.
We were forced to actually push some parts of the foundational datasets into business units because that's where the funding came from for data engineers. And you see that happening in the industry today. Like, the funding for analyst headcount, the funding for data engineers comes closer to the line of business, not so much the central teams. The central teams are very lean now, and we didn't quite navigate that piece really well. What we should have probably done is lay out standards and frameworks that allow the line of business folks to own datasets according to governed standards. People talk about data mesh and all these things these days. It's all very buzzworthy, but the bottom line is I think as an industry, we're still struggling with that problem, and we had that problem. That was one. And I saw an unnecessary forking of analytics and machine learning at Airbnb.
The metrics standardization effort mostly powered analytics and experimentation, and the ML folks went off and did their own, you know, feature repository, and, you know, there was, like, a reinvention happening at every layer there. I don't know. Nick is smiling away here. But... Yeah. I've got some thoughts on this one. Yeah. Yeah. Well, some of it was necessary. Don't get me wrong. We had to do what we had to do before
[00:39:39] Unknown:
realizing the impact. I don't think it was necessary.
[00:39:42] Unknown:
Well, fair enough.
[00:39:44] Unknown:
It's just, when you have a legacy system and somebody's telling you to build, you know, a new thing for ML, people don't wanna change or merge ideas.
[00:39:51] Unknown:
Yeah. I think strong first principles thinking on the ML side was missing. We were stronger on the analytics side. And I would say, finally, there was a come-to-Jesus moment when the IPO happened, in terms of the amount of panic that set in due to poor data quality for all the top metrics that were being, you know, closely monitored, and it felt like we had somehow completely missed a trick there in terms of making sure data quality standards are being enforced right from the start. We got to a much better state eventually, but there was a lot of push and there was a lot of blood, sweat, and tears. So those 3 things stand out to me in terms of where we clearly missed a trick.
[00:40:36] Unknown:
Yeah. I mean, I think the biggest problem that Airbnb had around machine learning was that there were 3 core applications that were just absolutely critical to the company's success. There was pricing, search, and then there was kind of trust and safety applications of ML. And that business couldn't succeed without those applications being built out early. And there was no centralization of the way that those different teams built their infrastructure. And so they built huge amounts of infrastructure and grew to be, like, 10-, 15-person teams, each supporting their own infrastructure. And then the organization said, you know what? We have a 4th application for ML. It's not just a 4th. We've got, like, you know, 100 4th applications for ML, and we need to build centralized infrastructure for all of those other ones. But at the same time, there was not buy-in from all of the, like, 3 main applications to merge onto centralized infrastructure.
And so you kind of just ended up with, like, a 4th version of ML infrastructure within the company. And then related to, like, I don't know, the feature store, metric store thing is really interesting. They are very different, and you probably could have merged a lot of ideas between them. There are different requirements for each of those systems, and we didn't end up merging them because our metrics applications at that point, you know, that was, like, 2018 and the metrics application started in 2014. And so we were, like, 4 years down this path, and trying to merge those things was just impossible
[00:42:07] Unknown:
at that point. That's a really good, like, question, though. Like, should the feature store and the metric store kinda be built off of the same, you know, or have a lot of shared parts? It's a really interesting question that is unrelated to the discussion here. But
[00:42:22] Unknown:
Yeah. You should have heard my pitch for Transform in 2019. I tried this idea of this derived data repository. And I think that I'm actually, you know, glad that I didn't work on that because I think that that is such a broad set of, like, requirements that I don't know that you could really build that as, like, a commercially viable piece of, like, software. Fundamentally,
[00:42:44] Unknown:
Also, like, data scientists won't agree. You know, I don't care about the same thing as, like, the BI world. The data science world maybe needs to be independent, or a different set of people care about different things and have their... Honestly, I think it's the VC dollars
[00:42:58] Unknown:
which are forcing people to pick hyper-specialized problems, whereas, you know, a little bit more leeway, a little bit more runway can actually allow us to build more generalized solutions. I mean, I think it may not be the way these things are computed. I think those can certainly be different for ML and analytics applications. But there are lots of common things that unfortunately have to be repeated, like how you handle PII data, how you handle governance. I mean, like, all of this is fully replicated in all these stacks. So personas may be different, but the need for good data modeling and the need for traceability is something that I hear from ML practitioners also.
Because a lot of the data quality problems that they experience end up being due to some external dataset that the company is buying, not necessarily
[00:43:56] Unknown:
in how the features are being offered. So I do think there's a case for some consolidation there. Yeah. I'll talk about some of the problems from Airbnb, and they're just, like, they're related to the approach we had. So they're kind of the flip side of the coin of decentralizing and democratizing access to data and access to tools. Right? So by building things like Airflow and Superset and ERF, the experimentation framework, we allowed, like, hundreds of people internally to go and write data pipelines, create visualizations, create dashboards, create experiments.
That's super empowering, but that leads to a lot of chaos and garbage. If it's really easy to do something, then you do it, and it leaves a lot of, call them data assets, that are not super well managed. So if you're gonna enable people to do this, you need to have an equal effort, or at least, like, an effort that's gonna catch up, in terms of garbage collection, figuring out what's not being used and actively, like, deleting and, you know, kind of curating these assets. So we have this data graph that's growing, growing, growing, then the leaves of this data graph are unused, but everything else inside the data graph, like, looks like it's used because it's used by things that are unused.
So we worked a little bit on that problem at Facebook, which is, like, trying to figure out what's on the edge of the data graph and actively deleting these things, like sending email to people automatically. And if you don't click this button, your asset will be destroyed. Or here's a PR deleting the thing that you created that doesn't seem to be used. And that's a challenging problem. What happens with that is, as you get down this road, you have, like, hundreds of thousands of tables that you don't know where they're at, where they're coming from, or, you know, who created them and why. And you just end up with, like, more crap than actual, like, sanity in your data graph. I think that's a problem that's common to large companies that invest a lot in data, and still a problem to solve. I don't know if it's been solved. Was it solved after I left Airbnb, Swaroop, or is it still a problem today? I can't imagine any large company solving that. Do you mind if I... I got a text message from a software engineer at Airbnb who's like, I'm still deleting your tables.
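The data graph problem described above, where inner tables look used only because unused leaves read from them, can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the tooling used at Airbnb or Facebook: `find_dead_assets`, the edge list, and the table names are hypothetical. The idea is to start from assets with real consumption (for example, a dashboard people actually open) and walk upstream; anything never reached is a deletion candidate.

```python
from collections import defaultdict


def find_dead_assets(edges, directly_used):
    """Return assets with no real consumer, directly or transitively.

    edges: iterable of (upstream, downstream) dependency pairs.
    directly_used: assets that people or products actually consume.
    """
    upstream_of = defaultdict(set)
    assets = set(directly_used)
    for up, down in edges:
        upstream_of[down].add(up)
        assets.update((up, down))

    # Walk upstream from every genuinely used asset and mark it as live.
    live = set()
    stack = list(directly_used)
    while stack:
        node = stack.pop()
        if node in live:
            continue
        live.add(node)
        stack.extend(upstream_of[node])

    # Everything else only feeds unused assets, so it is a candidate to delete.
    return assets - live


# Hypothetical lineage: tmp_a looks "used" only because unused tmp_b reads it.
edges = [
    ("raw_events", "core_bookings"),
    ("core_bookings", "ceo_dashboard"),
    ("raw_events", "tmp_a"),
    ("tmp_a", "tmp_b"),
]
dead = find_dead_assets(edges, directly_used={"ceo_dashboard"})
```

In practice the hard part is the input: deciding what counts as directly used requires query logs or dashboard view counts, which this sketch simply takes as given.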
[00:46:07] Unknown:
So, I don't know. It hasn't been solved yet. That was, like, the downside and the upside. Right? I mean, everyone was creating all these things. I mean, there were so many Airflow DAGs. Like, there were so many people creating so much stuff, and some of it was incredibly useful and, like, you know, but also so much of it still exists. So that, like, governance piece of remembering to delete stuff was probably one of the biggest things that we got wrong. I would agree with that. I wanna add 2 other things that haven't been mentioned.
[00:46:35] Unknown:
The first is short term versus long term. And, Che, obviously, really curious to hear your thoughts here, but this is in the same vein as some previous comments. Like, I have this blog post where I sort of talk about moving from, like, you know, 100 experiments a week to, like, 700 or something. Right? Obviously, very empowering, as Max was suggesting around data pipelines and things. But I think if we're being fair, certainly, my experience on the growth side of things was, like, there was a time, and I hope it's changed, but there certainly was a time at Airbnb when it was just, like, local maxima after local maxima every single week. Right? Like, there was such an incentive to experiment and, you know, not iterate, but really just, like, move on to the next win. Right? Identify a win, move on. I think there are still, like, tentacles or, like, implications of this in terms of product. Right? Like, historically, at least on the teams that I was on, and I was on both the growth side and the host side of the marketplace, you know, holdouts were rarely done at Airbnb. Search may have been different. Che, you know, marketplace team, love to hear kinda counterexamples.
But on the growth team, holdouts were like anathema. No one was running holdouts. And, man, it was frustrating. It was super frustrating because it always came back down to, well, who's gonna maintain, you know, the holdout? And then sometimes there were holdouts and they were accidental and they became these, like, awesome sort of natural experiments, you know? Yeah. Because, like, so many things trended to 0. So the second thing, which is more organizational, and I can't remember who was still there at this time, but I think it was either Che or Max who mentioned, you know, when we were all working together, the data science team was centralized and there was a really strong identity.
There was incredible work. Right? Like, all the projects that we're talking about today came out of this time period where the data team was centralized. And I think it was 2018, 2019, you know, we went through some leadership changes, and we used to have a data, you know, a leader pretty high up in the org, and, you know, things got decentralized after that. And the data work has never been the same at Airbnb. If, you know, we're being honest, like, thinking about, you know, whether it's blog posts or projects or just, like, you know, the sort of stature of the team in terms of the valley, it really changed after that and, you know, it's kinda heartbreaking in some ways. But I don't know. I think that we had a pretty good thing that, for whatever reason, you know, shifted, and, as should be obvious, it still makes me sad. Well, my theory, it's all about the cookies. Right? So they used to bake fresh cookies every day and take them down with a little cart and give, you know, every Airbnb
[00:49:16] Unknown:
employee who wanted a cookie could get a fresh cookie, a freshly baked cookie, and then the moment that stopped, that's where the innovation, like, came to... Those almond flour cookies. Oh my god. Vegan
[00:49:35] Unknown:
pudding. Don't know if you guys remember the avocado pudding. I don't even know what avocado pudding is, but it tasted like chocolate pudding.
[00:49:42] Unknown:
But what you brought up, Lindsay, on a more serious note, is actually the problem that modern companies face. We might have fantastic tools now, but there is no appetite for strong central teams now with 25 people. There's just no funding in any company. And that's why, you know, you're seeing business units funding, you know, analyst headcount, data science headcount, machine learning, and the emphasis on data modeling or uniform ways of doing it, that's just going away. I think the central teams are more in the business of spinning up Snowflake, spinning up a bunch of these things, connecting it up, and, you know, enabling 50 to 100 stakeholders, and they often don't have the empathy for what is happening day to day with data practitioners.
So I think that shift from centralization to decentralization is actually a pretty strong one. I don't think as an industry we have a solve for it yet. Like, centralized might not be the right
[00:50:42] Unknown:
word exactly. Right? Because it was embedded, but there was, like, a centralized ethos, right, of... Correct. Like, we're on the data team. We go to the data team meeting. We all, you know... I think that the interesting thing was there were these kind of, like, more central data teams. Right? Or, like, more central parts of the data org, like the data infrastructure team, for example, where there were data scientists on that team, and their job was to go and talk to all the different data scientists around the company and, you know, bring back all of the different ways that people were doing things. And then say, hey, you know, we need to consolidate this. Like, these people are doing this in a really great way. Let's show the rest of the company how to do that. But it wasn't like, oh, I'm a floating data scientist, and I happen to be working on search this week because that's the, you know, team that got the prioritization. Right? It was this hybrid model that I think was really effective.
But I think we lost a lot of that feeling, and it was as big as, like, 70 or 80 people on the team when we would all get together in the same room and, you know, do presentations and, like, talk to each other about what was going well and what wasn't.
[00:51:47] Unknown:
Yeah. It was probably obvious, but, like, that opportunity. Right? Like, again, there were other people like me who didn't come from industry, had no idea what they were doing. Right? And I only worked with, you know, the same team as Nick. Like, we all met through that centralization. Right? Like, I was able to approach Che, whose work I had read about or heard about, you know, in knowledge repo, because, oh, it turns out, like, we're in this meeting together, and it's, like, not as intimidating to be, like, hey, Che, like, you wrote this really cool knowledge post. I'd love to learn from you. Right? I think there's a bigger trend here in terms of, like, remote work, which I'm largely a fan of. But, like, I actually think there are, like, really important psychological and social functions, right, that bringing people together on a regular basis plays. You know, it wasn't mandated that any of us, like, hang out together. But it was those kind of opportunities which allowed us to, I think, all learn from each other and then, you know, build off of those learnings and connections.
[00:52:48] Unknown:
I think that the whole idea of kind of aggregate identity and its impact on the work that people do is probably worthy of its own episode, if not a series thereof.
[00:52:59] Unknown:
1 of the other things kind of on that note that Airbnb got wrong, and I think Lindsay and Swaroop, you both kind of mentioned this a little bit. But at some point, we went from originally, everyone was just called data scientists. And I think that was because, like, I don't know. Who was it? Like, Bloomberg or something said it was the sexiest job of, like, 2013 or something like that. And so we were like, oh, okay. You know, the title for everyone who does data is data scientist. I guess data engineers also. But originally, the whole kind of data team is data scientists. And then at some point, we had this, like, rubric that said all the different things that you were good at. Right? And that was like the way we did performance reviews. And then we started splitting the team into the things that people were really good at. And so we had, like, analytics.
We had, like, data vis focused people. We had ML focused people. We had inference focused people. And it was in their titles, so the rest of the company would look at them and it would say data scientist inference. And it's like, oh, they must work on experimentation and that's it. You know, you have people joining from PhD programs from, And it just kind of boxed them in and said like, hey, this is the job that you do. And so you should spike on this component of this, like, rubric in your performance review. And that's how you'll, you know, progress as a data scientist, you know, with whatever kind of specialization you're in. And I think that that shifting from the generalist mindset to the very specialized one boxed a lot of people in, where you'd have to go through a whole kind of career shift if you wanted to move from, you know, inference to ML. And you'd have to, like, get manager approval and a different manager would have to, you know, bring you onto their team. Yeah. That one really, really got me. That was, like, right when I switched off the data team actually differently.
Because I was like, I wanna do all this stuff.
[00:54:55] Unknown:
Yeah. Yeah. Conway's law is real. I mean, like, how you structure the data org matters a lot. I mean, I have now seen so many data teams that have, like, analytics under one house and data engineering under another. And I feel like it never ends well for a growth stage where you're laying down foundations. Like, if you're trying to get an end to end data system to work well, you're gonna have dependencies all the way upstream to data generation and then all the way downstream to driving a decision.
[00:55:21] Unknown:
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star's data discovery platform solves that out of the box with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker or whatever you are using and Select Star will set everything up in just a few hours. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
[00:55:58] Unknown:
Kind of bringing us towards the end of the conversation. As you look forward from where you are now after spending some time reflecting on where you've been, what are some of the critical lessons that you're focused on relaying to your own teams at the companies that you're building and data teams that you interact with? And what are some of the predictions that you have for the next 5 years of data? Why don't we go around? And so I'll start with you, Swaroop.
[00:56:24] Unknown:
Sure. I think I've been saying this throughout the conversation. I think the data practitioners are having to deal with too many things today instead of just getting to their main job, which is to deliver business value. You know, they're having to worry about data quality, how to think about cloud cost efficiency, how to think about compliance, standardizing definitions. It's just too much, I think. So this cannot go on with this much fragmentation and not really having that layer, which is not invasive. I think the jury is out now. No one wants kind of intrusive control over everything. It has to sit on top of this and allow all these tools to really interoperate really well.
I think that's something that will unlock a lot of value, especially as this decentralization happens. That to me is inevitable. It's gonna happen in the next 2 years. Obviously, I have a point of view. I think metadata-driven everything is the way to go here: metadata-driven orchestration, metadata-driven governance, express governance as code. So people who are doing the work just declare intent and something else takes care of things. So that's one big shift that is happening and will probably bear fruit in the next 2 years or so. Over a slightly longer term, I feel like we still have to navigate the gap between the business terminology and the technical terminology.
You know, in the past, when the stack was simpler, you used to build semantic layers to deal with all this. And now I think people have moved away from it, although things like LookML and all those things and, of course, makers try to solve that problem too. But I feel like that layer helping business users really navigate all this complexity, and I think this promise of the semantic layer has been touted for far too long. It's time to deliver on it, I think. Otherwise, I feel like the budgets will actually start shrinking if you don't start delivering on the value. Right now, it's like a free-for-all, lots of promise. But I think in about 3 years' time, if you don't bridge that gap, I think there'll be trouble.
So I think a good semantic layer is likely to emerge in the next 3 plus years. What that will be exactly, I don't know. But I think a nice translation from that business lingo to the underlying physical stuff is, I think, the other big shift I see happening in the next 3 to 4 years.
[00:59:01] Unknown:
I think maybe 2 of the biggest lessons that I have are and the first core value at Transform is lead with empathy. And I think that that was just something that was really, really kind of just a default at Airbnb. I think almost to a fault at times. Like, at times, you know, people struggled to give feedback or criticism because they felt like it wasn't empathetic. And, you know, I think that there's a better way of doing that. Right? Like, it can be empathetic to give feedback and to kind of try and help somebody grow. But I think that the data team really adopted that at Airbnb. And it was, you know, part of the reason why so many people chose to learn SQL because they knew that all of the data scientists around them would actually help them to answer the questions that they wanted to ask. So there were, yeah, a lot of lessons around, you know, empathy and just generally, like, Airbnb was an amazing place to work. Like, you know, the cookie cart was real.
And so I think that's one. And then I think the other one was, you know, as I reflect on all of the different tools that we built, a lot of it was just around kind of pushing accessibility for who gets to kind of do each job down to, like, you know, people who previously were not empowered to do that. And so I think just generally, like, lessons around kind of promoting data accessibility and that being then the mechanism that led to the rest of the company getting really bought into data. And that, you know, then kind of leading to bigger, better tools and a lot of the kind of things that people have heard about Airbnb's data team. Now a lot of that, like, came from investment because business users found the team to be so valuable.
And then, you know, I guess following up on the stuff Swaroop was saying, like, a lot of that is, you know, what I'm trying to bring to Transform, and making data accessible, I think, is the core purpose of a semantic layer. You know, mapping the semantics of the underlying data warehouse to something that the rest of the business can understand, I think, is, you know, the path to data accessibility. And so that's really what, you know, MetricFlow, the open source project that we're working on, is centered around.
[01:01:11] Unknown:
Yeah. To the first question, it's funny, similar to Nick, but one of my favorite core values from Airbnb was be a host. And, you know, be a host literally means opening up your home, but it can also mean, as a data scientist, you know, commenting your code, writing UDFs, turning, you know, things into functions, and making them available. And I've kinda changed the wording around a little bit, but it's become a core value at Iggy as well that, you know, impacts our product development in that, you know, the way that we build product is we start from the I call it, like, inside out. We start from, like, solving our own problems, you know, and using our own product. And if we're running into problems and our customers are as well, that mentality of, like, being a host and, you know, being empathetic is certainly one I've brought to Iggy and I'm very proud of. In terms of hopefully, I'm remembering the question correctly, but, you know, in terms of the future, some stuff that I'm obviously really excited about, not just building product, but in terms of data. A lot of, you know, the tools that we've talked about, a lot of energy over the past, you know, 10 years in data has gone to thinking about data internal to the organization and, you know, at Airbnb, whether that's logs or experimentation or whatnot. You know, I get really excited about turning outward. Right? So, again, inside out, like, not just in terms of how we build our product, but, you know, Iggy, at the end of the day, it's a platform that gives access to various tools and data products. And, you know, those data products allow companies to build better product. And in order to build better product with Iggy, you've gotta be thinking about external data. And so there's bigger trends here in terms of, like, open data, open data access, and just, like, yeah, so much more data being generated. I'm really excited to see folks, you know, again, leverage it for the products that they're building.
[01:03:00] Unknown:
So I think, Tobias, I've been doing predictions on your show for more than 5 years now, so we should go back to my predictions from more than 5 years ago and see how wrong I was. Instead of kind of rehashing the same stuff that I've been hashing and, like, the ideas I've been putting forward, I'll try to, like, go deeper into one. That's not necessarily new, but I think, like, what's top of mind for me is this idea of analytics coming everywhere. You know, I think, like, analytics have been caught in BI tools and in, like, notebooks and in, like, specialized tools, right? So a little bit like jailed, analytics are jailed into, like, specialized tools like Tableau and Looker and other tools, and I think we want to see, like, every product become a data product and analytics kinda, yeah, getting out of that jail and becoming more pervasive in the products that we use every day. So we see that already. Right? I'm sure Airbnb's got a really good host, you know, listing dashboard they provide to compare you with other listings in your area. I think every single product out there is gonna have interactive charts and visualizations in context where you can take action. Right? Analytics are most useful and data is most useful with the context and with the actions that can be triggered from the insights. I think, like, that opens room for more. You know, at present, we do have, like, embedded analytics and we have, you know, APIs and things like that to help people kinda reassemble or bring analytics into their applications, but there's still a lot of work for everyone to do. So that basically you can get the primitives of a dashboard, have an interactive dashboard with all the cross filtering, the drill down, and the drill through type features, all the powerful features of visual interactive analytics and make it really easy for people to bring that in the products that they're building.
So that's, I think, a trend to see, like, the emergence of easy frameworks to do that, and, you know, we're obviously working on this stuff at Preset amongst other things. So it's like unpacking the BI tool into its components so you can kind of build your own, you know, reassemble or remix your own BI tool. Another thing I think we're all invested into is, like, the unbundling, I think, of BI and the platforms.
So historically, I think if you look at the Gartner report for data and business intelligence, the 2 axes that they put are completeness of vision and ability to execute, as if these 2 things correlate. The reality is that these things don't correlate. They inversely correlate. Right? If you're trying to do everything through a completeness of vision like Microsoft is trying to do with their BI stack, you're also unable to execute. And I think what we're gonna see is unbundling a lot of the data tools, data workflows into an ecosystem much like we have on the software engineering side of things where it's accepted that there's a whole set of languages, there's a whole set of tools, you know, we're not trying to find something that's gonna solve software engineering.
So, like, the unbundling of that and these kind of pieces of puzzle that fit together. So for, you know, DataHub to work with Superset or Transform, and for your semantic layer to work very well with all the BI tools and the next generation BI tools out there, for experimentation to, you know, weave into that ecosystem too. So we're gonna see an acceptance of that as these tools are gonna emerge and start playing well together.
[01:06:18] Unknown:
Alright. And, Che, you helped make all of this conversation happen, so I'll let you close this out.
[01:06:28] Unknown:
Yeah. Definitely. In terms of critical lessons for teams that I've taken internally and externally, I think a big thing for me was Airbnb's emphasis on design and customer empathy. And I think that was you know, I think we can all agree, Airbnb's design team is pretty unique in terms of the halo it wields internally and what the product looks like. You only have to look at the difference between Airbnb and Booking to see what that ends up being. And I think there's a lot of that to apply to the data ecosystem. You know, up until the kind of flourishing of this ecosystem, so much of this stuff was, like, internal tools built on Tableau or something like that. And it was pretty confusing to navigate unless you were the person who built that tool. And especially for experimentation, which involves this huge leap of faith where you're gonna, like, make pretty consequential decisions and promotions and stuff based upon this opaque black box statistical thing.
Like, design ends up being a big deal. And one of the things Airbnb did really well concurrently with the design was they would always show customer videos in front of the company over and over and over. Like, you know, it used to be like, you know, it's highly produced with, like, very sappy music and everything, and it was really meant for you to feel the emotions of this product. And so I think there's a lot that I take away from that. Like, as a data tools company, it's much harder because not everyone can just become a person running an experiment. Like, you can just go travel on Airbnb. But if you can just bring the emotions and build that into, like, a kinda minimal opinionated design, I think that has a lot of power in data.
And then in terms of where stuff's going, like, we're all kinda touching on this that, like, in the past 5 years, we've seen the foundational parts of data start to consolidate around things like Segment and Snowflake and dbt or whatever. And this means that people higher up on the stack can make a bunch of assumptions around what data teams look like and how they operate. And I kinda see that stuff continuing. I mean, I think we're all probably big believers in semantic layers and kind of what Nick and Transform are doing and what that would unleash. You know, I think we're kind of experimentation, especially we're all excited for that to continue. And I think what it's gonna look like concretely is that these data tasks that are exactly the same at every company, these applications, you know, are gonna be much better and higher quality and widely used because, you know, you can build better applications when you have stronger foundational assumptions below it. So I see data teams kind of continuing to have consolidation at the foundational level, flourishing at the application level, and then, you know, kind of other exploratory tools and kinda more ad hoc workflows, you know, notebook style, whatever, kinda continuing to grow.
[01:08:58] Unknown:
Alright. Well, thank you all very much for taking the time today to join me and share your experiences working at Airbnb. The lessons that you've learned there, both positive and negative, and how you're carrying that forward into your own companies and into the ecosystem. So appreciate all the time and energy that you're all putting into helping make data a better place to work. So I hope you each enjoy the rest of your days, and I look forward to speaking to all of you again soon.
[01:09:23] Unknown:
Thanks for hosting us. Yeah. Thanks, Tobias. Thanks, Tobias. Thank you.
[01:09:33] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and The Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Meet the Guests: Former Airbnb Data Experts
Guests' Journey into Data
Impact of Airbnb's Data Culture on New Ventures
Influence of Large Tech Companies on Data Ecosystems
Elements of Airbnb's Data Culture
Mistakes and Lessons Learned at Airbnb
Future Predictions and Lessons for New Data Teams