Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Go to www.dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. And to help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Maxime Beauchemin about what it means to be a data engineer. So, Maxime, could you please introduce yourself?
Maxime Beauchemin: Yeah. My name is Maxime Beauchemin. I think you did a fairly good job at pronouncing my name, which is pretty good. I'm a data engineer at Airbnb, and I'm kind of the main maintainer for Apache Airflow, which is kind of a distributed batch processing workflow engine, as well as Superset, which is a data visualization and exploration platform. Before Airbnb, I worked at Facebook as well as Yahoo and Ubisoft. I've been working with data for a very long time. I've been doing data engineering since way before the name "data engineering" actually existed.
Tobias: What is it about data management that first attracted you to the field?
Maxime: It's a very long time ago that I started to work with data. That was sometime around 2001 or 2002, and I was working at Ubisoft, and they started looking into analytics and building a data warehouse to kind of organize all of their company information, mostly like financial information, accounts receivable, accounts payable, cash flow. So, there was really kind of a financial aspect to their data warehouse and supply chain, and I kind of got dragged into this project because there was a need for it, and I just knew the right people at the right time and just started picking up a few books about data warehousing and started building data warehouses.
There was a different technology stack at the time. We were using Microsoft SQL Server, something called Hyperion Essbase, which is an OLAP database, and we were doing just -- I was writing a lot of SQL, a lot of stored procedures, and using some ETL tools at the time.
Tobias: How would you define the term "data engineering", and how has your definition of that term evolved in recent years as it becomes more of a recognized discipline?
Maxime: I feel like we almost coined the term "data engineer" while I was at Facebook in, I believe it was like around 2011, 2012. As I wrote in my blog post, I started at Facebook as a business intelligence engineer - that was my title - and then we started kind of realizing that we were really using very different tools, going away from traditional tools and ETL tools because there was just no tools out there that could manage the volume of data that Facebook had at the time.
So, we started building our own tools, we started kind of developing new processes, and at that point, we kind of renamed ourself and we changed the name of the team, and started opening a position to actually recruit for data engineers at the time, which was a fairly new term as to the definition of what a data engineer is or might be. I feel like that we're kind of a special breed of software engineers that are specialized in data infrastructure, data warehousing, data modeling, data crunching, and metadata management.
That is a fairly wide description, but what's important to note is we're basically software engineers with a deep focus on data.
Tobias: So, it's kind of analogous to the idea of a full-stack engineer, but applied to the particular realm of analytics and data?
Tobias: Yeah, I know that that's another term that has come into common use in recent years as the idea of what makes a software engineer a software engineer, has continued to evolve, and the complexity of our systems has continued to grow. So, the problem domain that one person needs to be capable of handling is growing larger as the expected delivery times are becoming shorter.
Maxime: Yes. On one side, there is some sort of specialization, but we also expect for people to be kind of wider and cover more and more fields or related specialties. So, it's really hard for someone that's very specialized and that knows only, kind of, one aspect of software engineering or data engineering to be useful at a company. It's always a balance of going deep in some areas, but kind of wide and being able to do kind of full stack as much as possible too.
One other factor is the idea that in data engineering, as in software engineering, the only constant is change. Things are changing very fast. The Hadoop ecosystem and the data landscape are definitely diverging still. We don't see a lot of convergence. There are more databases than there ever were. There are more pieces of technology, and platforms, and frameworks, and libraries. There's this explosion of knowledge and code that we're essentially trying to stay on top of as data engineers.
Tobias: And do you think that the dev ops movement that has sort of come to prominence over the past few years has had any impact on the discipline and the concept of data engineering as a whole, and I'm wondering if there are any particular kinds of crossover that you've seen whether as far as the philosophy or the tooling that's available.
Maxime: I think in terms of tooling, companies share all these pieces of infrastructure. So, when you talk dev ops, I'm not sure exactly which part of infrastructure you'd be talking about. But, let's say one component might be stuff like continuous integration, unit testing, automating the work of developers and data engineers are definitely just as interested in that as people in dev ops or any type of software engineering.
One place where we see kind of a historic divergence, and where things are starting to change a little bit, is the data that we work with. So, dev ops, or an ops group, will typically have an ODS, or an operational data store, where they'll have time series databases, or things like -- at Airbnb, we use something called Datadog. At Facebook, there's something called ODS that was all about kind of real-time metrics of machines and performance metrics of anything related to development.
On the other side, on the data engineering side, we've been typically very focused on the data warehouse itself. In a lot of cases, kind of slower systems, kind of 90 days analytics batch processes. Historically, it's been different kinds of database technologies for both sides. We're starting to see more and more data engineers getting outside of batch processing and getting more into real time, with technologies like Spark Streaming, technologies like Druid. Slowly, we see dev ops people using the same databases as data engineers.
One example of that, at Airbnb, would be the use of the Druid.io database, which is a real-time distributed column store that works very well for real-time analytics, or fast analytics, kind of fast-paced big scans, and being able to crunch a lot of data. So, both the people in dev ops and the people doing more classic analytics and data scientists would use this database.
Tobias: And for somebody who is working in data engineering, how much interface do they have to the actual managing of the underlying infrastructure versus just having an operational team who provides those servers and deploys the services, and the data management team is responsible for building the pipelines and tuning the actual services running on those instances?
Maxime: I think it's a factor of how big, perhaps, the company or the data team actually is. But, we definitely see, at some point in time, some sort of specialization where, at the very beginning, maybe as the first data engineer, or the first software engineer with a data focus at a company, your first task is going to be around setting up some infrastructure, getting things like Kafka, and Hadoop, and Spark up and running. If there's only a few data engineers in a team, then there's probably going to be a distribution of tasks where people will do a little bit of both, and over time, it will kind of shake out, in a certain way, that people will specialize. As to whether deploying, managing, and maintaining the data infrastructure is the role of the data engineer, I would argue that it's not the case in a bigger company, where other people that are, perhaps, a little bit more dev-ops'y, some people that are a little bit more focused on infrastructure, would take on those tasks, and it's a very different skill set in a lot of cases.
So, I think, over time, data engineers will tend to go and focus more on things like the data warehouse itself, and all the data plumbing, and data structures, and getting consensus as to how the data should be organized, and it's for the company.
Tobias: For somebody who is interested in getting started in the field of data engineering, what are some of the necessary skills that they should possess, and what are some of the most common backgrounds that you see those people coming from?
Maxime: There's different types of people. There's people, the dinosaurs like me, who came more from the business intelligence and the more traditional classic data warehousing field. But, if you're coming from that background, you're going to need to be able to code, or you're going to need to develop the skills around version control, writing code, and starting to change your processes so that your daily tasks look a little bit more like those of a classic software engineer.
So, there's definitely people who would have to kind of recycle their skills, and the people that are coming from there usually are very strong at things like data modeling, and ETL, and performance tuning, and it's typically people that know, very well, how databases work, perhaps people that were DBAs in a previous life.
Then, there's other people coming from other places, like just new grads out of computer science. There's people coming from the field of data science that realized that they're perhaps more interested in building persistent structures, right? People that were in data science, but realized that they're more interested in doing engineering-type work.
So, I would say patience is also probably a very important thing because a lot of the batch processing, data pipelines can be pretty cumbersome, and there's definitely a part of data engineering that is about kind of plumbing, data plumbing, and managing the pipes, and sometimes all hell breaks loose, and you got to get in there and fix this stuff.
That's one aspect of the job. Maybe, that's a little bit less glamorous, the plumber aspect. But then, there's all sorts of other aspects, like if you want to automate your work, there's data engineering as in any type of software engineering, there's always tons of opportunities to create abstractions and create tooling where you can automate, kind of, your own work as a data engineer, and then build, say, services, or systems or framework that do the things that you would have done manually as a data engineer before.
So, for the data engineers out there who are getting bored with the job, or that are thinking data engineering, there's a lot of data pipeline writing, and that's not very exciting, there's always an opportunity to take the things that are redundant, and build services and systems around it, and that's kind of what I've been doing over the past few years with Airflow that's been super, super rewarding.
Tobias: Yeah, in my other podcast, we had a nice long conversation about the work that you're doing in Airflow, and since then, I know that it's become an Apache-incubated project. I'm wondering if you can touch briefly on how it has evolved since the last time we spoke, when it had been using Celery as the execution layer. I don't know if that's still the case.
Maxime: It is still the case, and I can talk a little bit about the Apache Software Foundation, and how it helps a project grow, but it also really changes the dynamics of the project too. I've been through that for the first time recently, and Airflow started as an Airbnb project, and we had kind of full control over the project, and we would release whenever we were ready. At some point in time, we were interested in joining the Apache Software Foundation because there's a lot of companies that consider it kind of a risk to start running software that they don't have any guaranteed control over.
What the Apache Software Foundation does, as you start the incubation process, is first you have to donate your code, and your trademarks, and your intellectual property to the Apache Software Foundation, which is only going to make sure that people are not going to change their mind about sharing. So, from that point on, the Apache Software Foundation owns your code and the trademarks, and has some responsibility around some of the legal aspects of the project. If someone was to try to sue us because, say, the name Airflow has been used by some other company, or something like that, the Apache Software Foundation would help us, would probably manage this litigation.
This is as much as I know about the Apache Software Foundation, so maybe some of the things I'll say aren't 100% right. Hopefully, they mostly are. But, a part of the process, also, is the Apache Software Foundation makes sure there's different people from different companies involved jointly, and they create what they call a PMC, which is, I believe, a project management committee that has to be made of five or six, or perhaps more committers that have, kind of, the same level of involvement with the project.
In the case of, for example, I don't know the whole story with Apache Storm. But, at some point, Twitter had developed Storm, a lot of people were using Storm, and at some point, they were like, "Okay, we're done with Storm. Now, we're just going to build this other thing called Heron, and we're not interested in contributing any more code to Storm."
So, at that point, they gave the software to the Apache Software Foundation, and other people from other companies created a little committee and managed, jointly, the project together. From a developer's standpoint, it's also a pretty interesting thing because, right now, the work that I do, say, on Airflow is, if Airflow is owned by Airbnb, and something changes as to my employment status with Airbnb, I might not be able to work on Airflow anymore, and I might have to forfeit the project, or something of that nature.
But, with the Apache Software Foundation, I know that I'll always be an Airflow committer and have the same kind of relationship with the project, regardless of my employment, which is cool. I think also, there's just kind of a brand that's associated with it too. So, companies have a lot more trust into, say, installing or running an Apache Project than they would have with just running a piece of software that's developed at some other company.
Maybe a few more things about the Apache Software Foundation. Joining Apache has definitely kind of slowed down the pace of change for the project, and I think that's part of a maturing project to kind of slow down and focus on things like stability and just making sure you have like a solid product that works for everyone. So, the pace of development slows down a little bit, but the quality of the releases goes up. It's been kind of challenging for the Airflow project for us to come up with a release. I believe the last release was about a little more than six months ago, and that people in the community are working on the next release.
But, for different reasons. Now, it's like we have all sorts of code that comes from different companies, and that's all in that same melting pot, and we have to make it work for everyone, and that can be challenging, especially the first release. But, I'm hoping that, with time, and with the community, we're going to be able to come with like a steady release process pretty soon.
Tobias: Yeah, definitely seems that having the stamp of the Apache Software Foundation, it's a boon to a project, and like you said, it does connote a certain amount of maturity and stability. It's interesting how many of the different widely used projects in the big data and data engineering space have ended up under the umbrella of the Apache organization.
Maxime: Right. Especially considering all the restrictions that come with it, and the bureaucratic process that goes along with going Apache. So, we have to use JIRA, for instance, and there was limitation on how we could use GitHub. There's these old mailing lists that we need to use for legal reasons. So, we're, say, unable to use GitHub Issues, we have to use JIRA instead, and at the beginning of the project, that was a little bit frustrating to just have to step back and use some of the Apache infrastructure that is really kind of decade-old at this point.
If you do a Google search, you'll end up on some Apache mailing list website where it looks like the CSS is from the 1900s. But, all of that aside, I think it was a really good choice for Airflow, and I think that it really contributed to the project's growth because it attracted very talented committers and developers that were like, "Hey, I can be an Apache committer," or some of them were Apache committers before on other projects, and then they can really get involved and trust that things are going somewhere.
If I wanted to get involved with, say, Luigi, which is a similar product to Airflow in some ways, I would have to kind of negotiate with Spotify or somehow try to get them to let me join the project, and there might not be a process for that, there might not be any guarantees. With an Apache project, there's going to always be a way for a company to get involved and to become decision makers as part of the project.
Tobias: What do you see as some of the biggest challenges facing data engineers currently, and the discipline of data engineering?
Maxime: One of the new things is all the real-time data. You know, Spark Streaming, Samza, databases like Druid, that's kind of a new stack for a lot of us. It's also a new type of technology, a new set of constraints. So, people are adapting to things like Beam, Apache Beam, which is based on the Google Dataflow paper. So it's not only knowing all the batch processing stack, which is fairly complex with things like MapReduce, and Hive, and schedulers like Airflow. Now, on top of that, you have to go and learn the real-time stack. Or, perhaps, in bigger organizations, it's different people that will focus on these different aspects of data engineering.
What else? So, working with data science has always been a challenge, and there's always this kind of push and pull as to whether a data engineer should be building kind of horizontal infrastructure that would be used across the board, or whether the data engineer should go and work closely with a product team. So, that would mean something like someone at Airbnb might be the data engineer assigned to a specific product team, say working on the new trips feature that we have, and build the data structures for them.
The data engineer has been always kind of the pivotal point as to whether they work on horizontal or vertical things. And sometimes, it's hard, kind of, for everyone to be in that pivotal point, and there's all these forces pulling you in different directions.
What else is challenging? I mean, just keeping up with the diverging data landscape and ecosystem has always been a challenge. There's a new flavor of database every day, and there's just so many packages and libraries and frameworks that you have to stay on top of, and that's always been exciting, but kind of challenging too. If you stay put for too long at the same place, or on kind of the same stack, you can be kind of pigeonholed in a certain role, or on a certain stack. Those are some of the challenges I can see.
The main challenge for a data engineer, I think, that I see is operational creep. So, I think it's true of a lot of careers in information management in general, and software engineering. But, it's really easy, over time, to build more and more things that you have to maintain, and to go from a place where you start, perhaps, spending 90% of your time on development, and 10% of your time on managing things, fixing stuff, or refactoring code.
Then, over time, as you stay with the company for one year, two years, three years, it's really easy to pivot and realize at some point that you're spending, say, 80, 90 percent of your time just kind of fixing stuff and fighting fires, and not adding a lot of value because you're just kind of the guy that keeps the things running.
So, it's important for people that identify this to go and invest on tooling and on refactoring and paying off that technical debt so that they can stay challenged and so that they can keep creating as opposed to just kind of fixing stuff that they've built in the past.
Tobias: You touched briefly on this, but how much analytical knowledge do you think is necessary for somebody who is working as a data engineer?
Maxime: So, I think you need to have, definitely, this analytical instinct, and that is vital for data scientists to have -- people have called that product sense too. I'm not sure if that's what you are talking about here, but you need, say, as an analyst or as a data scientist, you absolutely need to have that product sense or have that intuition as to just being really curious and kind of prone to dig in until you find the answer you're looking for.
So, I would say that is critical for analysts and data scientists. I would say it's really important for data engineers too. Though, for data engineers, there's that other urge, which is the urge to build infrastructure and to build things that are there to stay that you have to balance with the more, sometimes, ephemeral analytical urges you might have.
Tobias: Also, how much sort of statistical analysis or understanding of some of the different mathematical principles or probabilistic principles are necessary for being able to properly manipulate the data so that it is in a shape that is accessible for the people who are doing the end analysis on it.
Maxime: I think this part I would call data modeling, which is very different from statistical knowledge, so let me talk a little bit about data modeling first. So, data modeling is not a perfect science. There's been all sorts of books written on it. There's this concept of star schemas, and snowflake schemas, and Ralph Kimball and Bill Inmon are people that wrote about that stuff in the 90s. Some aspects of those books are still very relevant today. Some aspects might be a little bit less relevant than they used to be.
But, data modeling for a data engineer is a core skill. It's how are you going to structure your tables, your partitions, where you're going to normalize and denormalize data in the warehouse, are you going to be able to, say, retrieve attributes of, say, the customer at the time of the transaction versus the most recent attribute of the customer. So, how do you model your data so you can ask all of these questions?
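[Editor's note: the "attribute at the time of the transaction versus the most recent attribute" question can be illustrated with a small sketch. One common way to support both lookups is to keep a dated history of attribute values, which Kimball-style modeling calls a type 2 slowly changing dimension. The speaker does not name a specific technique, so that choice is an assumption here, and the customer data and function names are hypothetical.]

```python
from bisect import bisect_right
from datetime import date

# Hypothetical attribute history: customer id -> list of
# (effective_from, city), kept sorted by effective_from.
CUSTOMER_CITY_HISTORY = {
    "cust_1": [
        (date(2016, 1, 1), "Montreal"),
        (date(2016, 9, 1), "San Francisco"),
    ],
}


def city_as_of(customer_id, as_of):
    """Return the city that was in effect for the customer on `as_of`."""
    history = CUSTOMER_CITY_HISTORY[customer_id]
    effective_dates = [start for start, _ in history]
    # Index of the last history row whose effective date is <= as_of.
    idx = bisect_right(effective_dates, as_of) - 1
    if idx < 0:
        raise LookupError("no attribute value in effect on that date")
    return history[idx][1]


def current_city(customer_id):
    """Return the most recent city on record for the customer."""
    return CUSTOMER_CITY_HISTORY[customer_id][-1][1]
```

With this layout, a transaction dated 2016-03-15 resolves through `city_as_of` to the city in effect that day, while `current_city` answers the "most recent attribute" question from the same history.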
Data modeling for analytical purposes is very different than, say, data modeling for OLTP type of applications. So, OLTP is just the classical kind of database structure for building simple software. Data modeling, I would say, is very important now. Talking about knowledge about stats and statistical analysis, I would say not very much so, and I would even get people outraged on that, but I would say it's also not very important or not as important as one might think for a data scientist. Because, a lot of what we do in analytics is just counting things and trying to figure out are we doing better than we used to, and what is the growth rate, and understanding a seasonal pattern.
So, you need to have this analytical instinct and that product sense, and those ideas. But, in a lot of cases, you're really just counting things. The stats that we use and the stats that I worked on more recently that are kind of very important, but that we abstracted out for pretty much the whole company is working on an experimentation framework. So, most modern companies do experimentation with their users, and that, usually, people will call experimentation like an A/B testing framework.
It's really important for a company, especially a web company, to be able to run a lot of experiments. In those experiments, usually, you'll have different treatments and a control, and you want to see whether the changes in behavior are statistically significant. But, say, at Airbnb, I was involved in building some of the data pipeline for our A/B testing framework named ERF. I believe there's some papers out there, or probably some videos and presentations that some of my colleagues have done over the past year or two.
But, that part of stats has been commoditized at Airbnb where now you can go and create an experiment, and deploy it, and consume the results. Of course, you need to know what a p-value is, and you need to know what confidence intervals are from just a consumption perspective, and you don't necessarily need to know and understand all of the aspects of the mathematics behind it.
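[Editor's note: for readers who want to see what a p-value and a confidence interval look like from the consumption side, here is a minimal sketch of a two-proportion z-test comparing a control and a treatment. It is a textbook test chosen for illustration, not a description of how ERF computes its results, and the function and parameter names are hypothetical.]

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Compare conversion rates of a control and a treatment group.

    Returns (z, two_sided_p_value, ci_95), where ci_95 is an approximate
    95% confidence interval for (treatment rate - control rate).
    """
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    diff = p_t - p_c

    # Pooled standard error, used for the test statistic under the null
    # hypothesis that both groups share the same conversion rate.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error for the confidence interval on the difference.
    se_diff = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    margin = 1.96 * se_diff
    return z, p_value, (diff - margin, diff + margin)
```

A small p-value together with an interval that excludes zero is the usual signal that the treatment changed behavior; an experimentation framework typically hides this arithmetic behind a results page.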
I would say the stats work in a lot of cases, and analytics is a little bit overrated. In some cases, it's really important for data science, but I think, sometimes, it's often overrated, and it's also often very well-abstracted out too. So, you can easily do some very complex things by using very simple libraries and functions with clear APIs.
Tobias: To make sure that the data that you're working with is of sufficient quality, what are some of the considerations that you need to be aware of when you're establishing new data sources?
Maxime: Data is kind of a jungle. There's all sorts of data out there, and as a data engineer, I didn't talk about that workload before, or that burden which is data integration. So, you need to go pretty often and go and fetch data coming from different partners, and different companies, and you need to integrate, say, your referentials, like your list of users, or your list of accounts and transactions with some external service providers. I think that's as challenging as it's ever been. We thought, at some point, that the B2B data flow would get kind of fixed over time, and it would become easy that you could just kind of have some easy APIs that would sync up your data with, say, Salesforce or some of these service providers.
Data integration is as challenging as it's ever been. I think, going back a little more to your question as to data quality, in some cases, we want to get data from other systems, and we're concerned about data quality because we don't have control around the instrumentation of how this data is generated. The first thing is probably to make sure you get things right where you're actually generating the data.
So, say most web companies will have some sort of, what we call, an instrumentation framework, meaning that engineers that want to track certain actions on a website, or on mobile, will have a way to put kind of tokens in little trackers on the site to be like when someone hits the booking button at Airbnb, make sure we track that.
I think it's really important to have, what we call, schema enforcement as upstream as possible. So, that means as you generate that data, you want for that data to come to life, to be originated with as much quality and have all the dimensions and all the metrics that you need from that moment on. That can be really challenging to do that.
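[Editor's note: a rough sketch of what schema enforcement at the point of generation could look like. The event name, fields, and helper functions are hypothetical and only illustrate the idea of rejecting a malformed event before it enters the pipeline; this is not Airbnb's instrumentation framework.]

```python
# Hypothetical schema for a "booking button clicked" event:
# field name -> required Python type.
BOOKING_CLICK_SCHEMA = {
    "user_id": int,
    "listing_id": int,
    "platform": str,
    "timestamp_ms": int,
}


def validate_event(event, schema):
    """Return a list of problems; an empty list means the event conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    for field in event:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def emit(event, schema, sink):
    """Append the event to `sink` only if it matches the schema."""
    problems = validate_event(event, schema)
    if problems:
        # Failing loudly at the source keeps bad rows out of everything downstream.
        raise ValueError("; ".join(problems))
    sink.append(event)
```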
Now, when you integrate with external sources, so if you're fetching data from the Salesforce API, or in our case, we use Zendesk, which is kind of a ticketing system provider. When we get data from them, sometimes you don't know what to expect. They might change their API overnight and not let you know, or a certain referential that you might have some very specific rules on might change over time.
In batch processing and in stream processing, you can embed some data quality checks. So, that would mean that as you write your pipeline, you might have an idea of problems that may occur in the future and set up some alerts and some breakpoints saying, "If there's certain variables that go over a threshold, don't load that data into production," and if there's a value I don't recognize for a certain field, send an alert.
That's really very much case-by-case, and in a lot of cases, it's about kind of developing an immune system over time. So, that means, at first, you work with an API. You take for granted that it's not going to change, you build your pipeline and one day, something happens and the feed, there's a bug of some sort, or the data is not what it should be. Then, you might go and add some more checks, and kind of develop some mechanism to alert or to prevent this data from making it to production before someone looks into it and fixes the issues.
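[Editor's note: the two kinds of checks described above, a threshold that blocks the load and an unrecognized value that only raises an alert, might be sketched as follows. The field names, the list of known statuses, and the 5% threshold are assumptions made up for this example, not details of the Zendesk feed.]

```python
# Hypothetical referential and threshold for a ticketing feed.
KNOWN_TICKET_STATUSES = {"open", "pending", "solved"}
MAX_NULL_RATE = 0.05


def run_checks(rows):
    """Return (blocking, alerts): lists of human-readable messages.

    Blocking problems should stop the load into production;
    alerts should notify someone but let the load proceed.
    """
    blocking, alerts = [], []
    if not rows:
        blocking.append("feed is empty")
        return blocking, alerts

    # Threshold check: too many rows without a requester blocks the load.
    null_rate = sum(1 for r in rows if r.get("requester_id") is None) / len(rows)
    if null_rate > MAX_NULL_RATE:
        blocking.append(f"requester_id null rate {null_rate:.0%} over threshold")

    # Unknown-value check: a status we have never seen only sends an alert.
    unknown = sorted({str(r.get("status")) for r in rows} - KNOWN_TICKET_STATUSES)
    if unknown:
        alerts.append(f"unrecognized status values: {unknown}")

    return blocking, alerts


def safe_to_load(rows):
    """True when no blocking check fired."""
    blocking, _ = run_checks(rows)
    return not blocking
```

The "immune system" grows by adding another check to `run_checks` each time a new kind of bad feed is discovered.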
One pattern that we do a lot at Airbnb on Airflow is as we do batch processing, very often, we'll stage the data. So, that means we bring the data into a staging area. You can think of it like if you have a warehouse, that would be like where all of the trucks kind of drop the things before they get sorted out and brought into the warehouse. We have this persistent staging area where we load the data, and then we'll, perhaps, run a set of checks on it. So: stage, check, and then exchange, and the idea of exchanging is to bring that data into production or into a validated space.
With Airflow, we made this little subdag, so it's kind of a construct that we reuse everywhere, and it's in the nature of stage, check, and exchange, and the check step of that pipeline will grow more complex over time as we identify patterns of data quality issues.
So, we might say like if there's a week-over-week change of more than 30%, fail the pipeline and send an email to the owner of the pipeline, for instance, and we have tons of these checks all over the place at Airbnb.
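[Editor's note: the stage, check, exchange pattern with a week-over-week check can be sketched in plain Python. This is not the Airflow subdag mentioned above; an in-memory SQLite database stands in for the warehouse, the table names are hypothetical, and raising an exception stands in for failing the pipeline and emailing its owner. It assumes a `bookings_prod` table with columns `(ds, bookings)` already exists.]

```python
import sqlite3

MAX_WEEK_OVER_WEEK_CHANGE = 0.30  # the 30% threshold from the example above


class CheckFailed(Exception):
    """Raised when staged data should not reach production."""


def stage(conn, rows):
    """Load incoming rows into a staging table, replacing any previous batch."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS bookings_staging")
        conn.execute("CREATE TABLE bookings_staging (ds TEXT, bookings INTEGER)")
        conn.executemany("INSERT INTO bookings_staging VALUES (?, ?)", rows)


def check(conn, last_week_total):
    """Fail if the staged total moved more than 30% versus last week."""
    if last_week_total <= 0:
        raise ValueError("last_week_total must be positive")
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(bookings), 0) FROM bookings_staging"
    ).fetchone()
    change = abs(total - last_week_total) / last_week_total
    if change > MAX_WEEK_OVER_WEEK_CHANGE:
        raise CheckFailed(f"week-over-week change of {change:.0%}")


def exchange(conn):
    """Move validated rows from staging into production in one transaction."""
    with conn:
        conn.execute(
            "DELETE FROM bookings_prod WHERE ds IN (SELECT ds FROM bookings_staging)"
        )
        conn.execute(
            "INSERT INTO bookings_prod SELECT ds, bookings FROM bookings_staging"
        )


def stage_check_exchange(conn, rows, last_week_total):
    stage(conn, rows)
    check(conn, last_week_total)  # raises before production is touched
    exchange(conn)
```

Because `check` raises before `exchange` runs, a suspicious batch stays in the staging table where someone can inspect it, and production keeps the last validated data.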
Tobias: Would you say that that's fairly analogous to the idea of unit testing and continuous integration for standard software development, but applied to data?
Maxime: I think it's fairly different because it's reactive and it's production code. If you look at continuous integration, you're trying to prevent bad logic or changes from making it into production. So, if you do your job well with your unit tests and your continuous integration, you're not going to deploy code that would result in problems, whereas data quality checks in data warehousing are more of a reactive thing of like we have the logic. We don't control the ingredients that are coming into this recipe. But, we'll make sure that, if we detect certain things, we won't bring that data into production.
I guess, in some ways, instead of code that would make it or not make it into production, it would be more like data that would make it or not make it into production. So, it is kind of analogous in some ways. I can talk a little bit more about unit testing and that part of continuous integration for data warehousing.
The state of that is not really great, and a lot of people ask like, "What is the best practice for Airflow?" Like, how do you run your unit test and how do you make sure that when you launch a pipeline that it's not going to break production. The answer is there's probably as many ways to test and validate a pipeline as there are engineers and pipelines out there. We have different ways of doing it, but one thing that's clear is that you cannot really have an at-scale development environment.
So, you look at when I was working at Facebook, we're close to the exabyte in the data warehouse, and you could not just say, "Oh, we're just going to have a dev instance of the warehouse that's also going to be an exabyte." So, really often, you need to create some sort of microcosm of your data warehouse, and sometimes it might just be pushing some sample data through and making sure that there's no fatal errors that are happening, and making sure that all the rows that made it in the pipeline actually made it, or got summarized in the end.
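[Editor's note: a minimal sketch of the "push some sample data through" idea: run one pipeline step on a handful of rows, let any fatal error surface, and confirm that every input row is accounted for in the summary. The pipeline step and sample rows are hypothetical.]

```python
from collections import Counter


def summarize_bookings(rows):
    """Hypothetical pipeline step: count bookings per market."""
    return Counter(row["market"] for row in rows)


def smoke_test(pipeline_step, sample_rows):
    """Run a step on sample data and check that no rows were lost.

    Any exception raised by the step propagates, which covers the
    "no fatal errors" part; the count comparison covers the
    "all rows made it or got summarized" part.
    """
    summary = pipeline_step(sample_rows)
    rows_out = sum(summary.values())
    if rows_out != len(sample_rows):
        raise AssertionError(
            f"{len(sample_rows)} rows in, {rows_out} rows accounted for"
        )
    return summary
```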
But, the reality is data engineers don't systematically test a lot of what they do. They'll probably test the little piece of pipeline they work on. So, that means they might run this piece of the pipeline and diverge it into an alternate temporary table that they can check. But, once it's checked, they'll just pipe it right back in and push it into production. Just because doing things right or by the book would be super prohibitive in terms of infrastructure cost, and kind of the how much more confident you're going to get for the amount of work that's required, sometimes is not worth it.
It's really kind of a case-by-case, and there's all sorts of techniques out there. I should probably write a blog post on the subject one day, but the reality is that data engineers are often less thorough than software developers just because there are costs involved and the challenges involved.
Tobias: Have you seen any points where the work done by data engineers and managers of data structure has fed back into more mainstream software development and systems engineering?
Maxime: I know one aspect of this is that, the same way that data engineers are doing more software engineering, I think it's also true that software engineers do a fair amount of data-oriented work, right? So, in places like Airbnb or Facebook, or any modern web company, a software engineer needs to be able to set up an A/B test, or an experiment, so that they can measure exactly how the change they're making to a website is going to influence the different metrics that they're trying to move.
From that perspective, software engineers are doing some of the things that data engineers do. So, writing pipeline or instrumenting a metric, building dashboards, and building their own pipeline and dashboard is not uncommon for a software engineer working on a product.
Tobias: How do you see the role of data engineers evolving in the next few years?
Maxime: I see that the roles are going to become a little bit more formal and balanced in regards to data science. I think a lot of companies have recognized that they need data scientists in order to compete in their field and in their business. I don't think all companies have identified that they need a data engineer to organize the data structures and pipelines that the data scientists are going to source from.
One fact is that data scientists are pretty horrible at building infrastructure and data pipelines, so they'll build these pipelines that are very brittle, that will fail over time, and that are not manageable. Typically, the data modeling is not done right, and there's a lot of throwaway work and redundant work where all the data scientists will go from the raw data and then do their own analysis, and apply their own transformations, and then they end up with big problems, like a metric with the same name that has very different values depending on who computed it in data science.
So, I think that companies that have kind of skipped this part of, say, data warehousing and data engineering, of getting serious about their data structures and how they organize their data and metadata, are suffering, because the data scientists are doing that work and are not that great at it, and there's no consistency in the way they look at their numbers. You get into this issue where no one can trust any number, where the CEO is like, "I don't even want to see the dashboard, because I know it's wrong because it's different from that other dashboard."
I also see data engineers starting to do more abstracted work. So, that means that beyond what a data engineer does today, which is building pipelines and building the warehouse, I see data engineers starting to build services and frameworks that other people within the company can use.
An example of that would be, say -- or an obvious example is the A/B testing framework where, maybe, originally, you would have, for every experiment, some data scientist would go and do all the stats work necessary to figure out whether that experiment was successful or not, or how it moved different metrics. Now, the data engineer, along with software engineers, and with data scientists, together, will build the reusable framework that can be used for all experiments.
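To make the idea of a reusable experiment framework a little more concrete, here is a minimal sketch of one shared function that any experiment could call instead of having the stats redone by hand each time. The choice of a pooled two-proportion z-test, the `ExperimentResult` type, and all names below are assumptions for illustration; the interview does not say which statistical method such a framework would use.

```python
import math
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    """Outcome of comparing a conversion metric between control and treatment."""
    lift: float      # absolute difference in conversion rate (treatment - control)
    z_score: float
    p_value: float   # two-sided


def compare_conversion(control_conversions: int, control_users: int,
                       treatment_conversions: int, treatment_users: int
                       ) -> ExperimentResult:
    """Pooled two-proportion z-test, reusable across experiments."""
    p_control = control_conversions / control_users
    p_treatment = treatment_conversions / treatment_users
    pooled = (control_conversions + treatment_conversions) / (
        control_users + treatment_users
    )
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / control_users + 1 / treatment_users)
    )
    lift = p_treatment - p_control
    if std_err == 0:
        # No variation at all (everyone or no one converted): nothing to test.
        return ExperimentResult(lift=lift, z_score=0.0, p_value=1.0)
    z = lift / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return ExperimentResult(lift=lift, z_score=z, p_value=p_value)


# The same function serves every experiment; only the counts change.
result = compare_conversion(100, 1000, 200, 1000)
print(result.lift, result.z_score, result.p_value)
```

A real framework would also handle things like metric definitions, experiment assignment, and multiple comparisons; this sketch covers only the single shared computation.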
That's just one application of reusable components that data engineers can build. So, there are other ideas like cohort analysis, aggregation frameworks, all sorts of computation frameworks that abstract out the common data work that's done in different companies. Then there are the forces that I was talking about earlier. Over the next few years, are data engineers going to align more with their verticals, or are they going to align closer to their infrastructure peers and work more horizontally across the whole product?
That's still unclear to me, and we have to figure out which way it's going to go. My draw and my personal interest is to go towards more horizontal than vertical, but I'm not sure which way the industry is going to go. Then, one question around the future and the next few years, and one thing that I truly hope we start to see is some sort of convergence in the data ecosystem.
So, that means, can we all agree that we should all use Spark Streaming and not Samza? Or can we all agree, kind of like what happened with Kafka, where people kind of converged and said, "Kafka is the tool that we're going to use," so that you won't have to choose between five or six different tools or frameworks? Hopefully, there's convergence in that area where people agree on how they should build and maintain their infrastructure.
Also, I'm hoping that when you think about the data infrastructure work, which is installing and maintaining Hadoop clusters, and Druid clusters, and Spark Streaming clusters, I'm really hoping that some cloud providers, or some service providers commoditize that so that every company does not need to go and install and maintain all of that infrastructure. It's just inefficient. So, that might be a trend in the future where we see some really good service providers where you can stand up a Druid cluster in 10 minutes and get value from it right away.
Tobias: One of the things that you sort of alluded to briefly is the idea of data scientists writing some analytical code that needs to be operationalized. How much of that responsibility falls on data engineering versus a more full-fledged software engineer who needs to then take that analytics code and write it as a scalable piece of software?
Maxime: That's a very good question as to whose responsibility is it to make the data scientist obsolete. Is it the data scientist's job to kind of automate their own work? I don't think it's their natural draw since they're not necessarily engineers, and their skills don't necessarily align that way. At both Airbnb and Facebook, which are my two most recent companies, there is some form of a machine learning infrastructure type team. I believe, at Facebook, it was called Data Science Infrastructure, and at Airbnb, I believe it's called either machine learning infrastructure, or data science infrastructure as well.
That's a slightly different role. Are data engineers the right people to do this? I guess we're getting into one of these gray zones where we're looking for a unicorn type of person that has all those skills in the middle of that Venn diagram of data science, software engineering, and data engineering, and those people are extremely rare. If you find one, just make sure you keep them for a long time, because it's almost impossible to find these people.
It's hard because, say, if you put one data scientist, one data engineer, and one software engineer in a room and you try to make them work together, results may vary. But, I think this idea of data science infrastructure is super interesting, probably one of the most exciting areas right now, right? We can commoditize data science.
I believe, in a lot of cases, as I said earlier, the value of data science is not necessarily in building one really intelligent machine learning model to solve one simple problem. I believe the real value is in doing some basic machine learning across the board, not necessarily something very complex in terms of the math or the libraries. Really often, it's just about having a little bit of machine learning applied in the right place and across the board, and then commoditizing that.
So, hopefully, we're going to see some of that. Right now, I think it's only a handful of companies doing that, and they're kind of struggling. There aren't really companies, or services, or libraries that are provided to do that sort of stuff.
Tobias: Are there any other topics that you think that we should cover before we close out the show?
Maxime: I think we covered a lot of things. I think we're pretty good.
Tobias: It's definitely a large space with a lot of different concerns, but I think that it's important to sort of lay the groundwork of when you say data engineer, what is it that you're really talking about, and I think that we did a pretty good job of at least doing a good approximation of that.
Maxime: Right. At least like centering, finding the middle point of the core of what is a data engineer. Then, from that point, you decide how far from that point you want to allow people to go to. But, the core, to me, is right around the data warehouse and the pipelines that organize the company's data and metadata.
Tobias: Alright, well for anybody who wants to keep in touch with you or follow what you're up to, I'll add your preferred contact information to the show notes. I just want to say I really appreciate you taking the time out of your day to share your thoughts about data engineering and your experiences working with it, and I hope you enjoy the rest of your evening.
Maxime: Thank you very much. That was an honor, and a pleasure to be on this show.