Summary
The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- A few announcements:
- There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
- The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
- If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
- Your host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists
Interview
- Introduction
- How did you get involved in the area of data management?
- The terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. Can you share how you define those terms?
- What parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators?
- Is there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team?
- What are the benefits of splitting the responsibilities of data engineering and data science?
- What are the disadvantages?
- What are some strategies to ensure successful interaction between data engineers and data scientists?
- How do you view these roles evolving as they become more prevalent across companies and industries?
Contact Info
- Website
- wdm0006 on GitHub
- @willmcginniser on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Blog Post: Tendencies of Data Engineers and Data Scientists
- Predikto
- Categorical Encoders
- DevOps
- SciKit-Learn
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:13] Unknown:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media.
We've got a couple of announcements before we start the show. There's still time to register for the O'Reilly Strata Conference in San Jose, California happening from March 5th to 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets. The O'Reilly AI Conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets.
Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey. And today, I'm interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists. So, Will, could you start by introducing yourself? Yeah. Sure. Thanks for having me on.
[00:02:11] Unknown:
So my name is Will McGinnis. I'm the chief scientist at Predikto. We're a startup in Atlanta. We have a software platform that helps big industrial companies predict failures in large transportation assets. So planes, trains, cranes, that kind of thing. And we do that by taking in a huge amount of sensor data and maintenance data, all types of kind of messy, maybe not as managed as one would like data. And we, you know, we clean it up, we merge it, use our machine learning engine, and try and tell somebody what they need to do ahead of time. Other than that, I do some open source work. So I'm a maintainer of Categorical Encoders
[00:02:54] Unknown:
in the scikit-learn-contrib project. And do you remember how you first got involved in the area of data management?
[00:03:00] Unknown:
Yeah. So I fell kind of into it, maybe, you could say, backwards. My educational background is mechanical engineering. So I did undergrad and graduate school in that. And my research was in trying to predict wear-based failures in different aerospace components with physics models. So we would, you know, go do these experiments and have, you know, huge amounts of time series data and try and build some model that had to predict it. And around when I was finishing that up, I met the founders of Predikto and ended up joining as the first employee. So really early on, you know, I got to wear a ton of different hats. So I was kind of trying to do the machine learning part, but you have to do all of the data management parts before that. So for the first couple of years, most of the work was really trying to build out a good data pipeline, data management, how are we gonna take in all these different kinds of data without, you know, going through a different data management process for every customer. And, you know, we learned a ton as we went along, but we were kinda doing it on the fly.
[00:04:08] Unknown:
You recently wrote a blog post about the tendencies of data engineers and data scientists. And given the fact that you started off as the first hire at Predikto, I'm sure you got to wear both hats for quite a while before you had enough people that it made sense to actually split out those different responsibilities into separate roles. So to start with, I'm wondering if you can just explain your definition of the terms data scientist and data engineer given the fact that they're often very overloaded and people will use very fluid definitions when they're referring to each of those different roles.
[00:04:44] Unknown:
Yeah. Absolutely. I mean, I think they tend to be kind of fuzzy titles, but I try to think of them, and really any job title, in terms of the domain of work that you're doing and the methods that you're using. Right? So think of how a civil engineer and a mechanical engineer are both doing engineering work, but in two different domains. Data scientist and data engineer are in the same domain. They're both dealing with data problems, data analysis, data systems, but the style of work that they're doing is different. Engineering work and science work are, you know, very different in terms of how you manage them, how you scope different things, how you gather requirements, or if you even can. So the domain and the things that you're dealing with are very similar, and there's a lot of overlap. You know, I think most people do a little bit of both. But when it comes down to actually trying to manage your tasks and decide, you know, what am I gonna do this week and how am I gonna let people know what I did?
They're very, very different. So, I mean, I think it applies to basically any job, especially in software, where I think we like to pick a lot of different, really granular job titles, often picking tools, so somebody will say, like, I'm a Hadoop engineer or something like that, which I think is maybe a little bit strange. But I think the main takeaway is understanding the workflow of science and data science kind of work versus engineering. So agile may not be particularly useful for data scientists if you can't really scope the work that you're doing accurately enough to be very reliable. That's the first order
[00:06:26] Unknown:
of understanding for that. Yeah. I could agree that identifying your role by the tool you use is rather strange because you can kinda think of it as if a carpenter were to use the same approach, they could say, oh, I'm a hammer engineer and that person over there is the saw engineer, where, you know, you're ultimately working towards the same goal, and it doesn't really make sense to be so granular in separating how you do your job.
[00:06:52] Unknown:
Right. Right. I think it comes down to just both roles are dealing with data and care about data. At the end of the day, I think most of the time, the customer of the data engineer is the data scientist, where the customer of the data scientist is probably some business unit or some business owner. And, you know, day to day, how one organizes their tasks and decides what to do is, you know, a data scientist is probably popping things off of a queue and they're not really sure how long this task is gonna take or it's very iterative. Data engineers can kind of scope things up into more granular little tasks and
[00:07:25] Unknown:
share them amongst their peers more easily. And when I was reading your post too, one of the things that stood out is that when people use the term data engineer, a lot of times they'll use that to refer to somebody who builds up the data pipeline and is responsible for all of the data plumbing, and the data scientist is usually the person who's seen as the one who's communicating with the business about what the data means and actually creating the presentations of it. But using the description of the role where the engineer is the person who takes the domain and makes it predictable opens up the idea of that role to being somebody who actually does create the front end of the presentation layer as well, where you may be stewarding the data all the way from collection through to presentation, but you're not necessarily doing the exploratory aspects of understanding where to find additional sources of data or creating new ways of interpreting that data. You're just working on making sure that all the different steps that the data takes are predictable and repeatable, and that may even include being the person who creates that presentation to the final business user.
[00:08:33] Unknown:
Yeah. Exactly. And, you know, I try to think of the separation between data scientists and data engineers also in terms of what physical thing they're going to hand to one another. Right? So if a data scientist, you know, is gonna develop some model in Python and, you know, send over some untested script and kinda throw it over the wall and say, alright, go figure this out, that's maybe not a great collaboration for anybody. Right? The kind of joint role that they have, as two teams or two people, or even if you're just one person trying to separate things, is to make a neat integration point where the data engineer's enabling the data scientists to do the analysis they need to do and put something into production, but they're not gonna completely upend their own data pipeline for it. Right? You don't wanna have to do a full deploy of something just because somebody needs to update a model. Right? So that takes kind of some advanced thought, and
[00:09:38] Unknown:
and you need to architect your system so that, you know, data scientists can do their job without disrupting the data engineer's job. And one of the things that I was noticing too as I was reading through your post and preparing for our conversation today is I see a lot of parallels between the way that you describe the relationship of the data scientist and the data engineer and the same relationship that occurs between developers and systems administrators, where developers and data scientists are agents of change who want to be able to iterate quickly on things and see the work that they do get released to production, while systems administrators and data engineers are more interested in restricting scope and creating reliability and consistency. And so there's this point of tension between the two roles and responsibilities where they're all ultimately working towards the same goal, but they're going about it from opposite ends. So I'm wondering if you have sort of drawn the same parallels or if you have any thoughts on that idea. Yeah. Absolutely. I mean, I think it's an extremely similar dynamic and one that comes up directly here as well. Right? So if you have completely separate operations,
[00:10:52] Unknown:
data engineering, data science, and, like, application development teams, they're all gonna be in some kind of tension with each other. Right? Especially in the data scientist needs, the data engineer needs, operations kind of pipeline. And I don't know that in any one organization there's, like, a real magic bullet there. The trade off is the more separation that you're gonna have between them, the more that you, like, treat the other group as a customer, you risk losing some of the, kind of, cross pollination. Right? So you'll maybe miss out on some good data engineering idea that could have worked its way into the data science group or vice versa. But you're probably gonna ship more and ship faster.
So trying to find that balance and, you know, where your product maturity is at that time and how that might affect where you wanna be in that balance,
[00:11:46] Unknown:
I think is kind of a constant struggle for anybody. And have you seen very many instances of people taking some of the philosophies from the DevOps movement and bringing them into the realm of data in terms of sort of fostering that collaboration between the two groups to ensure that you don't create as much of that point of tension, so that each side is trying to build up empathy and understanding for the needs of the other side to ensure that they're delivering the business value rather than focusing solely on their own responsibility and, you know, potentially to the detriment of the organization as a whole?
[00:12:26] Unknown:
Yeah. I mean, I think it's happening. I think it's maybe a couple of years behind what you see in the DevOps area. There's a number of projects or companies kind of working on this: let's enable data scientists directly to put things into production. And I think for a lot of projects, that makes a lot of sense. But for things at, like, very, very high scales or things where you're dealing with external data, I think you're gonna always end up with, you know, two separate groups or two separate people or whatever it is. And there will be this kind of almost negotiation on, you know, how much freedom do you let the data scientists have at the cost of engineering, at the cost of ops. And I'm not sure that there's really a free lunch to be had there other than trying to be diligent about managing it and understanding that it is a trade off that, you know, you need to consciously make.
[00:13:24] Unknown:
And in your experience, have you found that there is any sort of consistency in terms of the size and scale of an organization or a problem domain that creates the tipping point where you start to need to separate the two roles into separate responsibilities or having more than one person on a given team versus having the data scientist do both the engineering and the exploratory aspects of it?
[00:13:51] Unknown:
So I think the scale at which you wanna split them is pretty low, honestly. I mean, especially if there's travel involved for the data scientists. So, I mean, if you're traveling every other week to go see a customer or to some business unit or something like that, I mean, it's just hard as a person to be traveling and dealing with something closer to ops like data engineering. And like I said, the workflow is different enough that it can be kind of hard to context switch between something kind of exploratory and iterative and kind of just cranking out more, I guess, normal engineering work. I don't necessarily think that one's really harder than another, but, at least for me, if I have to do both things, I do a lot better, you know, doing all of one on one day and all of the other on another day. It's a very different kind of head space to be in. So I would say even if you're a one person team, splitting kind of your understanding of the two types of tasks across different sprints or different days is, you know, beneficial and worth it. And beyond just the
[00:15:05] Unknown:
ability to gain efficiencies by splitting those responsibilities, are there any other benefits that you've seen by separating the data engineering and the data science roles into separate sort of problem spaces or job titles within a company? I think it can
[00:15:22] Unknown:
kind of foster, so separating these things out pulls you out of the weeds a little bit as an engineer or a data scientist, and you can maybe more readily find parts of the stack that you could replace with something open source or some product or something like that, where if you're kind of getting into this hack mode where you're kind of doing the science work, which is, you know, very broad and exploratory, and engineering work where you're kind of just serving your own descent down the rabbit hole, I think it's a lot easier to end up with one off solutions. Right? So having that tension where there's somebody there saying, like, I'm an engineer, and I have to support this, we have to put some boundaries on things, I think is overall healthy.
[00:16:07] Unknown:
Yeah. And sometimes the best solutions to a problem occur because of these constraints that are being placed on the sort of capabilities of delivering a given solution. Yeah. Exactly. And have you noticed any particular disadvantages in having the responsibilities
[00:16:23] Unknown:
separated among multiple people? Yes. I mean, anytime you put some restriction on collaboration, which I guess at the end of the day this is, you're gonna risk, you know, missing out on some good ideas. So data science and data engineering, while they're very different kinds of workflows, you're in the same domain. Right? You may be a little bit more specialized in one aspect than another, but a lot of really good data science work comes out of data engineering and vice versa. Right? So I think if you have somewhat larger teams, it's really important to have, you know, at least one person that's showing up to planning meetings for both, that's, you know, facilitating some free flow of communication between the two. You don't wanna make engineers be in, you know, twice as many scrums or whatever it is. But whether it's a product manager or just one engineer or data scientist that's gonna be the go between, keeping that open is gonna help reduce the risk of, you know, a data science group beating their heads against a wall on something that data engineering has a solution for or vice versa. And are there any particular
[00:17:27] Unknown:
strategies that you've seen to help ensure a successful collaboration between data engineers and data scientists or any particular tooling or platforms that you've seen that help to foster that relationship and make it easier for them to collaborate?
[00:17:44] Unknown:
So I think the simplest and first thing to make sure you have is, if you're a data engineering team, build out whatever you can to decouple data science's work from your deployed pipeline. So if every time they want to put something new into production you have to do a deploy, that's really disruptive to both teams. So building, you know, config driven, well tested pipelines where they can change a config and you don't have to rerun a bunch of builds and maybe have downtime or whatever it is for your system, I think, is the first enabling step to kind of give both groups their own autonomy. And how do you view the roles evolving as they become more prevalent across more companies and more industries? So essentially, I mean, we've been in a lot of very large, maybe not traditional, software organizations and seen how they've done it. They're very, very different company to company in how they organize data science and data engineering teams.
So I think one thing that I've kinda wrapped my head around, I guess, is I think data engineering will be centralized in an IT organization. So you have one big data engineering group that sits, like, within corporate, and maybe as they get bigger, they're starting to put individuals into business units to kind of go upstream in the data, but all of that will be centralized. The data science groups, I see in basically every possible place in an organization they could sit. It might be corporate, it might be its own business unit, it might be all external, there might be one in every business unit. I think over time, we're gonna see data science groups align much more closely with the end business units.
So you'll have a small data science group in, you know, every business unit of a big conglomerate,
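As a rough illustration of the config-driven pipeline idea from the start of this answer, here is a minimal Python sketch. It is not Predikto's implementation. The step names, the `STEP_REGISTRY` mapping, the `model_version` key, and the sample sensor rows are all hypothetical and exist only for the example. The point it tries to show is that the deployed code owns the available steps, while the config decides which ones run and in what order.

```python
import json

# Hypothetical registry of transformation steps owned by data engineering.
# New steps require a deploy; choosing among existing steps does not.
STEP_REGISTRY = {
    "drop_missing": lambda rows: [r for r in rows if None not in r.values()],
    "scale_temperature": lambda rows: [
        {**r, "temperature": r["temperature"] / 100.0} for r in rows
    ],
}


def load_config(text):
    """Parse the JSON config and reject steps the pipeline does not know."""
    config = json.loads(text)
    unknown = [s for s in config["steps"] if s not in STEP_REGISTRY]
    if unknown:
        raise ValueError(f"unknown pipeline steps: {unknown}")
    return config


def run_pipeline(rows, config):
    """Apply the configured steps in order."""
    for name in config["steps"]:
        rows = STEP_REGISTRY[name](rows)
    return rows


# Assumption: in practice this text would live in a file or config store
# that data scientists can edit without triggering a deploy.
CONFIG_TEXT = '{"model_version": "v2", "steps": ["drop_missing", "scale_temperature"]}'

if __name__ == "__main__":
    sensor_rows = [
        {"asset": "crane-1", "temperature": 250.0},
        {"asset": "train-7", "temperature": None},
    ]
    config = load_config(CONFIG_TEXT)
    print(config["model_version"], run_pipeline(sensor_rows, config))
```

Validating the config against the registry up front is one way to keep the "well tested" half of the bargain: a bad config fails before any data is touched.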
[00:19:37] Unknown:
but one data engineering group. And one of the things that I've seen too is that as the principles and ideals behind data science and in particular machine learning and artificial intelligence become more common and more practiced, there are a lot of patterns that are emerging and tooling that is coming out to make it easier for people who aren't necessarily as well versed in the underlying statistics to be able to take those concepts and put them into production and be able to deliver value to the business. So I'm wondering if you see the role of the data scientist becoming more specialized as some of those tools come into the domain of the data engineer to be able to deploy those solutions on top of existing data without necessarily having to engage a dedicated data scientist for a particular problem. And then also in the reverse, a lot of these platforms for being able to collect and process data are becoming easier to run, and there are a number of cloud providers that are actually starting to produce managed solutions that make it easier for data scientists to have more of a, you know, one click deploy solution to be able to gain access to all the data that they need to be able to do their exploration and experimentation?
[00:20:50] Unknown:
Yeah. So it's very interesting to me. I think the root problem here is, I think, frankly, that data scientists are expensive. Right? And there's a lot of value that they bring, but they're expensive. So you see two big pushes in industry. On one end, there's this kind of idea that we should democratize data science and enable less technical people, like business analysts or people just in the regular business unit that are, like, maybe a power user of Excel or something like that, to apply machine learning in their existing roles. And then on the other end, you have this camp that's kind of saying, well, you know, just applying machine learning is not really the difficult part so much. So what can we do to need fewer data scientists? Right? So instead of saying how can I get more cheaply, how can I do more with the 5 or 10 or however many that I have? I tend to think that that is probably the approach that'll win out in the future. I think at the end of the day, the hardest things for a data scientist are, really, like, problem formulation, selling projects internally, convincing people that, you know, this magical black box is not making things up. More soft skills, validation, things like that, than in just training a big regression model. So I tend to think that tools that enable a highly qualified data scientist to do, you know, 10 or 100 times more projects in a year versus tools that help somebody less qualified
[00:22:27] Unknown:
do one or two models a year will win out in the long run. And one of the things too that you briefly highlighted is the expense involved in keeping a data scientist on staff. So I'm wondering if you have any sort of anecdotal evidence of the relative cost or the relative salary brackets for data scientists versus data engineers, and how you've seen that evolve as the roles have started to gain a more sort of
[00:23:10] Unknown:
In Atlanta, relatively comparable, with the caveat that the base qualifications for data scientists tend to be higher. So a lot of places, it's all PhDs. So they're kind of starting out already mid career or in a little bit higher bracket. So if you just kinda, like, lop off the new grad pay range from data scientists, that does still exist for data engineers. And then after that, I think they're fairly comparable. But, I don't know, it's hard to say because they're such murky titles in practice that data scientist could be anything from, you know, a business analyst to, you know, somebody pioneering research somewhere, and data engineer could be anything from a BI analyst to, you know, a core committer in Spark or something like that. It's always a complicated
[00:24:04] Unknown:
question when you're trying to understand what the salary ranges are for a given job title, again because they're so nebulous depending on who's writing the description and who's actually doing the work. And one other question that I have is in terms of the sort of portability of the skills, where I'm wondering if, as the responsibilities of a data scientist come to be in broader demand, you see the prevalence of full time data scientists within a company becoming less the norm and that being more of a sort of contract oriented role where a data scientist will come in, understand the needs of the business, work to try and understand what the data needs are of the business, and then be able to hand off some of those responsibilities to in house data engineers.
And if you see the data engineering role as being something that's more of a permanent fixture of a company in terms of maintaining their existing data systems and ensuring that they're continuing to operate as needed?
[00:25:06] Unknown:
I could absolutely see that. I mean, I think data engineering is a pretty natural alignment with just normal IT operation. So I don't really see that moving to a contract thing. For the most part, your goal as a data scientist, I think, is to become an expert at not having domain expertise. Right? Because if you lean too hard on the domain expertise, then at some point you're just an engineer in that domain who also knows machine learning. Right? So I think as the tooling that people use and the data systems kind of standardize a little bit more in the larger companies, I could see data science moving to a more kind of hired gun sort of scenario.
Right now, I think everybody's data pipelines and the way that they're organizing data is so different company to company that you'd lose a ton of time just getting up to speed on everything. So I think things will stay in house for a little while. Are there any other aspects of this topic that we didn't explore yet that you think we should cover before we start to close
[00:26:13] Unknown:
out the show? No. I think we covered just about everything I thought about. So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'm wondering from your perspective, what you view as being the biggest gap in the available tooling or technology for data management today?
[00:26:36] Unknown:
Yeah. So I work pretty heavily in the Python scikit-learn kind of ecosystem, and I would love to be able to neatly package a trained model, but also with all the dependencies for that model in a separate namespace. Because right now you can train a model, you could use joblib or pickle or something like that to serialize it. But as soon as you load it back, if you've got different versions of scikit-learn and NumPy and SciPy and all these other things, you kind of hope it'll work. And a lot of the time it will. But I think we're still lacking a really good way to reliably save old trained models and use them later. And in a big recurring machine learning system, it's pretty critical so that you can
[00:27:22] Unknown:
version old models and fall back to them if something's wrong. So I think for me, that would be huge. Yeah. I think that one of the hopes is that Docker will continue to be sort of the panacea for that kind of problem area, but repeatability and reproducibility in computing in general has been sort of the consistent bugaboo for anybody trying to actually produce any sort of software and run it in a production context. Yep. Yep. And especially with
[00:27:53] Unknown:
these scientific libraries, we're pulling in so many dependencies and system libraries, all these things under the hood. Just kind of having everything of the right version at the right time over the span of a few years is
[00:28:06] Unknown:
very difficult. Absolutely. Alright. Well, I really appreciate you taking the time to share your thoughts on this subject area because it's definitely one that is important for a lot of people to be thinking about and trying to understand so that they can be effective in their roles. So thank you for taking your time and I appreciate the work that you're doing. So I hope you enjoy the rest of your evening. Yeah. Thank you for having me. I enjoyed it.
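The gap Will describes in his parting answer, loading a pickled model under different library versions and hoping it works, can at least be made visible. The Python sketch below is an assumption-laden illustration, not a solution to the "separate namespace" packaging he asks for. It records the installed versions of a few packages next to the serialized model and refuses to load when the environment has drifted. The function names and the `TRACKED_PACKAGES` tuple are hypothetical. It uses only the standard library (`pickle`, `importlib.metadata`), so the model here is a plain dictionary standing in for a trained estimator.

```python
import pickle
import sys
from importlib import metadata

# Hypothetical list of packages to track; adjust to what a model depends on.
TRACKED_PACKAGES = ("scikit-learn", "numpy", "scipy")


def current_versions(packages=TRACKED_PACKAGES):
    """Return installed versions, or None for packages that are not installed."""
    versions = {"python": "%d.%d" % sys.version_info[:2]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def dump_model(model, packages=TRACKED_PACKAGES):
    """Serialize the model bytes together with the versions seen at save time."""
    bundle = {
        "versions": current_versions(packages),
        # The model is pickled separately so the version check can run
        # before any model-specific classes are imported on load.
        "model_bytes": pickle.dumps(model),
    }
    return pickle.dumps(bundle)


def load_model(blob, packages=TRACKED_PACKAGES):
    """Load the model only if the tracked versions match those at save time."""
    # Note: pickle should only be used on data from a trusted source.
    bundle = pickle.loads(blob)
    now = current_versions(packages)
    drift = {
        name: (saved, now.get(name))
        for name, saved in bundle["versions"].items()
        if now.get(name) != saved
    }
    if drift:
        raise RuntimeError(f"environment differs from save time: {drift}")
    return pickle.loads(bundle["model_bytes"])


if __name__ == "__main__":
    blob = dump_model({"weights": [1, 2, 3]})
    print(load_model(blob))
```

A check like this only detects a mismatch; falling back to an old model would still require recreating the recorded environment, for example in a container image as the host suggests.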
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media.
We've got a couple of announcements before we start the show. There's still time to register for the O'Reilly Strata Conference in San Jose, California, happening from March 5th to 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets. The O'Reilly AI Conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets.
Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey. And today, I'm interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists. So, Will, could you start by introducing yourself?
[00:02:11] Unknown:
Yeah. Sure. Thanks for having me on. So my name is Will McGinnis. I'm the chief scientist at Predikto. We're a startup in Atlanta. We have a software platform that helps big industrial companies predict failures in large transportation assets. So planes, trains, cranes, that kind of thing. And we do that by taking in a huge amount of sensor data and maintenance data, all types of kind of messy, maybe not as managed as one would like, data. And, you know, we clean it up, we merge it, use our machine learning engine, and try and tell somebody what they need to do ahead of time. Other than that, I do some open source work. So I'm a maintainer of categorical encoders in the scikit-learn-contrib project.
[00:02:54] Unknown:
And do you remember how you first got involved in the area of data management?
[00:03:00] Unknown:
Yeah. So I fell kind of into it, maybe you could say, backwards. My educational background is mechanical engineering. So I did undergrad and graduate school in that. And my research was in trying to predict wear-based failures in different aerospace components with physics models. So we would, you know, go do these experiments and have, you know, huge amounts of time series data and try and build some model that had to predict it. And around when I was finishing that up, I met the founders of Predikto and ended up joining as the first employee. So really early on, you know, I got to wear a ton of different hats. So I was kind of trying to do the machine learning part, but you have to do all of the data management parts before that. So for the first couple of years, most of the work was really trying to build out a good data pipeline, data management, how are we gonna take in all these different kinds of data without, you know, going through a different data management process for every customer. And, you know, we learned a ton as we went along, but we were kinda doing it on the fly.
[00:04:08] Unknown:
You recently wrote a blog post about the tendencies of data engineers and data scientists. And given the fact that you started off as the first hire at Predikto, I'm sure you got to wear both hats for quite a while before you had enough people that it made sense to actually split out those different responsibilities into separate roles. So to start with, I'm wondering if you can just explain your definition of the terms data scientist and data engineer, given the fact that they're often very overloaded and people will use very fluid definitions when they're referring to each of those different roles.
[00:04:44] Unknown:
Yeah. Absolutely. I mean, I think they tend to be kind of fuzzy titles, but I try to think of them, and really any job title, in terms of the domain of work that you're doing and the methods that you're using. Right? So think of a civil engineer and a mechanical engineer: both are doing engineering work, but in two different domains. A data scientist and a data engineer are in the same domain. They're both dealing with data problems, data analysis, data systems, but the style of work that they're doing is different. Engineering work and science work are, you know, very different in terms of how you manage them, how you scope different things, how you gather requirements, or if you even can. So the domain and the things that you're dealing with are very similar, and there's a lot of overlap. You know, I think most people do a little bit of both. But when it comes down to actually trying to manage your tasks and decide, you know, what am I gonna do this week and how am I gonna let people know what I did, they're very, very different.
So, I mean, I think it applies to basically any job, especially in software, where I think we like to pick a lot of really granular job titles, often, I think, picking tools. So somebody will say, like, I'm a Hadoop engineer or something like that, which I think is maybe a little bit strange. But I think the main takeaway is understanding the workflow of science and data science kind of work versus engineering. So agile may not be particularly useful for data scientists if you can't really scope the work that you're doing accurately enough to be very reliable. That's the first order of understanding for that.
[00:06:26] Unknown:
Yeah. I could agree that identifying your role by the tool you use is rather strange because you can kinda think of it as if a carpenter were to use the same approach: they could say, oh, I'm a hammer engineer and that person over there is the saw engineer, where, you know, you're ultimately working towards the same goal, and it doesn't really make sense to be so granular in separating how you do your job.
[00:06:52] Unknown:
Right. Right. I think it comes down to just both roles are dealing with data and care about data. At the end of the day, I think most of the time the customer of the data engineer is the data scientist, where the customer of the data scientist is probably some business unit or some business owner. And, you know, day to day, how one organizes their tasks and decides what to do is, you know, a data scientist is probably popping things off of a queue, and they're not really sure how long this task is gonna take, or it's very iterative. Data engineers can kind of scope things up into more granular little tasks and share them amongst their peers more easily.
[00:07:25] Unknown:
And when I was reading your post too, one of the things that stood out is that when people use the term data engineer, a lot of times they'll use that to refer to somebody who builds up the data pipeline and is responsible for all of the data plumbing, and the data scientist is usually the person who's seen as the one who's communicating with the business about what the data means and actually creating the presentations of it. But using the sort of description of the role as the engineer is the person who takes the domain and makes it predictable, it opens up the idea of that role to being able to be somebody who actually does create the front end of the presentation layer as well, where you may be stewarding the data all the way from collection through to presentation, but you're not necessarily doing the exploratory aspects of understanding where to find additional sources of data or creating new ways of interpreting that data. You're just working on making sure that all the different steps that the data takes are predictable and repeatable, and that may even include being the person who creates that presentation to the final business user.
[00:08:33] Unknown:
Yeah. Exactly. And, you know, I try to think of the separation between data scientists and data engineers also in terms of what physical thing they're going to hand to one another. Right? So if a data scientist, you know, is gonna develop some model in Python and, you know, send over some untested script and kinda throw it over the wall and say, alright, go figure this out, that's maybe not a great collaboration for anybody. Right? The kind of joint role that they have, as two teams or two people, or even if you're just one person trying to separate things, is to make a neat integration point where the data engineer's enabling the data scientists to do the analysis they need to do and put something into production, but they're not gonna completely upend their own data pipeline for it. Right? You don't wanna have to do a full deploy of something just because somebody needs to update a model. Right? So that takes kind of some advanced thought, and you need to architect your system so that, you know, data scientists can do their job without disrupting the data engineer's job.
[00:09:38] Unknown:
And one of the things that I was noticing too as I was reading through your post and preparing for our conversation today is I see a lot of parallels between the way that you describe the relationship of the data scientist and the data engineer and the same relationship that occurs between developers and systems administrators, where developers and data scientists are agents of change who want to be able to iterate quickly on things and see the work that they do get released to production, whereas systems administrators and data engineers are more interested in restricting scope and creating reliability and consistency. And so there's this point of tension between the two roles and responsibilities where they're all ultimately working towards the same goal, but they're going about it from opposite ends. So I'm wondering if you have sort of drawn the same parallels or if you have any thoughts on that idea.
[00:10:52] Unknown:
Yeah. Absolutely. I mean, I think it's an extremely similar dynamic and one that comes up directly here as well. Right? So if you have completely separate operations, data engineering, data science, and, like, application development teams, they're all gonna be in some kind of tension with each other. Right? Especially in the data scientist needs, the data engineer needs, operations kind of pipeline. And I don't know that in any one organization there's, like, a real magic bullet there. The trade-off is the more separation that you're gonna have between them, the more that you, like, treat the other group as a customer, you risk losing some of the, kind of, cross-pollination. Right? So you'll maybe miss out on some good data engineering idea that could have worked its way into the data science group or vice versa. But you're probably gonna ship more and ship faster.
So trying to find that balance and, you know, where your product maturity is at that time and how that might affect where you wanna be in that balance, I think is kind of a constant struggle for anybody.
[00:11:46] Unknown:
And have you seen very many instances of people taking some of the philosophies from the DevOps movement and bringing them into the realm of data in terms of sort of fostering that collaboration between the two groups to ensure that you don't create as much of that point of tension, so that each side is trying to build up empathy and understanding for the needs of the other side to ensure that they're delivering the business value rather than focusing solely on their own responsibility and, you know, potentially to the detriment of the organization as a whole?
[00:12:26] Unknown:
Yeah. I mean, I think it's happening. I think it's maybe a couple of years behind what you see in the DevOps area. There's a number of projects or companies kind of working on this: let's enable data scientists directly to put things into production. And I think for a lot of projects, that makes a lot of sense. But for things at, like, very, very high scales or things where you're dealing with external data, I think you're gonna always end up with, you know, two separate groups or two separate people or whatever it is. And there will be this kind of almost negotiation on, you know, how much freedom do you let the data scientists have at the cost of engineering, at the cost of ops. And I'm not sure that there's really a free lunch to be had there other than trying to be diligent about managing it and understanding that it is a trade-off that, you know, you need to consciously make.
[00:13:24] Unknown:
And in your experience, have you found that there is any sort of consistency in terms of the size and scale of an organization or a problem domain that creates the tipping point where you start to need to separate the two roles into separate responsibilities or having more than one person on a given team versus having the data scientist do both the engineering and the exploratory aspects of it?
[00:13:51] Unknown:
So I think the scale at which you wanna split them is pretty low, honestly. I mean, especially if there's travel involved for the data scientists. So, I mean, if you're traveling every other week to go see a customer or to some business unit or something like that, I mean, it's just hard as a person to be traveling and dealing with something closer to ops like data engineering. And like I said, the workflow is different enough that it can be kind of hard to context switch between something kind of exploratory and iterative and kind of just cranking out more, I guess, normal engineering work. I don't necessarily think that one's really harder than another, but, at least for me, I do a lot better if I have to do both things, you know, doing all of one on one day and all of the other on the other day. It's a very different kind of head space to be in. So I would say even if you're a one-person team, splitting kind of your understanding of the two types of tasks across different sprints or different days is, you know, beneficial and worth it.
[00:15:05] Unknown:
And beyond just the ability to gain efficiencies by splitting those responsibilities, are there any other benefits that you've seen by separating the data engineering and the data science roles into separate sort of problem spaces or job titles within a company?
[00:15:22] Unknown:
I think separating these things out pulls you out of the weeds a little bit as an engineer or a data scientist, and you can maybe more readily find parts of the stack that you could replace with something open source or some product or something like that. Whereas if you're kind of getting into this hack mode where you're doing both the science work, which is, you know, very broad and exploratory, and engineering work where you're kind of just serving your own descent down the rabbit hole, I think it's a lot easier to end up with one-off solutions. Right? So having that tension where there's somebody there saying, like, I'm an engineer, and I have to support this, we have to put some boundaries on things, I think is overall healthy.
[00:16:07] Unknown:
Yeah. And sometimes the best solutions to a problem occur because of these constraints that are being placed on the sort of capabilities of delivering a given solution.
Yeah. Exactly.
And have you noticed any particular disadvantages in having the responsibilities separated among multiple people?
[00:16:23] Unknown:
Yes. I mean, anytime you put some restriction on collaboration, which I guess at the end of the day this is, you're gonna risk, you know, missing out on some good ideas. So data science and data engineering, while they're very different kinds of workflows, you're in the same domain. Right? You may be a little bit more specialized in one aspect than another, but a lot of really good data science work comes out of data engineering and vice versa. Right? So I think if you have somewhat larger teams, it's really important to have, you know, at least one person that's showing up to planning meetings for both, that's, you know, facilitating some free flow of communication between the two. You don't wanna make engineers be in, you know, twice as many scrums or whatever it is. But whether it's a product manager or just one engineer or data scientist that's gonna be the go-between, keeping that open is gonna help reduce the risk of, you know, a data science group beating their heads against a wall on something that data engineering has a solution for, or vice versa.
[00:17:27] Unknown:
And are there any particular strategies that you've seen to help ensure a successful collaboration between data engineers and data scientists or any particular tooling or platforms that you've seen that help to foster that relationship and make it easier for them to collaborate?
[00:17:44] Unknown:
So I think the simplest and first thing to make sure you have is, if you're a data engineering team, build out whatever you can to decouple data science's work from your deployed pipeline. So if every time they want to put something new into production you have to do a deploy, that's really disruptive to both teams. So building, you know, config-driven, well-tested pipelines where they can change a config and you don't have to rerun a bunch of builds and maybe have downtime or whatever it is for your system, I think, is the first enabling step to kind of give both groups their own autonomy.
And how do you view the roles evolving as they become more prevalent across more companies and more industries?
So essentially, I mean, we've been in a lot of very large, maybe not traditional, software organizations and seen how they've done it. They're very, very different company to company in how they organize data science and data engineering teams.
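As a rough illustration of the config-driven idea described in the collaboration answer above, here is a minimal Python sketch in which the pipeline code stays fixed and a config value decides which model scores the data. Everything in it is hypothetical: the `MODELS` registry, the `model` config key, and the threshold functions standing in for trained models are assumptions made for illustration, not anything described in the episode.

```python
import json

# Hypothetical registry: simple threshold functions stand in for trained models.
MODELS = {
    "threshold_v1": lambda reading: reading > 0.8,
    "threshold_v2": lambda reading: reading > 0.6,
}


def load_config(text):
    """Parse the pipeline config (assumed here to be a JSON string)."""
    return json.loads(text)


def run_pipeline(readings, config):
    """Score every reading with whichever model the config names."""
    model = MODELS[config["model"]]
    return [model(r) for r in readings]


readings = [0.5, 0.7, 0.9]
# Same pipeline code, two different configs.
print(run_pipeline(readings, load_config('{"model": "threshold_v1"}')))
print(run_pipeline(readings, load_config('{"model": "threshold_v2"}')))
```

Switching from `threshold_v1` to `threshold_v2` only changes the config text; `run_pipeline` itself is untouched, which is the kind of decoupling the answer argues for.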
So I think one thing that I've kinda wrapped my head around, I guess, is I think data engineering will be centralized in an IT organization. So you have one big data engineering group that sits, like, within corporate, and maybe as they get bigger, they're starting to put individuals into business units to kind of go upstream in the data, but all of that will be centralized. The data science groups, I see in basically every possible place in an organization they could sit. It might be corporate, it might be its own business unit, it might be all external, there might be one in every business unit. I think over time, we're gonna see data science groups align much more closely with the end business units.
So you'll have a small data science group in, you know, every business unit of a big conglomerate, but one data engineering group.
[00:19:37] Unknown:
And one of the things that I've seen too is that as the principles and ideals behind data science, and in particular machine learning and artificial intelligence, become more common and more practiced, there are a lot of patterns that are emerging and tooling that is coming out to make it easier for people who aren't necessarily as well versed in the underlying statistics to be able to take those concepts and put them into production and be able to deliver value to the business. So I'm wondering if you see the role of the data scientist becoming more specialized as some of those tools come into the domain of the data engineer to be able to deploy those solutions on top of existing data without necessarily having to engage a dedicated data scientist for a particular problem. And then also in the reverse, a lot of these platforms for being able to collect and process data are becoming easier to run, and there are a number of cloud providers that are actually starting to produce managed solutions that make it easier for data scientists to have more of a, you know, one-click deploy solution to be able to gain access to all the data that they need to be able to do their exploration and experimentation?
[00:20:50] Unknown:
Yeah. So that's very interesting to me. I think the root problem here is, frankly, that data scientists are expensive. Right? And there's a lot of value that they bring, but they're expensive. So you see two big pushes in industry. On one end, there's this kind of idea that we should democratize data science and enable less technical people, like business analysts or people just in the regular business unit that are, like, maybe a power user of Excel or something like that, to apply machine learning in their existing roles. And then on the other end, you have this camp that's kind of saying, well, you know, just applying machine learning is not really the difficult part so much. So what can we do to need fewer data scientists? Right? So instead of saying how can I get more cheaply, how can I do more with the five or ten or however many that I have? I tend to think that that is probably the approach that'll win out in the future. I think at the end of the day, the hardest things for a data scientist are really, like, problem formulation, selling projects internally, convincing people that, you know, this magical black box is not making things up. More soft skills, validation, things like that, than just training a big regression model. So I tend to think that tools that enable a highly qualified data scientist to do, you know, 10 or 100 times more projects in a year, versus tools that help somebody less qualified do one or two models a year, will win out in the long run.
[00:22:27] Unknown:
And one of the things too that you briefly highlighted is the fact of the expense involved in keeping a data scientist on staff. So I'm wondering if you have any sort of anecdotal evidence of the relative cost or the relative salary brackets for data scientists versus data engineers, and how you've seen that evolve as the roles have started to gain more...
[00:23:10] Unknown:
In Atlanta, relatively comparable, with the caveat that the base qualifications for data scientists tend to be higher. So at a lot of places, it's all PhDs. So they're kind of starting out already mid-career or in a little bit higher bracket. So if you just kinda, like, lop off the new grad pay range from data scientists, that does still exist for data engineers. And then after that, I think they're fairly comparable. But, I don't know, it's hard to say because they're such murky titles in practice that data scientist could be anything from, you know, a business analyst to, you know, somebody pioneering research somewhere, and data engineer could be anything from a BI analyst to, you know, a core committer in Spark or something like that.
[00:24:04] Unknown:
It's always a complicated question when you're trying to understand what the salary ranges are for a given job title, again, because they're so nebulous depending on who's writing the description and who's actually doing the work. And one other question that I have is in terms of the sort of portability of the skills, where I'm wondering if, as the sort of responsibilities of a data scientist come to be in broader demand, whether you see the sort of prevalence of full-time data scientists within a company becoming less the norm and that being more of a sort of contract-oriented role where a data scientist will come in, understand the needs of the business, work to try and understand what the data needs are of the business, and then be able to hand off some of those responsibilities to in-house data engineers.
And if you see the data engineering role as being something that's more of a permanent fixture of a company in terms of maintaining their existing data systems and ensuring that they're continuing to operate as needed?
[00:25:06] Unknown:
I could absolutely see that. I mean, I think data engineering is a pretty natural alignment with just normal IT operations. So I don't really see that moving to a contract thing. For the most part, your goal as a data scientist, I think, is to become an expert at not having domain expertise. Right? Because if you lean too hard on the domain expertise, then at some point you're just an engineer in that domain who also knows machine learning. Right? So I think as the tooling that people use and the data systems kind of standardize a little bit more in the larger companies, I could see data science moving to a more kind of hired-gun sort of scenario.
Right now, I think everybody's data pipelines and the way that they're organizing data is so different company to company that you'd lose a ton of time just getting up to speed on everything. So I think things will stay in house for a little while.
Are there any other aspects of this topic that we didn't explore yet that you think we should cover before we start to close out the show?
[00:26:13] Unknown:
No. I think we covered just about everything I thought about.
So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'm wondering, from your perspective, what you view as being the biggest gap in the available tooling or technology for data management today?
[00:26:36] Unknown:
Yeah. So, I work pretty heavily in the Python scikit-learn kind of ecosystem, and I would love to be able to neatly package a trained model, but also with all the dependencies for that model in a separate namespace. Because right now you can train a model, you could use joblib or pickle or something like that to serialize it. But as soon as you load it back, if you've got different versions of scikit-learn and NumPy and SciPy and all these other things, you kind of hope it'll work. And a lot of the time it will. But I think we're still lacking a really good way to reliably save old trained models and use them later. And in a big recurring machine learning system, it's pretty critical so that you can version old models and fall back to them if something's wrong. So I think for me, that would be huge.
[00:27:22] Unknown:
Yeah. I think that one of the hopes is that Docker will continue to be sort of the panacea for that kind of problem area, but repeatability and reproducibility in computing in general has been sort of the consistent bugaboo for anybody trying to actually produce any sort of software and run it in a production context.
[00:27:53] Unknown:
Yep. Yep. And especially with these scientific libraries, we're pulling in so many dependencies and system libraries, all these things under the hood. Just kind of having everything at the right version at the right time over the span of a few years is very difficult.
[00:28:06] Unknown:
Absolutely. Alright. Well, I really appreciate you taking the time to share your thoughts on this subject area because it's definitely one that is important for a lot of people to be thinking about and trying to understand so that they can be effective in their roles. So thank you for taking your time, and I appreciate the work that you're doing. So I hope you enjoy the rest of your evening.
Yeah. Thank you for having me. I enjoyed it.
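The model-packaging gap raised in the final answer can be narrowed, though not closed, by recording dependency versions next to the serialized model and checking them at load time. The sketch below assumes plain `pickle` and the standard library's `importlib.metadata`; the function names and the bundle layout are hypothetical, and it only detects version drift rather than isolating each model's dependencies in a separate namespace as wished for in the interview.

```python
import pickle
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def dump_with_manifest(model, packages):
    """Serialize the model together with the package versions present right now."""
    manifest = installed_versions(packages)
    # The model is pickled separately so it can stay untouched until the check passes.
    return pickle.dumps({"manifest": manifest, "model": pickle.dumps(model)})


def load_with_check(blob, current=None):
    """Unpickle the model only if the recorded versions match the current ones."""
    bundle = pickle.loads(blob)
    recorded = bundle["manifest"]
    if current is None:
        current = installed_versions(recorded)
    mismatched = {
        name: (version, current.get(name))
        for name, version in recorded.items()
        if current.get(name) != version
    }
    if mismatched:
        raise RuntimeError(f"dependency versions changed since training: {mismatched}")
    return pickle.loads(bundle["model"])


model = {"coef": [0.1, 0.2]}  # stand-in for a trained estimator
blob = dump_with_manifest(model, ["scikit-learn", "numpy", "scipy"])
restored = load_with_check(blob)
```

A failed check is the signal to rebuild the recorded environment, for example in a container, before loading the old model. As with any pickle, only load bundles from a trusted source.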
Introduction and Announcements
Interview with Will McGinnis
Defining Data Scientist and Data Engineer Roles
Parallels with DevOps
Strategies for Collaboration
Evolving Roles and Tooling
Portability and Future of Data Roles
Closing Thoughts and Contact Information