Data Teams with Will McGinnis

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode

and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.

And go to dataengineeringpodcast.com

to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site.

To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media.

We've got a couple of announcements before we start the show.

There's still time to register for the O'Reilly Strata Conference in San Jose, California happening from March 5th to 8th. Use the link dataengineeringpodcast.com/strata-san-jose

to register and save 20% off your tickets.

The O'Reilly AI Conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business.

Go to dataengineeringpodcast.com/aicon-new-york

to register and save 20% off the tickets.

Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th.

It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective.

To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018

and register.

Your host is Tobias Macey. And today, I'm interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists.

So, Will, could you start by introducing yourself? Yeah. Sure. Thanks for having me on.

So my name is Will McGinnis.

I'm the chief scientist at Predikto.

We're a startup in Atlanta.

We have a software platform that helps big industrial companies predict failures in large transportation assets. So

planes, trains, cranes,

that kind of thing.

And we do that by taking in a huge amount of sensor data and maintenance data,

all types

of kind of messy,

maybe

not as managed as one would like data.

And we, you know, we clean it up, we merge it, use our machine learning engine, and try and tell somebody what they need to do ahead of time. Other than that, I do some open source work. So I'm a maintainer of categorical encoders in the scikit-learn-contrib project.

And do you remember how you first got involved in the area of data management?

Yeah. So I fell kind of into it,

maybe, you could say backwards. My educational background is mechanical engineering. So I did undergrad and graduate school in that.

And my research was in trying to predict,

wear based failures in different aerospace components with physics models. So we would, you know, go do these experiments and have, you know, huge amounts of time series data and try and build some model that had to predict it. And around when I was finishing that up,

I met,

the founders of Predikto

and ended up joining as first employee. So really early on, you know, I got to wear a ton of different hats. So I was kind of trying to do the machine learning part, but you have to do all of the data management parts before that. So for the first couple of years, most of the work was really

trying to build out a good

data pipeline, data management, how are we gonna take in all these different kinds of data without, you know, going through a different data management process for every customer. And, you know, we learned a ton as we went along, but we were kinda doing it on the fly.

You recently wrote a blog post about the tendencies of data engineers and data scientists.

And given the fact that you started off as the first

hire at Predikto, I'm sure you got to wear both hats for quite a while before you had enough people that it made sense to actually split out those different responsibilities into separate roles.

So to start with, I'm wondering if you can just explain

your

definition

of the terms data scientist and data engineer given the fact that they're often very overloaded and people will use very fluid definitions when they're referring to each of those different roles.

Yeah. Absolutely. I mean, I think they tend to be kind of fuzzy titles,

but I try to think of them and really any job title in terms of the domain of work that you're doing and the methods that you're using. Right? So

I think a civil engineer and mechanical engineer are both doing engineering work, but in two different domains.

Data scientist and data engineer are in the same domain.

They're both dealing with data problems, data analysis, data systems,

but the style of work that they're doing is different. Engineering work and science work

are,

you know, very different in terms of how you manage them, how you scope,

different things,

how you gather requirements, or if you even can. So

the domain and the things that you're dealing with are very similar, and there's a lot of overlap. You know, I think most people do a little bit of both. But when it comes down to actually trying to manage your tasks and decide, you know, what am I gonna do this week and how am I gonna let people know what I did?

They're very, very different. So,

I mean, I think it applies to basically any

job, especially in software where I think we like to pick a lot of different really granular

job titles often. I think picking tools, so somebody will say like, I'm a Hadoop engineer or something like that, which I think is maybe a little bit strange. But

I think the main takeaway is understanding

the workflow of science and data science kind of work

versus engineering. So

agile may not be,

particularly useful for data scientists

if you can't really scope the work that you're doing accurately enough to be very reliable. That's the

the first order

of understanding for that. Yeah. I could agree that identifying your role by the tool you use is rather strange because you can kinda think of it as if a carpenter were to use the same approach, they could say, oh, I'm a hammer engineer and that person over there is the saw engineer

where,

you know, you're ultimately working towards the same goal, and it doesn't really make sense to be so granular in separating how you do your job.

Right. Right. It's

I think it comes down to just both roles are dealing with data and care about data. At the end of the day, I think most of the time, the customer of the data engineer is the data scientist,

where the customer of the data scientist is probably some business unit or some business owner. And,

you know, day to day, how one

organizes their tasks and decides what to do is, you know, a data scientist is probably popping things off of a queue and they're not really sure how long this task is gonna take or it's very iterative.

Data engineers can kind of scope things up into more granular little tasks and,

share them amongst their peers more easily. And when I was reading your post too, one of the things that stood out is that when people use the term data engineer, a lot of times they'll use that to refer to somebody who builds up the data pipeline and is responsible for all of the data plumbing,

and the data scientist is usually the person who's seen as the one who's communicating with the business about what the data means and actually creating the presentations of it, but using the sort of description of the role as the engineer is the person who

takes the

domain

and makes it predictable,

it

opens up the idea of that role to being able to be somebody who actually does create the front end of the presentation layer as well where you may be,

stewarding the data all the way from collection through to presentation,

but you're not necessarily doing the exploratory

aspects of understanding

where to find additional sources of data or creating new ways of interpreting that data. You're just working on making sure that all the different steps that the data takes are predictable and repeatable and that may even include being the person who creates that presentation to the final business user.

Yeah. Exactly. And

and, you know, I try to think of the separation

between data scientists and data engineers also in terms of what physical thing they're going to hand to one another.

Right? So if a data scientist,

you know, is gonna develop some model in Python and,

you know, send over some

untested script and kinda throw it over the wall and say, alright. Go figure this out. That's maybe not a great collaboration for anybody.

Right?

The kind of joint role that they have, as two teams or two people, or even if you're just one person trying to separate things,

is to make a neat integration point where

the

data engineer's enabling the data scientists

to do the analysis they need to do and put something into production,

but they're not gonna

completely upend their own data pipeline for it. Right? You don't wanna have to do

a full deploy of something

just because somebody needs to update a model. Right? So that takes kind of some advanced thought, and

and you need to architect your system so that, you know, data scientists can do their job without disrupting the data engineer's job. And one of the things that I was noticing too as I was reading through your post and preparing for our conversation today is I see a lot of parallels between the way that you describe the relationship of the data scientist and the data engineer

and the same relationship that occurs between,

developers and systems administrators where developers and data scientists are agents of change where they want to be able to iterate quickly on things and see the work that they do get released to production

without necessarily [...], while systems administrators and data engineers are more interested

in restricting scope and creating reliability and consistency. And so there's this point of tension between the two roles and responsibilities

where they're all ultimately working towards the same goal, but they're going about it from opposite ends. So I'm wondering if you have sort of drawn the same parallels or if you have any thoughts on that idea. Yeah. Absolutely. I mean, I think it's an extremely similar dynamic and one that comes up directly here as well. Right? So if you have completely separate operations,

data engineering, data science, and, like, application development teams, they're all gonna be in some kind of tension with each other. Right? Especially in the "data scientist needs the data engineer, who needs operations" kind of pipeline.

And I don't know

that in any one organization there's like a real magic bullet there. The trade off is the more separation

that you're gonna have between them,

the more that you, like, treat the other group as a customer, you risk losing some of the, kind of, the cross pollination.

Right? So you'll maybe miss out on some good data engineering

idea that could have worked its way into the data science group or vice versa. But you're probably gonna ship more and ship faster.

So

trying to find that balance and, you know, where your product maturity is at that time and how that might affect where you wanna be in that balance,

I think is kind of a constant struggle for anybody. And have you seen very many instances of people taking some of the philosophies from the DevOps movement and bringing them into the realm of data in terms of,

sort of fostering that collaboration between the 2 groups to ensure that you don't create as much of that point of tension so that each side

is trying to build up empathy and understanding for the needs of the other side to ensure that they're delivering the business value rather than focusing solely on their own

responsibility

and, you know,

potentially to the detriment of the organization as a whole?

Yeah. I mean, I think it's happening. I think it's maybe a couple of years behind what you see in

the DevOps area.

There's

a number of

projects or companies kind of working on this. Let's

enable

data scientists directly to put things into production.

And I think for a lot of projects, that makes a lot of sense. But for things at, like, very, very high scales or things where you're dealing with external data, I think you're gonna always end up with,

you know, two separate groups or two separate people or whatever it is. And there will be this kind

of almost negotiation

on,

you know,

how much freedom do you let the data scientists have at the cost of engineering, at the cost of ops. And I'm not sure that there's

really a free lunch to be had there other than trying to be diligent about managing it and understanding that it is a trade off that, you know, you need to consciously make.

And in your experience, have you found that there

is any sort of consistency in terms of the size and scale of an organization or a problem domain that creates the tipping point where you start to need to separate the two roles

into separate responsibilities or having more than one person on a given team versus having the data scientist do both the engineering and the

exploratory aspects of it?

So I think

the scale at which you wanna split them is

pretty low,

honestly. I mean, especially if there's travel involved for the data scientists.

So, I mean, if you're traveling every other week to go see a customer or

to some business unit or something like that, I mean, it's just hard as a person to be traveling and dealing with something closer to ops like data engineering.

And

like I said, the workflow is different enough that it can be kind of hard

to context switch between something kind of exploratory and iterative

and kind of just cranking out more, I guess, normal engineering work. I don't necessarily think that one's really

harder than another, but,

at least for me, I do a lot better if I have to do both things, you know, doing all of one on one day and all of the other on the other day. It's a very different kind of head space to be in. So I would say even

if you're a one-person team,

splitting kind of your understanding of the 2 types of tasks across

different sprints or different days is, you know, beneficial and worth it. And beyond just the,

ability to gain efficiencies by splitting those responsibilities, are there any other benefits that you've seen by separating the data engineering and the data science roles into separate sort of problem spaces or job titles within a company?

I think separating these things out pulls you out of the weeds a little bit as an engineer or a data scientist, and you can maybe more readily find

parts of the stack that you could replace with something open source or some product or something like that, where

if you're kind of getting into this hack mode where you're kind of doing the science work, which is, you know, very broad and exploratory, and engineering work where you're kind of just serving your own descent down the rabbit hole, I think it's a lot easier to end up with one-off solutions.

Right? So having that tension where

there's somebody there saying, like, I'm an engineer, and I have to support this. We have to put some boundaries on things, I think is

overall healthy.

Yeah. And sometimes the best solutions to a problem occur because of these constraints that are being placed on the sort of capabilities

of delivering a given solution. Yeah. Exactly. And have you noticed any particular disadvantages

in having the responsibilities

separated among multiple people? Yes. I mean, anytime you put some restriction on

collaboration, which I guess at the end of the day this is, you're gonna risk,

you know, missing out on some good ideas. So

data science and data engineering, while they're very different

kind of workflows,

they're in the same domain. Right? You may be a little bit more specialized in one aspect than another, but a lot of really good data science work comes out of data engineering and vice versa. Right?

So

I think if you have

somewhat larger teams, it's really important to have, you know, at least one person that's showing up to planning meetings for both that's, you know, facilitating

some free flow of communication between the two. You don't wanna make engineers be in, you know, twice as many scrums or whatever it is. But whether it's a product manager or just one engineer or data scientist that's gonna be the go-between, keeping

that open

is gonna help reduce risk of, you know, a data science group beating their heads against a wall on something that data engineering has a solution for or vice versa. And are there any particular

strategies that you've seen to help ensure

a successful

collaboration between data engineers and data scientists or any particular tooling or platforms that you've seen

that help to foster that relationship

and make it easier for them to collaborate?

So I think

the simplest and first thing to make sure you have

is if you're a data engineering team,

build out whatever you can to decouple

data science's work from your

deployed pipeline.

So if every time they want to put something new into production, you have to do a deploy, that's really disruptive to both teams.

So

building, you know, config driven, well tested

pipelines where they can change a config and you don't have to rerun a bunch of builds and maybe have downtime or whatever it is for your system. I think it's the first enabling step to kind of give both groups their own autonomy. And how do you view the roles evolving as they become more prevalent across more companies and more industries?
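
The config-driven pipeline idea in the answer just above can be sketched roughly as follows. This is only an illustration under stated assumptions: the scorer names, the config keys (`scorer`, `threshold`), and the JSON file layout are hypothetical and are not taken from any pipeline described in the interview. The point it shows is that the pipeline re-reads its config on every run, so a data scientist can swap the scoring logic or threshold by editing a file rather than asking for a new deploy.

```python
import json


# Hypothetical scoring functions that the deployed pipeline already ships with.
def mean_score(values):
    return sum(values) / len(values)


def max_score(values):
    return max(values)


# Registry mapping config names to code that is already deployed and tested.
SCORERS = {"mean": mean_score, "max": max_score}


def run_pipeline(readings, config_path):
    """Score one batch of readings using whatever the config currently selects."""
    # The config is re-read on every run, so edits take effect without a redeploy.
    with open(config_path) as fh:
        config = json.load(fh)
    scorer = SCORERS[config["scorer"]]  # a KeyError here means an unknown scorer name
    score = scorer(readings)
    return {"score": score, "alert": score > config["threshold"]}
```

With a config of `{"scorer": "mean", "threshold": 4.0}`, calling `run_pipeline([1.0, 2.0, 6.0], "config.json")` scores the batch by its mean; changing `"scorer"` to `"max"` in the file changes the behavior on the next run with no code change.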

So essentially, I mean, we've been in a lot of very large, maybe not traditional,

software organizations and seen how they've done it. They're very, very different company to company, how they just organize

data science, data engineering teams.

So I think

one thing that

I've kinda wrapped my head around, I guess, is I think data engineering will be centralized

in an IT organization.

So you have one big data engineering group

that sits, like, within corporate, and

maybe as they get bigger, they're starting to put individuals into business units to kind of go upstream in the data, but all of that will be centralized.

The data science groups, I see

basically every possible place in an organization they could sit. It might be corporate, it might be its own business unit,

it might be all external, there might be one in every business unit. I think over time, we're gonna see data science groups

align much more closely with

the end business units.

So you'll have a small data science group in, you know, every business unit of a big conglomerate,

but one data engineering group. And one of the things that I've seen too is that

as the principles

and ideals behind data science

and in particular machine learning and artificial intelligence

become more common and more practiced, there are a lot of patterns that are emerging and tooling that is coming out to make it easier for people who aren't necessarily as well versed in the underlying statistics to be able to

take those concepts and put them into production and be able to deliver value to the business.

So I'm wondering if you see

the role of the data scientist becoming more specialized

as some of those tools

come into the domain of the data engineer to be able to deploy those solutions on top of existing data without necessarily having to engage a dedicated data scientist for a particular problem. And then also

in the reverse, a lot of these

platforms for being able to collect and process data are becoming easier to run, and there are a number of cloud providers that are actually starting to produce

managed solutions that make it easier for data scientists

to have more of a, you know, one-click deploy solution to be able to gain access to all the data that they need to be able to do their exploration and experimentation?

Yeah. So that

it's very interesting to me. I think the root problem here is, frankly, that data scientists are expensive. Right? And there's a lot of value that they bring, but they're expensive. So you see two big pushes

in industry.

On one end, there's this kind of idea that we should democratize data science and enable less technical people,

like business analysts or people just in the regular business unit that are, like, maybe a power user of Excel or something like that, to apply machine learning in their existing roles. And then on the other end, you have this camp that's kind of saying, well, you know, just applying machine learning is not really the difficult

part so much. So what can we do to

need fewer data scientists? Right? So

instead of saying how can I get

them more cheaply,

how can I do more with

the 5 or 10 or however many that I have? I tend to think that that is

probably

the approach that'll win out in the future. I think at the end of the day, the hardest things for a data scientist are,

really, like, problem formulation,

selling projects internally, convincing people that, you know, this

magical black box is not making things up. More soft skills, validation,

things like that, than

in just training a big

regression model. So

I tend to think

that tools that enable,

a highly qualified data scientist to do, you know, 10 or 100 times more projects in a year

versus

tools that help

somebody less qualified

do one or two models a year will win out in the long run. And one of the things too that you briefly highlighted is the fact of the expense involved in keeping a data scientist on staff.

So I'm wondering if you have any,

sort of anecdotal evidence of the relative cost or the relative salary brackets for data scientists versus data engineers,

and how you've seen that evolve as the roles have started to gain a more sort of [...]

In Atlanta, relatively comparable

with the caveat that the base qualifications

for

data scientists tend to be higher. So a lot of places, it's all PhDs.

So they're kind of starting out already mid career

or in a little bit higher bracket. So if you just kinda, like, lop off

the new grad pay range from data scientists,

that does still exist for data engineers.

And then after that, I think they're fairly comparable. But it's

I don't know. It's hard to say because they're such murky titles in practice that

data scientists could be anything from, you know, a business analyst to,

you know, somebody

pioneering research somewhere

and data engineer could be anything from a BI

analyst to, you know, core committer in Spark or something like that. It's always a complicated

question when you're trying to understand what the salary ranges are for a given job title again because they're so nebulous

depending on who's writing the description and who's actually doing the work. And one other question that I have is in terms of the sort of portability of the skills where I'm wondering if as the sort of responsibilities of a data scientist

come to be in broader demand

whether you see

the sort of prevalence of full time data scientists within a company becoming less the norm

and that being more of a sort of contract oriented role where a data scientist will come in, understand the needs of the business,

work to try and understand

what the

data needs are of the business, and then be able to hand off some of those

responsibilities to in house data engineers.

And if you see

the data engineering role as being something that's more of a permanent fixture of a company in terms of maintaining their existing data systems and ensuring that they're continuing to operate as needed?

I could absolutely see that. I mean, I think data engineering

is a pretty natural alignment with

just normal IT operation. So I don't really see that moving to a contract thing. For the most part, data scientists,

you kind of

your goal as a data scientist, I think, is to become an expert at not having domain expertise.

Right? Because if you lean too hard on the domain expertise, then at some point you're just an engineer in that domain who also knows machine learning. Right? So I think as the

tooling that people use and the data systems kind of standardize a little bit more

in the larger companies, I could see data science moving to a more kind of hired gun sort of scenario.

Right now, I think everybody's

data pipelines and the way that they're organizing data is

so different company to company

that you'd lose a ton of time just getting up to speed on everything. So I think things will stay in house for a little while. Are there any other aspects of this topic that we didn't explore yet that you think we should cover before we start to close

out the show? No. I think we covered just about everything I thought about. So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question,

I'm wondering from your perspective,

what you view as being the biggest gap in the available tooling or technology for data management today?

Yeah. So,

I work pretty heavily in the Python scikit-learn kind of ecosystem,

and I would love to be able to neatly package a trained model, but also with all the dependencies for that model in a separate namespace.

Because right now you can train a model, you could use joblib or pickle or something like that to serialize it.

But as soon as you load it back, if you've got different versions of

scikit-learn and NumPy and SciPy and all these other things, you kind of hope it'll work. And a lot of the time it will.
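
One partial mitigation for the version-drift problem described here can be sketched as follows. It does not solve the packaging gap raised in this answer; it only records which package versions were installed when a model was pickled and refuses to load if they have changed. The function names and the `.versions.json` sidecar file are hypothetical choices for illustration, and only the standard library is used (`pickle`, `json`, `importlib.metadata`), so the same pattern would apply whether the object came from scikit-learn or anything else that pickles.

```python
import json
import pickle
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def save_model(model, path, packages):
    """Pickle the model and write the versions of its dependencies to a sidecar file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    versions = {name: installed_version(name) for name in packages}
    with open(path + ".versions.json", "w") as fh:
        json.dump(versions, fh)


def version_mismatches(path):
    """Map each package whose version changed to a (saved, installed) pair."""
    with open(path + ".versions.json") as fh:
        saved = json.load(fh)
    return {
        name: (old, installed_version(name))
        for name, old in saved.items()
        if old != installed_version(name)
    }


def load_model(path):
    """Check versions before unpickling, since unpickling itself may fail or misbehave."""
    mismatches = version_mismatches(path)
    if mismatches:
        raise RuntimeError(f"dependency versions changed: {mismatches}")
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

The versions live in a separate JSON file rather than inside the pickle so that they can be inspected without unpickling anything. A stricter setup would pin the whole environment per model, which is closer to what the answer is asking for.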

But I think we're still lacking

a really good way to

reliably

save old trained models and use them later. And

in a big recurring machine learning system, it's pretty critical so that you can

version old models and fall back to them if something's wrong. So I think for me, that would be huge. Yeah. I think that one of the hopes is that Docker will continue to be sort of the panacea for that kind of problem area, but

repeatability and reproducibility

in computing

in general has been sort of the consistent

bugaboo for

anybody trying to actually produce any sort of software

and run it in a production context. Yep. Yep. And especially with

these scientific libraries we're pulling in, you've got so many dependencies

and system libraries, all these things under the hood. Just kind of having everything of the right version at the right time

over the span of a few years is

very difficult. Absolutely. Alright. Well, I really appreciate you taking the time to share your thoughts on this subject area because it's definitely one that is

important for a lot of people to be thinking about and trying to

understand

so that they can be effective in their roles. So thank you for taking your time and I appreciate the work that you're doing. So I hope you enjoy the rest of your evening. Yeah. Thank you for having me. I enjoyed it.
