Summary
Machine learning is a class of technologies that promises to revolutionize business. Unfortunately, it can be difficult to identify and execute on opportunities to apply it in large companies. Kevin Dewalt co-founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.
Introduction
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Kevin Dewalt about his experiences at Prolego, building machine learning projects for Fortune 500 companies
Interview
- Introduction
- How did you get involved in the area of data management?
- For the benefit of software engineers and team leaders who are new to machine learning, can you briefly describe what machine learning is and why it is relevant to them?
- What is your primary mission at Prolego and how did you identify, execute on, and establish a presence in your particular market?
- How much of your sales process is spent on educating your clients about what AI or ML are and the benefits that these technologies can provide?
- What have you found to be the technical skills and capacity necessary for being successful in building and deploying a machine learning project?
- When engaging with a client, what have you found to be the most common areas of technical capacity or knowledge that are needed?
- Everyone talks about a talent shortage in machine learning. Can you suggest a recruiting or skills development process for companies which need to build out their data engineering practice?
- What challenges will teams typically encounter when creating an efficient working relationship between data scientists and data engineers?
- Can you briefly describe a successful project of developing a first ML model and putting it into production?
- What is the breakdown of how much time was spent on different activities such as data wrangling, model development, and data engineering pipeline development?
- When releasing to production, can you share the types of metrics that you track to ensure the health and proper functioning of the models?
- What does a deployable artifact for a machine learning/deep learning application look like?
- What basic technology stack is necessary for putting the first ML models into production?
- How does the build vs. buy debate break down in this space and what products do you typically recommend to your clients?
- What are the major risks associated with deploying ML models and how can a team mitigate them?
- Suppose a software engineer wants to break into ML. What data engineering skills would you suggest they learn? How should they position themselves for the right opportunity?
Contact Info
- Email: Kevin Dewalt kevin@prolego.io and Russ Rands russ@prolego.io
- Connect on LinkedIn: Kevin Dewalt and Russ Rands
- Twitter: @kevindewalt
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Prolego
- Download our book: Become an AI Company in 90 Days
- Google Rules Of ML
- AI Winter
- Machine Learning
- Supervised Learning
- O’Reilly Strata Conference
- GE Rebranding Commercials
- Jez Humble: Stop Hiring DevOps Experts (And Start Growing Them)
- SQL
- ORM
- Django
- RoR
- Tensorflow
- PyTorch
- Keras
- Data Engineering Podcast Episode About Data Teams
- DevOps For Data Teams – DevOps Days Boston Presentation by Tobias
- Jupyter Notebook
- Data Engineering Podcast: Notebooks at Netflix
- Pandas
- Joel Grus
- Expensify
- Airflow
- Git
- Jenkins
- Continuous Integration
- Practical Deep Learning For Coders Course by Jeremy Howard
- Data Carpentry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat
[00:00:56] Unknown:
to join the community and keep the conversation going. Your host is Tobias Macey, and today I'm interviewing Kevin Dewalt about his experiences at Prolego, building machine learning projects for Fortune 500 companies. So, Kevin, could you start by introducing yourself? Well, thanks a lot, Tobias. And first, let me just thank you for having me on the show. You've got a great podcast here; our team are subscribers and we're fans, and I'm really honored to be here with you today. So thank you. I'm Kevin Dewalt, cofounder of Prolego. We are a boutique consulting company which helps large organizations achieve business goals with AI and machine learning through three practices.
We do training, strategy, and then the one we'll spend most of our time talking about today, which is model development and deployment. And do you remember how you first got involved in the area of data management? Yeah. I've been in tech for a couple of decades now, as an engineer, startup founder, and an investor, both angel and venture capitalist. So one way or another, I've been doing this for, I don't know, close to 30 years now. A couple of career items of note that are relevant to what I do today: in the mid-1990s, I was a graduate student at Stanford, and I did my work in neural networks under Dr. Bernard Widrow. At the time, I was a young graduate student, and I thought, wow, this is great. Machines that think; I'm gonna solve all these amazing problems. Of course, about three weeks into the coursework, I realized we were in the middle of the AI winter.
All the exciting things that I wanted to do weren't possible. So now, 24 years later, I finally get to apply my graduate degree and actually do real work with neural networks and machine learning. A couple of other relevant career experiences: I ran the machine learning infrastructure at FINRA, which is the US regulator for the Nasdaq stock market, and I was on the original deal team for the first investment by the US intelligence community into Palantir. So one way or another, most of my career, I've been involved in data management in some form or another. And so for the benefit of software engineers
[00:02:48] Unknown:
and team leaders and heads of organizations who are new to the concepts of machine learning and artificial intelligence, can you give a bit of an overview as to what machine learning is and why it's relevant to them, and maybe give a bit of differentiation or clarification as to the differences between machine learning and artificial intelligence, at least as far as we'll use them within this conversation? Yeah. Great question.
[00:03:13] Unknown:
And one of the challenges we run into is that people use these terms so differently. You kind of have to check with your audience before you start having a conversation about them. So here's how we use these terms with our clients. AI for us just means intelligent software. It's about as specific as using the term "the Internet." Right? You can't start an "Internet project" anymore, although I guess you could 20 years ago. It just means thinking computers. So I really think of AI as just a symbol that represents the era we're in. It's a broad term for all kinds of technology, and sometimes that's really useful. Right? If you're talking to a general audience, you wanna be general about what you do. Machine learning is a more specific term and a more useful term, and it describes a different way of building software.
We see people using the term machine learning when they mean something else, so I'm gonna be very specific in my definition here by first explaining what machine learning isn't. 99% of all software in the world is built by software engineers telling a computer explicitly what to do. Software engineers are smart people, and they can read a spec or talk to a customer or look at a log file, and generalize what needs to be done into instructions that tell a computer how to do something. Right? That's how most software is built. Machine learning is a different way of programming computers. Instead of having the software engineer explicitly tell a computer what to do, the engineer uses data to teach a model how to perform a specific task.
So the engineering in machine learning involves picking the right model and organizing the data in such a way that you get a computer to do what you want it to do. And for the purists out there, the technique I'm describing is supervised learning, which is overwhelmingly the type of machine learning that's put into production today, and probably the most relevant one for most of your audience. So when people use the term machine learning, it means a different way of building technology that requires a different way of working, different skill sets, and a different infrastructure. As far as why it's relevant: most engineering projects don't require machine learning. Regular old software development is great for solving most problems. Developers are smart people, and if you have a relatively straightforward technical challenge that doesn't vary a lot, then writing a simple heuristic is great. You don't need machine learning to implement the rule
"if x is greater than 10 and less than 20, do y." Right? You're better off just writing some straightforward heuristics that generalize the world and get the computer to do what you wanna do. Machine learning is becoming more and more popular because the world's getting more complex. As we get more and more data, as we rely more and more on computers for making decisions, and as we increasingly demand accuracy and precision from what our computers do, machine learning becomes a better technique for getting computers to do what you wanna do. And so, generally, that's the difference between the two.
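To make that contrast concrete, here is a minimal sketch in Python, the stack discussed throughout the episode. The heuristic version encodes the rule by hand, while the supervised version learns a comparable boundary from labeled examples. The data points and the choice of a decision tree are our illustrative assumptions, not details from the conversation.

```python
from sklearn.tree import DecisionTreeClassifier

# Explicit programming: the engineer writes the rule directly.
def heuristic(x: float) -> bool:
    return 10 < x < 20  # "if x is greater than 10 and less than 20, do y"

# Supervised learning: the engineer supplies labeled examples instead,
# and the model infers the boundary from the data.
X = [[5], [8], [12], [15], [18], [25], [30]]  # observed feature values
y = [0, 0, 1, 1, 1, 0, 0]                     # labels from the behavior we want to mimic

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[14], [28]]))  # -> [1 0], approximating the hand-written rule
```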
[00:06:25] Unknown:
Prolego, your primary mission is focused on helping larger organizations get started with their first machine learning projects and helping them establish the technical capacity to build and maintain those projects on an ongoing basis. Can you describe how you identify potential clients and then execute on establishing a presence in the market that you're targeting?
[00:06:52] Unknown:
So, my business partner Russ Rands and I started the company about two years ago, and we both recognized, or at least at the time we had a hypothesis, that large companies were gonna need this technology. In my case, that was a pretty specific recognition after my last startup. The last startup I did before Prolego was a product company, and we were trying to do machine learning for content marketing. It was a terrible idea. I can say that because it was my idea. It didn't fit with the market needs, the data didn't match; for many reasons, it just wasn't a good space at the time for machine learning. But as a result of that experience, I would go in and talk to executives at big companies, and the long and short of the conversation was, "Kevin, we like you. You're great, but your product is crap. But since you're in our office, can you tell me about this AI stuff?" And what we both started to realize was that neither of us had seen such demand, concern, and enthusiasm about a technology shift since the Internet in 1995-96. For those in your audience who were in their careers when that was happening, the Internet was met with the same amount of confusion and enthusiasm as you have today. I mean, people would literally ask me, "What is the Internet, and where is it?" We can't imagine the world without the Internet today, but there was a time when big companies were trying to figure out: what do we do about this? Where does it live in the organization? How can we make money with it? What does it cost? And I began to see the same type of questions coming out of large non-tech companies.
So we spent the first six months out talking to the market and talking to executives about these types of problems. And what we started to realize was that the companies that were gonna most need help from a company like ours were the large non-tech companies: banks, insurance companies, industrials, manufacturing, automotive, the companies that don't have a strong existing technical infrastructure to draw on. And so we began building relationships with all those executives by helping them figure out what this stuff is and what they can do with it. As far as how we establish a presence: the market is still pretty early, and there really are not a ton of companies like ours, at least that we have met. Most big companies are just now starting to bring their machine learning models online. 2017, the way I describe it, was the year of executives going to conferences.
2018 was the year of actually putting some budget in place and building a team. And now, in 2019, they're starting to ramp up and begin deploying models into production.
[00:09:37] Unknown:
Yeah. And I went to the Strata Conference last year, and I was struck by the fact that a lot of the presentations were focused on actual practical production uses for these different data-oriented technologies, whereas in the preceding years, a lot of the conversations I had were more in terms of "let's just collect all of this data because eventually somebody will know what to do with it." Now it's much more focused on "we actually have a business case for collecting this data and performing these analyses on it, and maybe we don't actually want all that data that we've been collecting, because of some of these new regulatory environments."
[00:10:13] Unknown:
Yeah, boy, there's a lot I can say about that, including all of the money and time wasted building up this massive amount of data and then discovering that a lot of it either you can't use or it's not as valuable as you were hoping it would be. We see that a lot. But, yeah, you're right: now it's a lot more focused. Companies are getting a lot more specific in what they wanna do. And when you're speaking with your different clients and sales prospects, how much of your time and effort is spent on educating them about what machine learning and artificial intelligence are and the capabilities that they provide? I would say every step of what we do is education. It really is the foundation of our entire business, and part of that might just be that this is how Russ and I work; this is our personalities. But typically, a sales process will start with either a conversation, or an executive will have read our book, and they're coming to us because they're trying to get started with this or they're frustrated with the situation within their organization.
Eighteen months ago, the questions were, "Hey, what is this AI stuff, and what can I do with it?" Now we're increasingly getting the topics that you cover in your wonderful podcast here, which are: we're not getting what we were hoping to get out of our investment in this, we're not really deploying things into production, we've got a lot of infrastructure challenges. What can we do to overcome those? So the education part of it has shifted. It typically does still start with some sort of conversation. After that, we'll do a workshop or webinar. Sometimes we'll come into a client site and just train them on the basics of some of the technology for a day. After that, it's typically a strategy engagement, and then we go into the model development and deployment. Really, across the entire end to end, people are hiring us not just to execute on something for them, but to help them become a better organization and bring on the right people and the right process to do things themselves.
[00:12:12] Unknown:
And when building a new machine learning project, particularly from scratch, where you don't have a lot of the necessary infrastructure in place ahead of time, what have you found to be the necessary set of technical skills, technical capacity, and operating environment for being able to be successful in building and deploying these models? So to get started, there are three
[00:12:37] Unknown:
specific skills that we suggest our clients begin ramping up on. The first is AI product management, which is just product management like you would think of it in a normal context, except very specifically focused on getting models built: identifying data, working through legal issues, the wrangling of all the challenges you have to deal with in a server-side environment where you're building probabilistic models. So that's one skill. Those people don't exist off the shelf; it usually involves taking some smart product managers in your organization, maybe with a statistics or engineering background, and teaching them on the job how to do it. The second skill, the one that most people talk about, is the data science side of things, the machine learning engineers, the people that really build the models.
Those are the folks that go out and gather the data, work with Jupyter Notebooks, do the testing and the evaluation, put the models together, and try to hit some sort of benchmark. And the third skill, the most misunderstood, the hardest to find, and the one you do such a great job of covering, is the data engineering. That's the person who is going to take the models and put them into production, ensure they're reliable and doing what you want, and ensure you have the infrastructure to be able to scale and iterate on them. So those are the three skills we look for in the beginning. It's rare that a client has all three of those ready to go. They typically have pockets of these skills deployed across the organization.
So we're usually involved in helping them figure out how they can either reorganize to bring a team together to execute on one of these projects, or bring in people to fill the gaps in areas where they need help. And as you were saying, it can be difficult to find a lot of these skill sets either internally or even as external candidates. So I'm wondering if you can talk a bit about your experiences
[00:14:25] Unknown:
of identifying people who don't necessarily come with the prepackaged set of skills, but have the necessary background or enthusiasm, so that you can train them effectively and bring them up to speed to help build and manage these projects?
[00:14:41] Unknown:
Yeah, great question. I think one of the biggest mistakes we see is that when an HR recruiter or a hiring manager wants to go and fill a data engineering position or a data science position, they go out and copy Google's job req. Right? And so what you end up doing is trying to compete for the same people that Google and Apple and Facebook are competing for. When you're hiring, it's really worth understanding what the key skills are that you need, and what kind of people you can attract and bring to your organization to fill them.
There are a lot of people that wanna get into this industry and a lot of people who are learning on their own, and if you package the opportunity the right way, you can attract those people. So my first piece of advice for anybody out there who is hiring is to sell the job. I'm shocked by how little effort most companies put into selling the opportunity to prospective candidates. I know there are legal and HR reasons for this, but if you're putting together a new team that you're gonna use to transform your business, tell people that. I don't know, Tobias, if you saw any of the GE commercials when they were trying to rebrand themselves as a software company. There were TV commercials they ran. Did you ever happen to see any of those? I did not, no. You know, it was a campaign they ran last year sometime, and, unfortunately, General Electric has run into so many other problems since then. But they really went out of their way to rebrand themselves as a software company by showing software engineers chatting over a barbecue about all the great engineering work they're doing at General Electric. I'm not suggesting anybody needs to do Super Bowl commercials, but when you're writing your job requirements, make it clear that your company is changing and that you're looking for talent to be able to do this type of work. So that's my general recommendation on this front. As far as the specific skill sets: on the machine learning side, again, one of the worst things you can do is copy Google's job req. It really depends on what you're doing. If you're trying to advance the state of the art in a particular area that's critical, whether it's computer vision or NLP, and you really need people who come from a research background, then you're gonna have to get somebody with a PhD who's done primary research and built models. And there are a lot of people out there with physics backgrounds and scientific and economic backgrounds who are really trying to get out of academia. Maybe they don't like their academic advisor, or they decided academia is not for them, or they've heard data science is the way to go. So there are a lot of people looking for those opportunities. You wanna find people who are good programmers and who have some history of shipping products. So a PhD economist with some Python skills: if that person can demonstrate they're a good programmer and they know how to work in an operational environment, then that person is probably a good choice.
On the data engineering side, as you well know, the challenge you're dealing with is that you're competing with every organization that's trying to hire DevOps and infrastructure people. And as both of us know, there's a real shortage of those, and the prices for experienced people have really gone through the roof. So try to find people who have all the basic technical skills to be able to do this: they've got good server-side Python skills, they have basic DevOps experience, they've worked with Jenkins, worked with Git. But one of the things we run into is: make sure they can do SQL. When I started my career, everybody worked with databases and you learned how to program in SQL, but we're increasingly seeing more and more candidates who have worked with these object-relational mappers, these ORMs inside frameworks like Django and Rails, and they don't really have the SQL coding experience. Those things are really critical for attracting the right people. Yeah, there's a great talk by Jez Humble about how to
[00:18:35] Unknown:
grow your own DevOps engineers instead of everybody competing for the people who are already at the top of their game, when you have a huge amount of potential within your existing employees: just give them the opportunity and the training to get up to speed on these technologies, and give them the space to learn and gain the experience necessary to help your organization be successful.
[00:19:06] Unknown:
Exactly. Yeah, that's what we run into. There seems to be an expectation that this AI, this machine learning stuff is so hard that people could never do it here. But if you look around the organization, you'll find people that are probably already self-motivated: the person who's been programming in Python for five years, who on their own is taking online courses and learning how data science works, learning how models work. These are the kind of people you really wanna give an opportunity to. And particularly,
[00:19:36] Unknown:
as you mentioned before, the majority of the actual work that's necessary for getting a machine learning project out the door isn't done by somebody doing groundbreaking research into new types of models or new neural network architectures, but by somebody who's able to take TensorFlow or PyTorch or your machine learning framework of choice and put it to its best use: running model training, ensuring that the model matches expectations, and then being able to deploy and monitor it in production.
[00:20:07] Unknown:
Exactly. That's well said, Tobias. Most companies are not in a position where they're gonna be pushing the state of the art. They need to leverage their data and their customers and their relationships, and take what exists off the shelf and put it to work. And so with these different responsibilities
[00:20:24] Unknown:
in a data team, where you have the data scientist and the data engineer, and in some cases the machine learning engineer that sits somewhere in between those two, what are some of the potential challenges or tensions that arise in those team structures, and some of the strategies that you have found to help ensure that the team reaches cohesion and is able to be successful in shipping the final product? Yeah, this is the number one challenge that we see companies running into, and I could spend a lot of time talking about it; I spend a lot of time thinking about it and talking to clients about it. And the root of the problem is that data scientists
[00:21:02] Unknown:
and data engineers work differently, and they fundamentally think about the world differently. What I mean by that is, data scientists are in discovery mode. Even if they download an off-the-shelf algorithm, they're trying to use data to train that algorithm to do a specific task. And so it's very iterative. It's testing, it's exploring. Jupyter Notebooks are called notebooks for a reason: it's like a scientist's notebook. And so 90%, maybe 95%, of what they do gets thrown out; it's just a small portion that ends up getting deployed. The entire way they work is designed for optimizing the discovery process. Data engineers, on the other hand, are doing operations. They're trying to ensure that things are reliable, do what they say, are stable and testable, and can be iteratively improved. And these are completely different ways of viewing the world. We typically run into a very common situation where we go into a client, and the machine learning team, the data scientists, are some brilliant people who have been working on a model for three or four months.
And they've got their notebooks built, they've got all their data in there, they've got all this code running, and then it becomes time to deploy the thing. Management hasn't thought about this in advance, so they sit down with the developers and say, "Hey, here are our notebooks. Can you guys deploy this?" And the programmers freak out. Right? Because they look at this code and think, "Oh my god, I don't understand what this stuff is. They're not maintaining state between cells. What is this notebook?" And that was my reaction the first time I saw this. As far as what you can do about it: first of all, just recognize that people are coming at this from two different perspectives, that they're not easily going to work together, and that the process of taking notebooks and putting them into production is a complicated one. I can get into the actual development process a little bit later, because there's more I can say about that. But as far as improving communication between the two: the more you can get your data scientists becoming better programmers, the more they can abstract things in their notebooks into classes and methods, and the more they can organize their notebooks into a serial workflow that an engineer could easily translate into a production application or a data pipeline, the better that communication's gonna be. And by the same token, if your data engineers know more about data science, if they've taken a basic data science course, if they know what Pandas is, NumPy, the basic data science stack, they're gonna be able to understand what the data scientist is doing a lot better and be able to put things into production.
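As a rough illustration of that handoff advice, here is a minimal sketch of what abstracting a notebook cell into a class might look like. The DurationFeatures class and its column names are hypothetical, our own invention rather than anything from the episode.

```python
import pandas as pd

# Instead of scattering this logic across notebook cells, the data scientist
# wraps each step in a named, testable unit that an engineer can lift into a pipeline.
class DurationFeatures:
    """Hypothetical feature-engineering step: raw records -> model-ready features."""

    REQUIRED_COLUMNS = ["opened_at", "closed_at", "category"]

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        missing = set(self.REQUIRED_COLUMNS) - set(df.columns)
        if missing:
            raise ValueError(f"missing columns: {missing}")
        out = pd.DataFrame()
        out["duration_days"] = (
            pd.to_datetime(df["closed_at"]) - pd.to_datetime(df["opened_at"])
        ).dt.days
        out = out.join(pd.get_dummies(df["category"], prefix="cat"))
        return out

# In the notebook, the cells then reduce to a serial workflow the engineer can replay:
# features = DurationFeatures().transform(raw_df)
# model.fit(features.drop(columns="duration_days"), features["duration_days"])
```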
[00:23:48] Unknown:
Yeah, there's a great interview that I did earlier with Matthew Seal from Netflix about their experience of trying to build a unifying environment for using notebooks as the common method of communication between the different roles of the team as well, so I'll put a link to that in the show notes. And on the point of trying to build empathy between the data scientists and the data engineers and trying to bridge the sort of inherent conflict between their roles, there's another interview I did about that as well, and I did a presentation at DevOps Days Boston on the concept of DevOps for data teams, which I'll add to the show notes too for anybody who wants to dig a bit deeper into some of those various topics. Yeah, your interview with Matthew Seal from Netflix was excellent. The way he talked about how they're doing it
[00:24:38] Unknown:
really covered a lot of the major challenges we see with companies that wanna put notebooks into production. I listened to it at least twice, taking notes and very carefully trying to parse his words. But I would encourage caution to any companies out there that wanna start by putting notebooks into production without really thinking about the infrastructure implications of what they're doing. For most of our clients, unless you're doing something that's relatively straightforward, notebooks into production is probably not where you wanna start, because you don't have Netflix's existing infrastructure that's able to handle this type of unstructured environment and code where you can't maintain state. Just out of curiosity, have you seen any of Joel Grus's work? He has a great presentation, which we can put in the show notes as well, called "I Don't Like Notebooks," and he gave this presentation
[00:25:30] Unknown:
at the Jupyter conference. That's all it is: basically bashing notebooks for 45 minutes. Yeah, I saw the slide deck. I didn't see the presentation yet, but it's something on the order of 130 slides that he zips through to cover all of the different shortfalls of notebooks as a development environment and some of the ways to work around them and improve the overall workflow. So I'll definitely add a link to that. And he also
[00:25:55] Unknown:
wrote a good book about data science from scratch, for anybody who wants to get a better understanding of some of the fundamental principles and algorithms behind what powers a lot of the current set of data science tools. Yeah, he's got some great resources. And the first time I saw his slide deck, it was crazy: it was like 150 slides, and there's animation in there. I don't know if you've ever heard him speak, but he's also a comedian, so he's fairly entertaining, and I think a lot of his advice was designed to be entertaining more than educational. I have to say, I do like notebooks. I think they have a great use; we use them all the time. But he does bring up a couple of good points about the challenge of maintaining state and structure, and why it's difficult for some engineers to be able to use them. And so moving into the actual project phase, can you give a brief discussion
[00:26:47] Unknown:
of the steps involved in developing a first machine learning model and putting it into production, including some breakdown of where the time was spent at the various stages? Yeah, I can actually walk through a really specific project we did last year, because we went to some lengths to
[00:27:05] Unknown:
try to document the time we spent in different phases, and it was the first one for a company, so it's probably a good use case. In this instance, it was a large Fortune 500 company that had done very little machine learning. They had a new data science team, and they had spent some time doing a lot of data gathering and data cleanup, but they really weren't sure what models they wanted to put into production. They had a bunch of ideas, so we started working with them and helped them think about a couple of different initial models they could put into production. The one they decided on was to predict how long a particular internal business process would take. Usually we find that when you ask management, "What do you want us to work on first?"
they come up with, "Oh, we'd like you to build an AI system that can automate what these 5,000 people are doing." They aim for the stars, not realizing that they're asking for something that's gonna take 10 years. So part of our work is educating them on choosing the right problem to solve. One of the most frequent ones we suggest is actually predicting how long a business process will take. The reason it's a good first choice is that, technically, it tends to be fairly straightforward. You tend to have plenty of training data, and it's easily interpretable: you don't have to get into a debate about how long something took. Time is time, and it's usually in your database as a number of days. And usually you can rely on off-the-shelf models from scikit-learn and one of these stacks to do the predictions. The benefit of a project like this is that you can almost immediately apply it to your operations, because if you can predict how long the business process is gonna take at the outset, then your management or operations can intervene and try to improve things, and you get an opportunity to begin building up all of the infrastructure and process and team development you need to get your first models up and running. Our number one piece of advice when companies work on their first model is to get it up and running as fast as they can, because often the biggest risks on these kinds of projects are that nobody's gonna use it or that it doesn't really provide the business value you need. In this particular instance, we got the first release up in 65 days, and this is a company that just had some data and very little infrastructure. Then, after the first release was done, we worked through two-week release cycles over five releases that resulted in a 47% improvement in model performance. I like to use this as an example when we're planning other projects, because it shows how a big company can get things up and running without spending months and months training models or wrangling data or all of these things that you often don't need to do when you're trying to get off the ground for the first time. The biggest value in the entire process was actually putting a first thing into production, and that allowed the team to begin doing everything you need to do to do machine learning: helping people understand what a probabilistic model is, building their first data pipelines, building their first set of metrics, realizing the need for data engineering and the shortcomings in the team that they had, and putting in place the tools for deploying models into production. So that was one example.
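For readers who want to picture what such a first model can look like, here is a hedged sketch using the off-the-shelf scikit-learn stack mentioned above. The file name, column names, and the choice of a gradient-boosting regressor are illustrative assumptions, not details from the actual engagement.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Historical records of the internal process: attributes known at the outset,
# plus how long each instance actually took (the label).
df = pd.read_csv("process_history.csv")  # hypothetical extract
X = pd.get_dummies(df[["region", "product_line", "priority"]])
y = df["duration_days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor().fit(X_train, y_train)
preds = model.predict(X_test)

# Days are directly interpretable, so a plain MAE is a sensible first benchmark.
print(f"MAE: {mean_absolute_error(y_test, preds):.1f} days")
```

And once the model is in production,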
[00:30:19] Unknown:
I know that there are various metrics that are useful to track to guard against issues like model drift, alongside the standard set of operational metrics: system performance, memory and CPU usage, and, particularly in the case of machine learning or deep learning networks, GPU usage. So what have you found to be the most useful metrics to track, and the most useful ways to track and analyze those metrics, to make sure that the model is running effectively
[00:30:50] Unknown:
and signal when you need to potentially retrain the model and redeploy it? Yeah, we usually find that most of our clients are pretty good at tracking the infrastructure-level metrics, like knowing when they need to go up to a bigger GPU if they're doing deep learning, or when they're running out of memory. They usually have a pretty good handle on that. What we find is that people are not, at least initially, really ready to think about what probabilistic models do and what they need them to do. So we start off by walking through the example of, okay, what is your model gonna do? It pulls in data and it makes probabilistic predictions.
And the challenge with this type of infrastructure is that it's not always obvious when there is a problem. If your business is running a recommendation engine and it's making lousy recommendations, customers don't call you up. No one's ever called up Amazon and said, "Hey, you didn't recommend great products to me." Right? They just don't buy. So you have to have some infrastructure in place that tries to alert you to your biggest problems. And usually those are: is the data there? Are you getting all the data you need, or is it suddenly just a pile of nulls in the database? Is it in the right format? Is it within the acceptable bounds that the models expect? Usually we find situations where clients are subscribing to some third-party API. Well, that company can release or change its API anytime, and if your models rely on it, you need some sort of validation check that ensures the data is within the bounds that the model expects. And usually you can get away with really simple stuff: just run a standard deviation or variance and a mean on a certain set of the input data, and have some alert that goes off if it gets outside of those bounds. That'll take care of 99% of your problems. Then, on the other side of the model, on the output, make sure that your model is within some acceptable bounds.
If you expect 5 recommendations, are you getting 5 recommendations? And finally, the one that I see missed the most often is to actually ask if anybody's using it. Are people using your model? Of course, if you have a recommendation engine, you can check whether you're improving recommendations. But if you have a model that's producing something for an analyst or a sales team, go ask them: are they using it? And if so, how? I can't tell you the number of times I've seen companies put together a lead scoring or some other tool for their sales team, and you sit down with the sales team and say, "Hey, how's the lead scoring working?" And they say, "Well, we don't use that. We still just call everybody." It doesn't matter how good your model is if nobody's using it.
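Here is a minimal sketch of the kind of bounds checks described above, assuming reference statistics computed once from the training data. The thresholds, names, and the choice to raise rather than page an alerting system are all illustrative.

```python
import numpy as np

# Bounds derived from the training data (computed once, stored with the model).
TRAIN_MEAN, TRAIN_STD = 42.0, 7.5  # hypothetical reference statistics
EXPECTED_RECOMMENDATIONS = 5

def validate_inputs(batch: np.ndarray) -> None:
    # Is the data there at all, or just a pile of nulls?
    if batch.size == 0 or np.isnan(batch).mean() > 0.05:
        raise ValueError("input batch missing or mostly null")
    # Is it within the bounds the model saw during training?
    mean = batch[~np.isnan(batch)].mean()
    if abs(mean - TRAIN_MEAN) > 3 * TRAIN_STD:
        raise ValueError(f"input drift: batch mean {mean:.1f} outside expected range")

def validate_outputs(recommendations: list) -> None:
    # If you expect 5 recommendations, are you getting 5 recommendations?
    if len(recommendations) != EXPECTED_RECOMMENDATIONS:
        raise ValueError(f"expected {EXPECTED_RECOMMENDATIONS}, got {len(recommendations)}")
```

In practice these checks would feed an alerting system rather than raise, but the shape is the same. But then beyond that, for some of the metrics you measure, you know, infrastructure, memory usage, CPU, GPU,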
[00:33:41] Unknown:
we usually find that companies have a pretty good handle on those things. And as far as the actual deployable artifact, when you are releasing to production, what are some of the common ways that it is built and managed,
[00:33:55] Unknown:
and the format that it actually takes, for anybody who hasn't encountered that before? Yeah, I think people are surprised the first time they see an actual model, because there's all this mystery around it. But usually, the model itself is just a file which gets loaded into an algorithm. The algorithm is typically a Python library, or some sort of module that you build yourself that wraps around it if you have your own custom code. So the model itself is fairly straightforward to deploy, and sometimes it's just running behind some API that is set up to receive a JSON call. The complexity is everything around it: the process of ingesting data, of validating, of transforming, of training the model, and then of deploying it.
And that's really where the challenge comes in: building that pipeline and that infrastructure. It's just a set of connected Python code that runs in a data pipeline, but that ends up being the harder part of deploying. We see there are basically three ways you can do this, and two of them I don't usually recommend. If you think about what you're doing in machine learning, you have a data scientist that's working in a notebook, and they create a model, and they store the model as a file somewhere, and at some point you've gotta deploy that. The first approach I've seen is to deploy the notebooks themselves. Your interview with Matthew Seal really covers an excellent infrastructure for doing this, and when you'd want to be able to deploy notebooks. Deploying notebooks can work well if you're doing some analytics, if you're maybe producing a CSV file that your marketing team is gonna use for a campaign. But if you're actually talking about machine learning models, there's just a lot of risk in doing that, especially because it's very, very easy for a new team to get sloppy in how it deploys notebooks. If you try to take your entire ingest, transformation, validation, and prediction steps and lay that out in a series of Jupyter cells, it's very difficult to do unit testing, and it's very difficult to deploy, debug, and improve. So what we recommend for people is to use the notebook as the handoff between data science and data engineering. Work on two-week release cycles, have your data engineers staggered two weeks behind the data scientists, and use the notebooks as the interface between the two. If you do that, the data scientists and data engineers will naturally get to understand what each other's goals are, and they'll get good at the deployment process. Admittedly, it's not ideal, because you do have some duplicated work: code that's done in a notebook has to be rewritten into a Python application. But it's the most reliable way we've seen for a team to get started with their initial deployment of machine learning models.
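To make the "model as a file behind a JSON API" idea concrete, here is a minimal sketch assuming a scikit-learn model serialized with joblib and served with Flask. The file name, route, and payload shape are our assumptions, not anything prescribed in the episode.

```python
from joblib import load
from flask import Flask, request, jsonify

app = Flask(__name__)

# The deployable artifact: a file produced at the end of training,
# e.g. with `dump(model, "duration_model.joblib")` on the data science side.
model = load("duration_model.joblib")  # hypothetical artifact name

@app.route("/predict", methods=["POST"])
def predict():
    # One encoded feature row arrives as JSON, e.g. {"features": [0, 1, 0, 3]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"duration_days": float(prediction)})

# The load and predict calls are the easy part; the ingest, validation,
# transformation, and retraining around this endpoint are the real pipeline work.
```

And then as far as the basic technology stack that's necessary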
[00:36:49] Unknown:
for building out this pipeline and putting those first machine learning models into production. I'm curious how the build versus buy debate breaks down, particularly given the amount of open source that's available, but also with your particular client set being larger organizations,
[00:37:07] Unknown:
just what that discussion looks like for you, and then the types of products that you typically recommend to your clients? As far as build versus buy, at the first pass I approach it like any software buy-versus-build decision. Generally speaking, you wanna buy things that are noncompetitive, and you wanna build the things that are strategic, where you get a leg up. For example, you wouldn't go out and build a customer relationship management platform. Right? Because there are a bazillion of them, and you can just go sign up for Salesforce. You're gonna be competitive based on how good your sales team is, not how good the tool is. So that's an example of something you would never wanna build. And as AI starts to evolve, there are going to be some fundamental building blocks that you can just buy from a third party. At some point, there's gonna be a good level-0 NLP help desk solution that you're not gonna wanna build yourself, because you're gonna be able to buy one off the shelf from Amazon or some other vendor. But if you have something which is competitive, if your recommendation engine is very specific to your products and it's something that's gonna drive revenue and drive your sales team, then you probably wanna build that in house and you wanna own it, because Amazon's off-the-shelf recommendation engine probably isn't going to do what you need it to do, because your business is different. So that's the first consideration.
The other consideration is data. When companies talk to me about hiring a vendor, turning their data over to a vendor or a startup to do something for them, I talk them through the issue of the loss of control over their data. Despite whatever contractual structure and constraints you put into place (you can tell a company, "Hey, you're not allowed to train competitors' models on our data, and our data has to be isolated and used only to make predictions for our tool"), the reality of how engineers and data scientists work makes that extremely hard to honor in practice, because by the nature of their work, those engineers are trying to make their models better and better, and they're using client data to do it. So one way or another, you have to be extremely careful that your data is not being abstracted into the product in a way that's gonna benefit your competitors, because that is your most valuable asset. Those are usually the two questions I ask people. Now, there are some cases where you couldn't care less about the data, and my favorite example of this is Expensify. I don't know if you use Expensify, but it's a venture-backed startup that manages corporate expenses, and they use some NLP and some computer vision in there to make it a lot easier. It's a fantastic product; we use it. But there are not too many companies that are concerned about maintaining control of their employees' restaurant receipts. If Expensify can use all that data and make a better product, hallelujah. You don't care if your competitors know where you ate last night. So it really depends on those two questions: how competitive is it, and how important is proprietary ownership of your data? As far as products that we recommend, the kind of stuff we recommend is really the stack that most of your audience is familiar with and what I hear discussed on most of your other podcasts.
Obviously, if you're doing any kind of deep learning, NVIDIA is the only game in town, so it's NVIDIA GPUs. I don't think I've seen an instance where anybody is using anything else. I know that Intel and Google are coming out with competitors, but for the most part, the tools that NVIDIA has are just better. We recommend the Python stack that pretty much everybody else works with: Jupyter Notebooks, NumPy, Pandas, scikit-learn, PyTorch. If you're gonna be building in house, use all of those open source tools. As far as model deployment goes, increasingly we see Airflow becoming the tool of choice, and if anybody wants to learn more about Airflow, I would highly suggest they listen to episode 43 of your podcast. Your interview with James Meickle on a new Airflow installation was excellent and really covers a lot of the key issues that we run into in using Airflow. So those are the main tools. After that, it's the typical DevOps stack: Git, Jenkins. And when it comes down to monitoring, it's either the tools on Amazon, or most companies have an existing production operations team that can monitor models for them. And in terms of deploying the models and managing them, what are some of the major risks that you would call out for anybody who is first exploring this space and wants to have a successful launch of their first product? Our number one recommendation for avoiding product failure is to get your model into production as fast as you can. The biggest risks we see come from people waiting too long, and if your team is still gathering data and training the model months after starting, it's really worthwhile stopping everybody and trying to make sure that you're all on the same page and aligned with business goals.
Most of the problems you're only gonna discover when you deploy a model. So it's much better to deploy something today, even if it's in a nonproduction format, that gets you 90% accuracy, rather than spending nine months getting to 97% accuracy, because you're not really gonna encounter the challenges until you go live. And once you go live, the first thing I would do is take a step back and look at your operations: what are you trying to accomplish, and where's the biggest risk? If you're deploying a recommendation engine, then your biggest risk is that you make a wrong recommendation, and that's not exactly catastrophic. Right? Nobody's gonna call and scream and yell because they got the wrong recommendation; they're just not gonna buy. So that's the type of model you can put into production quickly without a huge amount of production risk.
If you're gonna be launching rockets into space, or you have a battlefield technology that's got to intercept missiles, obviously you're dealing with an entirely different level of testing and validation that has to be present at that point. But we find, for the most part, in terms of testing models, good validation sets, and making sure there's no bias, most data scientists are really good. If you follow the basic advice in terms of setting up a correct test set, setting up a correct validation set, and testing for overfitting, if you do the basics, you can hit 90% of the problems. Where we run into challenges is when it's a team's first time and they're not really focused on putting together a good test set, and they introduce a model into production that's overfit; it just doesn't generalize well when they go into production. But we don't really run into that that often. And for any software engineers or team leaders who want to break into machine learning as a career path, what are some of the data engineering and statistical skills that you would suggest they learn, and do you have any advice in terms of positioning themselves for the right opportunity, and maybe any useful references that you might suggest? It's a great question. In fact, I think if any server-side software engineer feels like they're in kind of a career rut, I would really highly encourage them to take a look at machine learning. And by that, I mean go online and take a couple of basic data science courses. You don't need to become an expert data scientist, but you do have to know how to use notebooks, how to use NumPy, how to use Pandas, how to build a model. Once you understand what overfitting is, what training data is, what features are, just understanding the basics is gonna really help your career. If you wanna do more, I would highly recommend Jeremy Howard's fast.ai course.
It's the best course out there, and his explanations are excellent. If you can get through his course, it really will demonstrate your understanding of the fundamental building blocks of this technology. Beyond that, get good at your basic Python application development and DevOps stack: the standard tools that you would work with in any large data infrastructure environment, AWS, Git, Jenkins. Learn to write good Python code and good test code, unit tests and integration tests. All those basic software skills are good. And then finally, if you can start playing around with Airflow, maybe take a notebook or a basic model you built in the data science course and turn it into an Airflow pipeline; that's something that not a lot of people are doing. But I guess more than anything, I would just go out and look for these opportunities. I mean, go to your manager and repeat this one very simple sentence:
"I want to help data scientists put models into production," because not many people are explicitly saying that they wanna do that. Everybody wants to become a data scientist; nobody in machine learning wants to do all the plumbing. And if you can position yourself as somebody who's excited about that and has taken the opportunity to learn this valuable and hard skill, it's really gonna position you for some great opportunities.
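For anyone who wants to try the Airflow exercise described above, here is a minimal sketch of wrapping a notebook-style workflow in a DAG. The task names, daily schedule, and placeholder function bodies are our assumptions, and the import paths assume a recent Airflow 2.x install.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():   # pull raw data (e.g. with SQL) to a staging location
    ...

def train():     # fit the model and write the artifact to disk
    ...

def validate():  # check metrics against the previous release before promoting
    ...

with DAG(
    dag_id="first_model_pipeline",  # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="train", python_callable=train)
    t3 = PythonOperator(task_id="validate", python_callable=validate)
    t1 >> t2 >> t3
```

Each step that lived in a notebook cell becomes a named task the scheduler can retry, log, and alert on independently.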
[00:46:03] Unknown:
And are there any other aspects of bringing machine learning into production or the data engineer and data science career track or anything
[00:46:14] Unknown:
along the lines of what you're doing at Prolego that we didn't discuss yet that you'd like to cover before we close out the show? Yeah, I guess maybe the only other advice I would give is: if you've spent a lot of the past 10 years working with Django or Rails or one of these frameworks, take the time to bone up on your SQL skills. There are just so many instances where good SQL skills come into play: when you're trying to pull data in, when you're building models, or when you're actually trying to put something into production. I'm surprised at how many junior to mid-level developers just have not had the opportunity to spend a lot of time doing SQL.
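As a small illustration of the kind of query he means, here is a sketch using sqlite3 and pandas. The table and column names are hypothetical, and the same idea applies to any database your ORM is talking to.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")  # hypothetical database file

# The kind of aggregate that is one statement in SQL but awkward to coax
# out of an ORM: average process duration per category, completed records only.
query = """
    SELECT category,
           COUNT(*) AS n,
           AVG(julianday(closed_at) - julianday(opened_at)) AS avg_days
    FROM process_history
    WHERE closed_at IS NOT NULL
    GROUP BY category
    ORDER BY avg_days DESC;
"""
df = pd.read_sql(query, conn)  # lands directly in the data science stack
```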
[00:46:47] Unknown:
So not everything is gonna be an Active Record call in Rails, and if you can get some good SQL skills, that'll really benefit you. Alright, well, thank you for all that information. And for anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think it's the one we discussed previously: getting Jupyter Notebooks into production. It's basically taking a notebook, something that a data scientist creates,
[00:47:20] Unknown:
and deploying it. There are really only three ways you can do it. One is that you don't use notebooks at all; you take the Joel Grus approach. The second is that you try to put notebooks into production, and then you're gonna be dealing with putting together some more complicated infrastructure to manage that. And the third one is where you have some sort of handoff between the machine learning engineers, I'm sorry, the data scientists, and the data engineers. So I think that's the biggest challenge we see, and if anybody's looking for a good startup idea, that's one of the more interesting problems I think our industry needs solved. Well, thank you very much for all of your insight into the current state of the machine learning industry and the advice that you've been providing. I am sure that a number of listeners will find great value in it, and it may even encourage a few people to change their career track. So thank you for that, and I hope you enjoy the rest of your day. Thanks a lot, Tobias. Appreciate it.
Introduction to Kevin DeWalt and Prolego
Kevin's Journey in Data Management
Understanding AI and Machine Learning
Prolego's Mission and Market Strategy
Educating Clients on AI and ML
Technical Skills for ML Projects
Challenges in Data Teams
Steps to Develop and Deploy ML Models
Tracking Metrics for ML Models
Deploying ML Models: Best Practices
Build vs. Buy in ML Infrastructure
Major Risks in ML Deployment
Career Advice for Aspiring ML Engineers
Final Thoughts and Contact Information