Artificial Intelligence

Data Labeling That You Can Feel Good About - Episode 89

Summary

Successful machine learning and artificial intelligence projects require large volumes of properly labelled data. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides a valuable service to businesses and meaningful work to people in developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customers' existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for Machine Learning and more

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what CloudFactory is and the story behind it?
  • What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
  • What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
  • Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
    • What protocols do you have in place to ensure data quality and identify potential sources of bias?
  • What role do humans play in the lifecycle for AI and ML projects?
  • I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
    • How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
  • Can you share some stories of cloud workers who have benefited from their experience working with your company?
  • What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
  • What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
  • What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
  • What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
    • How does that tie into your plans for CloudFactory in the medium to long term?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Deep Learning For Data Engineers - Episode 71

Summary

Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off
  • Your host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?
  • What has been your personal experience with deep learning and what set you down that path?
  • What is involved in building a data pipeline and production infrastructure for a deep learning product?
    • How does that differ from other types of analytics projects such as data warehousing or traditional ML?
  • For anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?
  • What are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?
  • What are some ways that we can use deep learning as part of the data management process?
    • How does that shift the infrastructure requirements for our platforms?
  • Cloud providers have been releasing numerous products to provide deep learning and/or GPUs as a managed platform. What are your thoughts on that layer of the build vs buy decision?
  • What is your litmus test for whether to use deep learning vs explicit ML algorithms or a basic decision tree?
    • Deep learning algorithms are often a black box in terms of how decisions are made, however regulations such as GDPR are introducing requirements to explain how a given decision gets made. How does that factor into determining what approach to take for a given project?
  • For anyone who wants to learn more about deep learning, what are some resources that you recommend?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:13
Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so you should check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of the show. Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it's time to talk to our friends at strongDM. They've built an easy to use platform that lets you leverage your company's single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers,
you don't want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in New York City on April 15th, both run by our friends at O'Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off. Your host is Tobias Macey, and today I'm interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects. So Thomas, could you start by introducing yourself?
Thomas Henson
0:02:20
Hi, so I'm Thomas Henson, a Pluralsight author, involved in the data engineering community, and I also work for Dell EMC on our unstructured data team. I've been around data engineering, and really around the Hadoop ecosystem, probably for the last six years, since before Hadoop 2.0, and I've just been a part of that community and love it.
Tobias Macey
0:02:41
And do you remember how you first got involved in the area of data management?
Thomas Henson
0:02:44
Oh yeah, a hundred percent. So, going through college I thought for a long time that I was going to be a DBA, so that's really what I was targeting. When I graduated, the job market being what it is, and it doesn't matter what era you're in, especially when you're getting out of college you're having to apply for a lot of different positions. I actually got my first role as a web developer, so totally different than being a DBA, but I always kind of had that passion. I guess I would be considered a full stack developer, so I did do some database management to some extent for applications, but nothing too ingrained like a traditional DBA. And then, lo and behold, a few years later there was a research project that came up. I didn't know at the time that it was going to be a big data project, but I knew it was going to require a lot of information and really take me outside of my comfort zone, so I volunteered to get on that project. It turned out we were using Elasticsearch at the time, and then we rotated into using Hadoop, so we downloaded the Hortonworks sandbox, and Cloudera had one at the time too, and that was kind of my path. And then I went to my first
0:03:53
I think Hadoop Summit, and from there I just started looking at it, and really saw this community, and really saw my opportunity to get into data from what I had looked at in my college days. I haven't looked back since. And recently you've started getting into the area of deep learning and experimenting with that.
Tobias Macey
0:04:12
So can you start by giving a bit of an overview of what deep learning is, for anybody who isn't familiar with that terminology?
Thomas Henson
0:04:18
Yeah, so from a data engineer's approach, I didn't really spend much time on the algorithms, just focusing on some of the machine learning pieces. I'm not saying it was a black box for data engineers, but it's not something I really spent a lot of time on. I was worried about being able to stand up our Hadoop cluster, or stand up our environment, or writing, at the time, MapReduce jobs or Spark jobs, those kinds of pieces, and I kind of left the data science to the data scientists. But I slowly started looking into, okay, I know which algorithms we're using, let me find out a little bit more about what's underneath the covers there. Specifically, if you're looking at it from a data engineer's perspective, or even a data science perspective, the real key difference between deep learning and machine learning is going to be the use of neural networks. You're using neural networks to go through and analyze your data, versus with machine learning it's more of an approach where, hey, we're taking all these different feature sets. One of the famous examples is identifying cats on the internet. I don't know why you want to be able to identify cats from YouTube videos, maybe it just makes for amazing YouTube videos, but that seems to be the first use case. So if you think about the machine learning approach to how you're going to identify a cat from a video, you're going to program in the different features.

So, features like, what are the ears like, does it have hair, even though there are hairless cats, you're going to assign those, right, the whisker length and some of those other pieces. And you're going to run those through your different algorithms. So if you're using an SVM, or some kind of decision tree, you're going to pick out the algorithm, and you're going to test and run that through. That's the machine learning approach. But with a deep learning approach, what you're going to do is have this labeled data set, and you can have an unlabeled data set, but let's just keep it with labeled data sets here. You're going to feed those images of those cats through, and you're going to let the neural network decide, okay, these features, the hair or the whiskers, what makes the biggest difference there, and you can evaluate how that looks. So it's a different approach from what we've done for machine learning. But as a data engineer, it was just fascinating to me, and I wanted to step back and take some time to really learn what our data scientists and our team go through, to hopefully make me a better data engineer, so I can understand the algorithms and go through that process.
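The contrast described above can be sketched in a few lines of code. This is a toy illustration, not anything from the episode: the "classical ML" path scores a couple of hand-picked features with a rule the engineer wrote, while the "deep learning" path hands raw pixel vectors to a single learned sigmoid unit (standing in for a full network) and lets gradient descent find the weights. All feature names and numbers here are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Classical ML path: hand-engineered features plus a fixed rule ---
# Hypothetical features per animal: [ear_pointiness, whisker_length]
features = np.array([[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]])
labels = np.array([1, 1, 0, 0])  # 1 = cat, 0 = not-cat

def rule_based_cat(x):
    # The engineer decides which features matter and how to combine them.
    return int(x[0] + x[1] > 1.0)

rule_preds = np.array([rule_based_cat(x) for x in features])

# --- Deep learning path: raw pixels in, weights learned from labels ---
# Synthetic 8x8 "images": cats are brighter in the top half, non-cats are not.
def make_image(is_cat):
    img = rng.random((8, 8)) * 0.3
    if is_cat:
        img[:4, :] += 0.6
    return img.ravel()

X = np.stack([make_image(c) for c in (1, 1, 1, 0, 0, 0) * 20])
y = np.array((1, 1, 1, 0, 0, 0) * 20, dtype=float)

w = np.zeros(64)  # one sigmoid unit: the model finds the useful pixels itself
b = 0.0
for _ in range(500):                 # plain gradient descent on log loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

learned_preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = (learned_preds == y).mean()
```

Nobody told the second model where to look; the weights it learns concentrate on the top-half pixels on their own, which is the point being made about neural networks discovering features.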
0:07:02
So at the end of 2017, a couple of groups... I do a podcast with some other people in the data engineering and data analytics world, and there were a couple of us, Aaron Banks and Brett Roberts. We were looking at doing a Coursera course and going through it together, and we wanted to do the most famous one, the most famous machine learning course, the one from Andrew Ng. I think he's taught more people on the planet about machine learning than probably anybody else. So we were like, okay, this is the most popular course on the planet, let's go through it. And it was very hard. We kind of looked at it as, oh, an online course is something we can do together, we'll record videos after going through it. But it really took me down more of a math path, and I guess I should have known that. So after going through that and really understanding more about machine learning, and just doing some work in my job at Dell EMC, where I'm part of a group called the unstructured data solutions team, there was a lot going on in the deep learning world, and I was challenged by some of my co-workers and the other business units to understand more about it. So I took what I learned in the machine learning area and applied it to what's going on in deep learning. I started learning more about TensorFlow and PyTorch, and what's happening on a GPU-specific basis, and just went down that path. It wasn't that I was targeting deep learning at first, I just thought it would be good for me to understand, because as somebody who's advocating out in the data engineering community, we continually get questions around data science.

And so I just thought, for me to be more well rounded, it would be good to be able to answer some of those questions and have a better understanding of it, and it just evolved into, hey, we need to check out what's going on from a TensorFlow perspective, and I haven't looked back for the last year or so.
Tobias Macey
0:08:43
And particularly from the perspective of a data engineer who's working on building out the infrastructure and the data pipelines that are necessary for feeding these different machine learning algorithms or deep learning projects, what is involved in building out that set of infrastructure and requirements to support a project that is going to be using deep learning? Particularly as it compares to something that would be using a more traditional machine learning approach, which requires more of the feature engineering up front, as opposed to just labeling the data sets for those deep learning algorithms?
Thomas Henson
0:09:19
So that's a good question. As you start looking at it, the way that I approach it with my learning, and the way that I like to describe it, is to think about my experience from the Hadoop ecosystem, right? How does what we're doing in deep learning differ from what's going on from the Hadoop ecosystem perspective? In that world your data is in HDFS, and what we're trying to analyze is still somewhat structured, or semi-structured we call it, or unstructured data, but it was really still a lot of text data and portions like that. Versus what we're doing with a deep learning approach, where we're talking about mostly image data, or voice recognition, just rich media, right, even video data. And that's really one of the key portions. So with that, when we talked about big data in Hadoop, we were talking about large data sets. But now, on the deep learning side, we're talking about massive data sets, right? I mean, how much video data does it take to create the next driverless car? We're still going through that and figuring it out. But you can just imagine, if you're doing any kind of simulations or anything like that, we're talking about lots of sensors and lots of data points. So there are some challenges there. And one of the big keys that's really pushed deep learning forward, and why you're seeing projects from the traditional ecosystem like Project Hydrogen, Submarine, and even what NVIDIA is doing with RAPIDS, is that they're trying to move more onto the GPU. The GPU is giving you the ability to analyze data faster, even do ETL faster.

And that's really accelerating things. So it does bring up challenges when we're talking about building out that data pipeline and how you want to progress with it. And there aren't really any answers just yet to how it's all going to go, because it's still somewhat fluid, right? Let's just take TensorFlow, for example. What you're doing when you're setting up a TensorFlow environment might be something as simple as setting up different shares, so you have an NFS mount where you can analyze all this data, and you're still orchestrating it and going through that portion. But to build up those data pipelines, you might just have one data set, right, or one large set of that data. And so I think really the key, and where we are maybe in 2019 and beyond, is we're starting to look and say, hey, how can we bridge that data with what we have in our Hadoop ecosystem, or what we have in other data sets? Not that I'm saying Hadoop is going to be the key to that, but it's interesting to see how it plays out. We're at this point now where we're taking advantage of what's going on from a GPU perspective, and now we want to do what we've done with other projects throughout the years: can we marry this with other data that we have, or other decisions that we've already made? So it's really interesting, and like I said, there are a lot of different approaches to it, and we're still going down that path.
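At this stage of maturity, the pipeline being described often amounts to little more than streaming files off a shared mount in batches. A minimal sketch under that assumption; the directory layout, file names, and batch size here are invented for the example, not taken from the episode:

```python
import os
import tempfile
from itertools import islice

def iter_batches(data_dir, batch_size):
    """Yield lists of file paths from data_dir, batch_size at a time."""
    paths = sorted(
        os.path.join(data_dir, name) for name in os.listdir(data_dir)
    )
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Demo against a throwaway directory standing in for the shared NFS mount.
with tempfile.TemporaryDirectory() as mount:
    for i in range(10):
        open(os.path.join(mount, f"frame_{i:03d}.jpg"), "w").close()
    batches = list(iter_batches(mount, batch_size=4))

sizes = [len(b) for b in batches]  # → [4, 4, 2]
```

Frameworks provide richer versions of this (prefetching, shuffling, decoding), but the core contract, an iterator of fixed-size batches pointed at a mount, is the same.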
Tobias Macey
0:12:16
And I think your point about the fact that deep learning is particularly applicable to these projects that are focused on rich media, as you put it, video or images or audio, means it starts to look more like a content delivery pipeline than the traditional data pipeline that we're used to, where we might be working more with discrete records, or flat files on disk, or things like that, that have a lot of structured aspects to them, where there might be similarities between records that are conducive to different levels of compression or aggregation. Whereas with video in particular, and even audio, there is a lot less of that similarity from second to second within the content, but also between files, because there are so many different orientations that are possible for an image frame. So just conceptually, it requires a much different tack as to how you're managing the information and how you're providing it to the algorithms that are actually processing it.
Thomas Henson
0:13:16
Oh, a hundred percent. I mean, we're talking massive amounts of storage, right? Like I said, think about video data coming in. Most of that, in its format, might be compressed to some extent, but there's not going to be any dedupe or additional compression that we can do, for the most part. One video file of a car driving down the road versus a different view of that same drive, it's just not going to dedupe well. So there are some challenges there. But one of the things is, I did say, hey, when we look at it from a rich media type, versus traditionally what we did with Spark and Hadoop and anything in that Hadoop ecosystem, which was still mostly text-based data, it's still the same thing here. I just want people in the audience to understand we're still breaking the data down. Let's just say that we're doing grayscale, right, we're still breaking it down into matrices of numbers, but it's a lot of numbers, right, for one video, or an image, or anything from an audio perspective.
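The point about images reducing to matrices of numbers is easy to make concrete. A toy sketch (the frame here is random data, and the 64x64 resolution is chosen only to keep the numbers small): an RGB frame collapses to a 2-D grayscale matrix, and even one tiny frame is thousands of values.

```python
import numpy as np

rng = np.random.default_rng(42)

# A fake 64x64 RGB frame: height x width x 3 channels, values 0-255.
frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Common luminance weights for converting RGB to a single gray channel.
weights = np.array([0.299, 0.587, 0.114])
gray = (frame @ weights).astype(np.uint8)  # shape (64, 64)

values_per_frame = gray.size  # 4096 numbers for one very small frame
# At 30 frames per second, one hour of this video is 108,000 such matrices.
per_hour = values_per_frame * 30 * 60 * 60
```

Scale the frame up to HD and add back the three color channels and the "lots of zeros and ones" framing becomes obvious: the storage math dominates everything else.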
Tobias Macey
0:14:16
And particularly for formats like video or audio, where the information in each frame is important and contextual in relation to the surrounding frames, it makes it much more difficult to identify the logical points where we can split it. Versus something like a Parquet file, where if an individual file starts to grow beyond a certain set of boundaries, you can just say, okay, I'm going to split it at this record boundary. It's not necessarily possible to do that with video or audio without compromising the value that you're getting out of it.
Thomas Henson
0:14:47
Yeah, I mean, that's for sure. You're looking at it from that perspective of how you compress it or break it up, and we're talking about massive amounts of data, large data sets, and being able to break those into chunks. But even think about it from a compute perspective, when we're just talking about RAM. A lot of times, when we're talking about being able to run a job, maybe a Spark job or a traditional MapReduce job in your cluster, you might have a ton of RAM, right? But at this scale we're talking terabytes or petabytes. I was just reading an article with predictions that we're at 33 zettabytes of data worldwide today, and by 2025, so less than six years away, we're going to be at 175
0:15:34
zettabytes. It's just massive, just crazy to think about how big the data is and how much data we're creating. And it's also fun, because we're changing the way that we interact with society. We can get into where we think AI is, and where we think the boundaries are, and how much of it is maybe hype or not. But I'll say that my favorite thing to talk about, when we're talking about AI as a concept, is that it's really just an extension of automation at this point, automation that we couldn't do before, right?
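The growth figures quoted here imply a steep compound rate, which is worth a quick back-of-the-envelope check. Taking the quoted 33 ZB as the figure at the time of recording (assumed 2018) and 175 ZB as the 2025 projection:

```python
import math

start_zb, end_zb = 33.0, 175.0  # quoted worldwide data volumes
years = 2025 - 2018

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (end_zb / start_zb) ** (1 / years) - 1
# Roughly 27% per year: data volume more than quintuples in seven years.

doubling_years = math.log(2) / math.log(1 + cagr)  # about 2.9 years per doubling
```

A doubling time of under three years is the practical takeaway for capacity planning: whatever footprint a pipeline needs today, plan for twice that well within a hardware refresh cycle.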
Tobias Macey
0:16:05
And in terms of the actual responsibilities of the data engineer for the data as it's being delivered to these algorithms, particularly as it compares to machine learning, where you might need to do up-front feature extraction and feature identification to get the most value out of the algorithm: my understanding is that with deep learning you're more likely to just provide coarse-grained labeling of the information and then rely on the deep neural networks to extract the useful features. So I'm wondering if you can talk a bit about how the responsibilities of the data engineer shift as you're going from machine learning into deep learning, particularly from the standpoint of feature extraction and labeling.
Thomas Henson
0:16:49
Yeah, so ETL is not going away, you know, there's still going to be ATL involved, and they're still going to be, you know, whether we call it data wrangling or data, data mining, right, we're still a lot of what I'm seeing a lot of what we're talking about, and, you know, I've talked to the Chief Data Science Officer, you know, at SAS, and, you know, one of the things that he was saying is, you know, we're still mostly doing supervised learning. So we're on the path of supervised learning, where we have to have these train labeled data sets, right. And so, you know, data is still King and labeled data is still King as well, just because of that fact, you know, you know, we do think, you know, in the next in the next five years or so, we might start seeing more advances from an unsupervised learning perspective, and just really seeing that, but there's still a lot of time. And I think, I think there was a stat that that was out there, and I wish I could credit who it was from, but I think 79% of a data scientist or data engineers job is still things outside of data engineering, and data science. And part of that goes back to, you know, there's, there's a big portion of that that's, you know, part of the data wrangling and part of the ETL that's involved. But then also, one of the things and, you know, this is, this is something as data engineers, and, you know, as we shift into, you know, creating a new role called the machine learning engineer, but it's a, you know, around the same around the same time type of concepts, one of the things that will never get out of, and probably the reason we like, being a data engineer versus a data scientist, is, there's still a lot of importing, making sure we have the right software packages, making sure that, you know, this version, you know, if we're using an Nvidia card that this version of Kudu and it's going to work with the version of TensorFlow that we're trying to line up. 
So there's still a lot of that outside of just making sure we have the right data and the right data sets. And if we're using some kind of storage, making sure we've allocated enough: if we're taking data off of a simulation, do we have a big enough footprint to hold the hundred terabytes that's going to be written, and that needs to be read as soon as it's written? So there are still a lot of fun things for us as data engineers to stay technical on, but there are new challenges with this, too.
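The CUDA-versus-TensorFlow alignment check Thomas describes can be scripted. A minimal sketch, assuming TensorFlow 2.3 or later, where `tf.sysconfig.get_build_info()` reports the CUDA and cuDNN versions a given build was compiled against; it degrades gracefully when TensorFlow is absent:

```python
def tf_build_versions():
    """Report the TF version plus the CUDA/cuDNN versions it was built against.

    Returns None if TensorFlow is not installed in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()  # dict; available in TF >= 2.3
    return {
        "tensorflow": tf.__version__,
        "cuda": info.get("cuda_version"),    # None on CPU-only builds
        "cudnn": info.get("cudnn_version"),
    }

if __name__ == "__main__":
    print(tf_build_versions())
```

Comparing the reported CUDA version against what the NVIDIA driver on the box supports is the alignment step that otherwise tends to be discovered the hard way at job launch.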
Tobias Macey
0:18:53
And for anybody who's in the early stages of a deep learning project, I'm curious what some of the edge cases or gotchas that they should be aware of are, particularly ones that you've experienced yourself as you're working on these types of projects.
Thomas Henson
0:19:08
Yeah, so some of the edge cases, some of the things to start looking at: it's a little bit of a different approach. Like I said, if you're coming from the Hadoop ecosystem, at this point it's a little more simple. It's easy to get started, in the sense that you can set up an NFS mount and just point your jobs at it. From a TensorFlow or PyTorch perspective, GPUs are going to be the big piece, so it's recommended that you install and use the packages built for the GPU cards that you have. You can do it with CPU if you're just trying to do a POC, if you're still doing some testing to validate that you know the steps to go through, and there are libraries for that, but GPU for the most part. And then there are a lot of calls going back and forth, so make sure you're checking your card against the latest version of CUDA when you're using TensorFlow or PyTorch. Another thing to do is start thinking about how this is going to grow. Just like we joked about with the Hadoop ecosystem, some of the reason these projects get funded is that AI is a hot topic right now, and four or five years ago Hadoop was the same thing. So those projects get greenlit, you get funding to stand them up, but there's a lot of attention on you, too.
So there are going to be a lot of asks: "Hey, the analytics group down the hall is involved in a project, I'd love to get them to put some AI online." You want to understand that and understand how it's going to grow. Another thing, and you're seeing this from a data engineer's perspective too, but it's on the forefront just because of where we are with deep learning: containerization is huge. When we first started out in the Hadoop ecosystem it was "it has to be on bare metal, we can't virtualize," and now we're going cloud-native, with Cloudera releasing different versions. So from a deep learning perspective, orchestration and management and just understanding containerization, if that's not something you have, and it's something I've been trying to catch up on in the last year, definitely make sure you're up to speed on it, because it's going to play an important part. Like I said, there are tools out there that will help you manage that orchestration layer, but on the back end it's essentially containerization that lets you do your scheduling and everything else, kind of like what we've seen with YARN in the Hadoop ecosystem.
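For the CPU-is-fine-for-a-POC versus GPU-for-real-work point above, a small sketch of how you might confirm what TensorFlow can actually see on a box (assuming TF 2.x; an empty list simply means CPU-only):

```python
def visible_gpus():
    """Names of GPU devices TensorFlow can use; [] means CPU-only (fine for a POC)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []  # TensorFlow not installed in this environment
    return [device.name for device in tf.config.list_physical_devices("GPU")]

print(visible_gpus())
```

Running this before kicking off a long training job avoids the common surprise of silently training on CPU because the CUDA libraries did not load.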
Tobias Macey
0:21:45
In terms of the level of familiarity and understanding that's necessary to build out the underlying infrastructure and work effectively with the data scientists on these deep learning projects, how much knowledge of deep learning and machine learning, and some of the mathematics and fundamental principles behind them, should we as data engineers have in order to continue to progress in our careers and work effectively as these types of projects become more prevalent?
Thomas Henson
0:22:16
So that's a super good question, and I get it a lot: even just at the basics, should I understand the algorithms, whether we're talking about machine learning or deep learning? Should I be able to make recommendations and look at TensorFlow? It's such a software engineering answer to say "it depends," but it really does. If you're in a small organization that's just going down the path, maybe you have a data scientist, or maybe it's more of a data analyst in your organization, then you're going to want to be able to carry some of that. Now, I'm not saying you'll be the one to recommend "we should use a CNN here," or, on the machine learning side, "we should use PCA or decision trees," but you definitely want a little more understanding around it, so that when it comes to your role in the organization you can follow some of the tweaking and the thought process around it and add something to the table. In a large organization that's maybe more mature in its analytics or deep learning journey, you're not going to have to focus as much on the underlying pieces: how does this math work, what are the weights and biases, why are there so many different layers? You won't have to go as deep there.
But I will tell you one thing I've found. Like I said, I came from the data engineering side and understood a little bit about the algorithms but didn't really focus on them. One of the things I like more about deep learning than machine learning is that the math is a little different, a little more basic. I've heard people say you can get away with roughly a first semester of calculus for deep learning, whereas with machine learning the algorithms and what you're doing are a lot more complex. A lot of it goes back to what we were talking about, how the data is broken up: from a deep learning perspective it's really just matrix math. It's matrix algebra, stacking all these ones and zeros, or for RGB images the pixel values, all these different pieces together. So we're using a lot of basic math, it's just really big math. That's a long answer to say that it depends, but it really is going to depend on your organization and where your role is. I would encourage you, from a career perspective, to be a little bit familiar with it, to have a natural curiosity about it, but I wouldn't say that you have to go deep.
0:24:50
You're not going to have to go back and get a degree, and you're not going to have to know the intricacies of everything about it. But especially with the algorithms, or the different neural networks, that you're implementing in your organization, I'd be pretty familiar there. I just wouldn't stand up and put myself forward as the one who's going to recommend which approach we take.
Tobias Macey
You mentioned earlier the possibility of leveraging some of these deep learning capabilities in the data preparation and ETL processes. So can you talk a bit more about the different ways that we can leverage the capabilities that are promised by deep learning as part of our own work in the data management process?
Thomas Henson
Yeah, that's a good point. I keyed on the point that we use supervised learning a lot. Just to recap: supervised learning is where, going back to our cat photos, we have a lot of images, and this one contains a cat, this one doesn't, so we know the end outcome we're looking for. And then I talked about how unsupervised learning is on the forefront and something we're starting to see. Well, you can use unsupervised learning to help out with some of your ETL and some of your data wrangling. Unsupervised learning is where we say, "here are a million images, and I just want you to classify them."
You just feed them in, and it can help you group. Like I said, I don't think we're going to get out of ETL, and I don't think we're going to get out of data wrangling for a while, but you can use something like unsupervised learning to pull things together and put some kind of order to all this unstructured data we have. One of the famous examples we've done before is sentiment analysis; think about sentiment analysis if you've ever walked through a Twitter tutorial. Now take that same idea: we can train a neural network to look at a whole bunch of images and put them into some kind of structure. So if your job was to build those training data sets, you could use unsupervised learning to categorize them and put them in clusters, so that instead of looking at a million pictures, maybe I'm only looking at 100,000.
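The cluster-then-label idea Thomas sketches can be illustrated with a toy k-means over feature vectors (the vectors here are a hypothetical stand-in for image embeddings; this is a minimal NumPy sketch, not a method from the episode):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means: group unlabeled feature vectors so a human can label per cluster."""
    rng = np.random.default_rng(seed)
    # start from k randomly chosen points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # distance from every point to every centroid, then nearest assignment
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# 300 stand-in "embedding" vectors; a labeler reviews a few exemplars per
# cluster instead of eyeballing all 300 items individually
X = np.random.default_rng(1).normal(size=(300, 8))
labels, centroids = kmeans(X, k=5)
```

In practice the grouping quality depends entirely on how good the feature vectors are; clustering raw pixels rarely works as well as clustering embeddings from a pre-trained network.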
Tobias Macey
0:26:58
And as far as how that plays into the infrastructure requirements, and the processing requirements for actually being able to execute these ETL jobs as we're incorporating deep learning, what kind of impact does that have?
Thomas Henson
0:27:12
Yeah, so from a storage perspective there's a big footprint, and it helps to talk about the different environments. First think of your training environment: this is where we're building out those algorithms, training the neural network toward the outcome we're hoping for. Back to our cat identifier, we're sending millions of images through to train those models. To do that you have to have the storage, and you also have to have the data throughput. These are some of the most powerful chips on the planet, your GPUs, and specifically for doing things on-prem there are requirements around just getting enough power and cooling into a node full of GPUs: you're limited in how much power they can pull, and then there's heat and some of the other facilities requirements. So there's a lot of processing that goes on in that part of the workload. Then we flip to: OK, we've got our model, now let's try it out in the real world and see if it's working. That's the inference side, the inference environment. The easiest way I like to think about it, coming from an application development background, is to look at your training environment as kind of your test/dev, and your inference environment as more your production environment.
And that's where you really see it: we're going to send a whole bunch of images in, we're not doing any more training, and we're asking, can it identify a cat or not a cat? Can it drive down this practice road? Or do we need to go back and train it some more? With that said, in an application development environment, test/dev is normally not your biggest footprint, right? But in deep learning, the training side is where the majority of your data lives, because you're trying to get the most data, and the best data, into training these algorithms, so that when you go into production, or test it in inference, you've made the best decisions.
Tobias Macey
0:29:34
Particularly at the infrastructure layer, there have been a lot of new offerings coming out from the various cloud providers that aim to provide access to pre-trained neural networks, or managed services for executing these different deep learning algorithms and pipelining data into and out of them. So I'm wondering what your thoughts are on the build-versus-buy decision around deep learning, given the availability of these managed services. And particularly at the ETL layer, does it make sense to build out your own additional capacity for running these algorithms, or to just start consuming these managed services, at least as an initial step toward simplifying and enhancing your capability to provide meaningful data processing and ETL for feeding into the end product that you're actually aiming for?
Thomas Henson
0:30:10
Yeah, it's a good way to look at it. With the cloud providers you have what I sometimes think of as a catalog: there are these different approaches and different algorithms I can use and turn toward my data. Think of it as a service; it's almost a data scientist in the cloud, or a data scientist behind your browser: "here's some data, test this out and see if it works for you." It's a good way to start looking at things and seeing whether something is viable for what you want to do. But at the same time, we talk about data gravity: where does the majority of your data live? If the majority of your data exists in the cloud and you want to take part in these managed services, you just have to understand, from a business perspective, that you're offloading some of the work of your data scientists or your data analysts. You're not having to put as much research into figuring out which algorithm is going to be best for your data sets, so it gives you a guideline to test against. But if you have a lot of your data on-prem, and you've built your own systems there, and you have your own research and your own talent in house, then building is probably going to be the best approach for you.
You're going to want to build your own systems and your own algorithms, because your data is unique, and in that case you may not want to transfer and move your data up into some of these managed services. It goes back to the debates we've gone through over the years in different areas: are we going to offload, are we going to outsource to consultants, things like that. It's really about how you want to approach it. For small teams that maybe don't have a data scientist, there are tools, both on-premises and off, that pose the same kind of decision: how are we going to build out our models? Especially if you're just starting out, there are products like DataRobot and others that let you take your data and a sample and say, "these approaches might work for you, this might give you the answer you're looking for." From a deep learning perspective, we're just starting to see those products and tools come out; we've had them on the machine learning side for some time, but you're starting to see deep learning integrated into a lot of tools. So I think this debate will continue, and we'll continue to see product offerings, maybe not even in analytics tools, that start to take advantage of deep learning.
Tobias Macey
0:32:47
What is your personal litmus test for determining when it is useful and practical to use deep learning, as opposed to a traditional machine learning algorithm, or even just a basic decision tree, for providing a given prediction or decision on whatever the input data might happen to be?
Thomas Henson
0:33:09
Yeah, so whenever we look at that, traditionally what I've seen, going back to what we were talking about: deep learning is really good when we're talking about image data, video data, audio files, those kinds of rich media types. For the most part, when we're talking about a classic example like "can you predict housing prices or mortgage rates in a certain geo," a lot of those problems already come with the feature sets, so they work really well with machine learning. It's not that you can't use deep learning for them; there are plenty of examples out there. But traditionally, that's still the split I see. All that being said, like I said, I still come from the data engineering side. So if you're listening to the podcast and you're a data engineer, and there's a data scientist in your organization, maybe you should rely on them, but be curious about it too: if they choose something different than what you'd expect, ask them why. That's my rule of thumb, but I'm not going to argue with a data scientist if they want to test some of that out, and like I said, there are plenty of examples out there.
But I still see that when we're talking about traditional structured or semi-structured data, data where we already have all the features and it's not video files or audio or images, we can still stick with the machine learning approach.
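Thomas's rule of thumb can be condensed into a toy helper (the data-type names here are illustrative, not from the episode):

```python
def suggest_approach(data_type):
    """Rule of thumb: rich media favors deep learning; already-featurized or
    tabular data is usually well served by classical machine learning."""
    rich_media = {"image", "video", "audio"}
    return "deep learning" if data_type in rich_media else "classical machine learning"

print(suggest_approach("image"))    # deep learning
print(suggest_approach("tabular"))  # classical machine learning
```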
Tobias Macey
0:34:38
And with deep learning algorithms, they're often a black box in terms of identifying which features contributed to a given decision. With regulations such as GDPR in particular, but others that are either active in different locations or in the process of being formulated, there are requirements being introduced to be able to identify the different factors that played into a decision, especially when it impacts an individual. So how does that factor into your determination of what approach to take for a given project, as far as whether you use deep learning or machine learning or just a standard Boolean-logic-based approach?
Thomas Henson
0:35:24
Yeah, that's definitely an interesting question. We're talking about GDPR in this specific example, but think about what's going on with autonomous cars: at some point you've got to figure out where the car makes the decision to drive into a brick wall or to hit somebody on a bicycle, and how you can go back and prove that. It may really be the regulation, more than the technology, that gates autonomous driving. Same thing here: can you go back and prove which weight, which bias we had? From a deep learning perspective there are ways, when you're looking at the neural networks, to trace back through the layers and see where we weighted one feature and how it all worked. I don't know, from a legal perspective, how that plays out with GDPR, what level of proof you would need, but those are definitely challenges we're looking at. And not just from a data engineer's or a data scientist's perspective: these are our projects, and we're all involved in this conversation. So those are interesting points, but I don't think it's something we're going to solve here.
With GDPR there are so many different layers to it. From a regulation perspective, we can say, OK, we have data elements that aren't going to leave a specific border, specific GPS coordinates; data is going to stay in the country in which it originated. But how do you handle data that originated in one country when you've trained models on it and deployed the model elsewhere? You're not moving the data, but how does all that tie in? Those are huge questions that we'll see play out for years and years.
Tobias Macey
0:37:20
It's a whole big ball of yarn to try and untangle. And then there are the aspects of bias, in terms of how the different features are weighted and what inherent bias the training data has because of how it was collected or how it's represented. That's an ongoing conversation, and one that I don't know we'll ever find a complete solution to, but it's definitely useful to keep in mind as we build these different products. It's important to be thinking about it even if we don't have a perfect answer, because we shouldn't let the perfect be the enemy of the good, especially as it pertains to people's privacy, or rights, or inherent biases and how they're represented in the software and projects that we build.
Thomas Henson
0:38:01
You know, I think that could be a multi-part, maybe even ongoing, podcast for you to have, just peeling back the layers of GDPR, what our responsibility is, and things to think about.
Tobias Macey
0:38:16
And for anybody who wants to learn more about deep learning from the perspective of a data engineer, or who might be interested in deploying it for their own projects, or building projects based on it, what are some of the resources that you have found useful, and that you recommend other people take a look at?
Thomas Henson
0:38:34
Yeah, so like I said, I started out with the machine learning course from Coursera, and I'm actually going through the deep learning specialization from deeplearning.ai right now. It's an engineering course with Python development and really cool hands-on work with TensorFlow, so that's been very interesting. And from my own perspective, like I said, I'm a Pluralsight author, and one of the things I went through and did last year, and released this year, was a data engineer's course on TensorFlow using something called TFLearn, which is an abstraction layer for TensorFlow. Think of it from a data engineer's perspective like Pig Latin: Pig Latin could take you from 140 lines of Java down to eight or ten lines of code. It's the same idea with TFLearn. So I created a course specific to data engineers on how to get started and build your first neural network. There's a lot out there on Coursera, including some free courses, and Google has a machine learning crash course; I think they say it takes around 30 days, but I was able to knock it out in about two weeks. It offers labs and everything, with explanations and little quizzes in it too. So there are a lot of resources out there, and TensorFlow is very popular with huge documentation. Just get out there and start looking at it, and if you don't understand it at first, that's okay. Get in and start learning, and as you keep going through repetition, you'll start to understand.
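The course itself isn't reproduced here, but the "a neural network is just big matrix math" point from earlier can be sketched in plain NumPy: a two-layer network learning XOR, with the forward pass and backpropagation written as matrix products. This is an illustrative sketch, not material from the TFLearn course:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: learn the XOR of two bits
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# one hidden layer; every forward pass is just matrix multiplication
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)      # hidden layer activations, shape (4, 8)
    p = sigmoid(h @ W2 + b2)      # predictions, shape (4, 1)
    # backpropagation: the chain rule, still just matrix math
    dp = (p - y) / len(X)         # grad of cross-entropy w.r.t. pre-sigmoid logits
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = (dp @ W2.T) * (1 - h**2) # tanh derivative
    dW1, db1 = X.T @ dh, dh.sum(0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 0.5 * grad       # plain gradient descent step

print((p > 0.5).astype(int).ravel())  # ideally recovers [0 1 1 0]
```

Frameworks like TensorFlow or TFLearn automate exactly this gradient bookkeeping, which is why a first network there can be only a handful of lines.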
Tobias Macey
0:40:06
And are there any other aspects of deep learning from the perspective of a data engineer that we didn't discuss yet that you think we should cover before we close out the show?
Thomas Henson
0:40:15
No, like I said, I think the three biggest things I would look to from a data engineer's perspective are watching what's going on with projects like Submarine and Spark's Project Hydrogen, and then looking into what NVIDIA is doing; they have documentation and some blog posts out there on it. NVIDIA RAPIDS, I think, is going to have a huge impact on our day-to-day jobs as data engineers, both from the aspect of speeding up some of the ETL pipelines and also being able to access what's going on with TensorFlow, PyTorch, or Caffe.
Tobias Macey
0:40:48
Right. And for anybody who wants to follow along with you, or get in touch, or see the work that you've been doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Thomas Henson
0:41:05
Yeah, so the biggest gap in tooling for data management is probably going to be in the ETL arena. It's how I started out: like I said, I volunteered for a job without having experience in data engineering or in the Hadoop ecosystem, and my first job was doing ETL. We keep going through the years saying, "hey, this is something we're going to fix, we have this tool or that tool," and I'm not saying it's not getting better, but I think it's one of those things that, until we can train the machines to do it for us, is always going to be something we do.
Tobias Macey
0:41:43
All right. Well, I appreciate you taking the time to join me and share your experiences working with deep learning and how it plays into your work as a data engineer. It's definitely useful and interesting to get that background and keep an eye on different areas of concern for people working in the industry. So I appreciate that, and I hope you enjoy the rest of your day.
Thomas Henson
You too. Thanks.

Building Machine Learning Projects In The Enterprise - Episode 69

Summary

Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Kevin Dewalt about his experiences at Prolego, building machine learning projects for Fortune 500 companies

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • For the benefit of software engineers and team leaders who are new to machine learning, can you briefly describe what machine learning is and why it is relevant to them?
  • What is your primary mission at Prolego and how did you identify, execute on, and establish a presence in your particular market?
    • How much of your sales process is spent on educating your clients about what AI or ML are and the benefits that these technologies can provide?
  • What have you found to be the technical skills and capacity necessary for being successful in building and deploying a machine learning project?
    • When engaging with a client, what have you found to be the most common areas of technical capacity or knowledge that are needed?
  • Everyone talks about a talent shortage in machine learning. Can you suggest a recruiting or skills-development process for companies that need to build out their data engineering practice?
  • What challenges will teams typically encounter when creating an efficient working relationship between data scientists and data engineers?
  • Can you briefly describe a successful project of developing a first ML model and putting it into production?
    • What is the breakdown of how much time was spent on different activities such as data wrangling, model development, and data engineering pipeline development?
    • When releasing to production, can you share the types of metrics that you track to ensure the health and proper functioning of the models?
    • What does a deployable artifact for a machine learning/deep learning application look like?
  • What basic technology stack is necessary for putting the first ML models into production?
    • How does the build vs. buy debate break down in this space and what products do you typically recommend to your clients?
  • What are the major risks associated with deploying ML models and how can a team mitigate them?
  • Suppose a software engineer wants to break into ML. What data engineering skills would you suggest they learn? How should they position themselves for the right opportunity?
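One of the questions above concerns the metrics a team should track to keep a deployed model healthy. A common pattern, offered here as an illustrative sketch rather than anything Prolego specifically recommends, is to monitor input or score drift with the Population Stability Index (PSI), which compares today's binned distribution against the one seen at training time:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are lists of per-bin proportions that each
    sum to 1.0 (e.g. the model's score histogram at training time
    versus in production today).
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # guard against log(0) for empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Identical distributions produce a PSI of 0 (no drift).
baseline = [0.25, 0.25, 0.25, 0.25]
print(psi(baseline, baseline))  # 0.0

# A shifted distribution produces a larger PSI; a common rule of thumb
# flags values above roughly 0.2 for investigation.
shifted = [0.10, 0.20, 0.30, 0.40]
print(psi(baseline, shifted) > 0.2)  # True
```

The 0.1/0.2 alerting thresholds are conventions from credit-risk monitoring, not universal constants; teams typically calibrate them against their own retraining cadence.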

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
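Deon itself is a command-line tool, but the core idea the summary describes, rendering a stage-by-stage checklist of ethical concerns into a document a team can review, is simple enough to sketch. The stages and items below are illustrative placeholders, not Deon's actual default checklist or API:

```python
# A toy re-implementation of the idea behind Deon: render a
# lifecycle-stage ethics checklist as Markdown. The stage and item
# names are illustrative examples only.
CHECKLIST = {
    "Data Collection": [
        "Have we obtained informed consent for the data we collect?",
        "Could the collection process introduce sampling bias?",
    ],
    "Modeling": [
        "Have we tested model performance across demographic subgroups?",
    ],
    "Deployment": [
        "Is there a way for users to contest or correct model decisions?",
    ],
}

def render_markdown(checklist):
    """Render the checklist as Markdown with task-list checkboxes."""
    lines = ["# Data Ethics Checklist", ""]
    for stage, items in checklist.items():
        lines.append(f"## {stage}")
        lines.extend(f"- [ ] {item}" for item in items)
        lines.append("")
    return "\n".join(lines)

print(render_markdown(CHECKLIST))
```

Generating the checklist as a file in the repository, rather than keeping it in someone's head, is what turns ethics review into a reviewable, versionable part of the project workflow.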

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Deon is and your motivation for creating it?
  • Why a checklist, specifically? What’s the advantage of this over an oath, for example?
  • What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
  • What is the typical workflow for a team that is using Deon in their projects?
  • Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
    • Have you received pushback on any of the default items?
  • How does Deon simplify communication around ethics across team boundaries?
  • What are some of the most often overlooked items?
  • What are some of the most difficult ethical concerns to comply with for a typical data science project?
  • How has Deon helped you at Driven Data?
  • What are the customer facing impacts of embedding a discussion of ethics in the product development process?
  • Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
  • What are your hopes for the future of the Deon project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

Summary

Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • To start, can you explain the problem space that Alegion is targeting and how you operate?
  • When is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects?
  • What are some of the biggest challenges associated with managing human input to data sets intended for machine usage?
  • For someone who is acting as a human-intelligence provider as part of the workforce, what does their workflow look like?
    • What tools and processes do you have in place to ensure the accuracy of their inputs?
    • How do you prevent bad actors from contributing data that would compromise the trained model?
  • What are the limitations of crowd-sourced data labels?
    • When is it beneficial to incorporate domain experts in the process?
  • When doing data collection from various sources, how do you ensure that intellectual property rights are respected?
  • How do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers?
    • What kinds of metadata do you track and how is that recorded/transmitted?
  • Do you think that human intelligence will be a necessary piece of ML/AI forever?
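Several of the questions above concern ensuring the accuracy of worker inputs and guarding against bad actors. One widely used quality-control pattern, shown here as a minimal sketch rather than a description of Alegion's actual pipeline, is redundant labeling with majority vote, escalating low-agreement items to a domain expert:

```python
from collections import Counter

def aggregate_labels(annotations, min_agreement=2 / 3):
    """Majority-vote aggregation of redundant human labels.

    `annotations` maps an item id to the labels submitted by
    independent workers. Items whose winning label falls below the
    agreement threshold are flagged for expert review instead of
    being trusted automatically.
    """
    accepted, needs_review = {}, []
    for item, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[item] = label
        else:
            needs_review.append(item)
    return accepted, needs_review

labels = {
    "img-001": ["cat", "cat", "cat"],   # unanimous
    "img-002": ["cat", "dog", "cat"],   # 2/3 agreement, accepted
    "img-003": ["cat", "dog", "bird"],  # no consensus, escalate
}
accepted, review = aggregate_labels(labels)
print(accepted)  # {'img-001': 'cat', 'img-002': 'cat'}
print(review)    # ['img-003']
```

Per-worker agreement rates computed from the same data double as a signal for spotting careless or adversarial contributors before their labels reach a training set.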

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA