
Serverless Data Pipelines On DataCoral - Episode 76

Summary

How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Raghu Murthy about DataCoral, a platform that offers a fully managed and secure stack in your own cloud that delivers data to where you need it

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what DataCoral is and your motivation for founding it?
  • How does the data-centric approach of DataCoral differ from the way that other platforms think about processing information?
  • Can you describe how the DataCoral platform is designed and implemented, and how it has evolved since you first began working on it?
    • How does the concept of a data slice play into the overall architecture of your platform?
    • How do you manage transformations of data schemas and formats as they traverse different slices in your platform?
  • Your site mentions that you have the ability to automatically adjust to changes in external APIs. Can you discuss how that manifests?
  • What has been your experience, both positive and negative, in building on top of serverless components?
  • Can you discuss the customer experience of onboarding onto DataCoral and how it differs between existing data platforms and greenfield projects?
  • What are some of the slices that have proven to be the most challenging to implement?
    • Are there any that you are currently building that you are most excited for?
  • How much effort do you anticipate if and/or when you begin to support other cloud providers?
  • When is DataCoral the wrong choice?
  • What do you have planned for the future of DataCoral, both from a technical and business perspective?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Why Analytics Projects Fail And What To Do About It - Episode 75

Summary

Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to that failure and not all of them are under our control. However, many of them are, and as data engineers we can help to keep our projects on the path to success. Eugene Khazin is the CEO of PrimeTSR, where he is tasked with rescuing floundering analytics efforts and ensuring that they provide value to the business. In this episode he reflects on the ways that data projects can be structured to provide a higher probability of success and utility, how data engineers can contribute throughout the project lifecycle, and how to salvage a failed project so that some value can be gained from the effort.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Eugene Khazin about the leading causes for failure in analytics projects

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • The term "analytics" has grown to mean many different things to different people, so can you start by sharing your definition of what is in scope for an "analytics project" for the purposes of this discussion?
    • What are the criteria that you and your customers use to determine the success or failure of a project?
  • I was recently speaking with someone who quoted a Gartner report stating an estimated failure rate of ~80% for analytics projects. Has your experience reflected this reality, and what have you found to be the leading causes of failure in your experience at PrimeTSR?
  • As data engineers, what strategies can we pursue to increase the success rate of the projects that we work on?
  • What are the contributing factors that are beyond our control, which we can help identify and surface early in the lifecycle of a project?
  • In the event of a failed project, what are the lessons that we can learn and fold into our future work?
    • How can we salvage a project and derive some value from the efforts that we have put into it?
  • What are some useful signals to identify when a project is on the road to failure, and steps that can be taken to rescue it?
  • What advice do you have for data engineers to help them be more active and effective in the lifecycle of an analytics project?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building An Enterprise Data Fabric At CluedIn - Episode 74

Summary

Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to found a business aimed at building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Before we get started, can you share your definition of what a data fabric is?
  • Can you explain what CluedIn is and share the story of how it started?
    • Can you describe your ideal customer?
    • What are some of the primary ways that organizations are using CluedIn?
  • Can you give an overview of the system architecture that you have built and how it has evolved since you first began building it?
  • For a new customer of CluedIn, what is involved in the onboarding process?
  • What are some of the most challenging aspects of data integration?
    • What is your approach to managing the process of cleaning the data that you are ingesting?
      • How much domain knowledge from a business or industry perspective do you incorporate during onboarding and ongoing execution?
    • How do you preserve and expose data lineage/provenance to your customers?
  • How do you manage changes or breakage in the interfaces that you use for source or destination systems?
  • What are some of the signals that you monitor to ensure the continued healthy operation of your platform?
  • What are some of the most notable customer success stories that you have experienced?
    • Are there any notable failures that you have experienced, and if so, what were the lessons learned?
  • What are some cases where CluedIn is not the right choice?
  • What do you have planned for the future of CluedIn?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

A DataOps vs DevOps Cookoff In The Data Kitchen - Episode 73

Summary

Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, head chef of Data Kitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • "There aren’t enough data conferences out there that focus on the community, so that’s why these folks built a better one": Data Council is the premier community powered data platforms & engineering event for software engineers, data engineers, machine learning experts, deep learning researchers & artificial intelligence buffs who want to discover tools & insights to build new products. This year they will host over 50 speakers and 500 attendees (yeah that’s one of the best "Attendee:Speaker" ratios out there) in San Francisco on April 17-18th and are offering a $200 discount to listeners of the Data Engineering Podcast. Use code: DEP-200 at checkout
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Chris Bergh about the current state of DataOps and why it’s more than just DevOps for data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • We talked last year about what DataOps is, but can you give a quick overview of how the industry has changed or updated the definition since then?
    • It is easy to draw parallels between DataOps and DevOps, can you provide some clarity as to how they are different?
  • How has the conversation around DataOps influenced the design decisions of platforms and system components that are targeting the "big data" and data analytics ecosystem?
  • One of the commonalities is the desire to use collaboration as a means of reducing silos in a business. In the data management space, those silos are often in the form of distinct storage systems, whether application databases, corporate file shares, CRM systems, etc. What are some techniques that are rooted in the principles of DataOps that can help unify those data systems?
  • Another shared principle is in the desire to create feedback cycles. How do those feedback loops manifest in the lifecycle of an analytics project?
  • Testing is critical to ensure the continued health and success of a data project. What are some of the current utilities that are available to data engineers for building and executing tests to cover the data lifecycle, from collection through to analysis and delivery?
  • What are some of the components of a data analytics lifecycle that are resistant to agile or iterative development?
  • With the continued rise in the use of machine learning in production, how does that change the requirements for delivery and maintenance of an analytics platform?
  • What are some of the trends that you are most excited for in the analytics and data platform space?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Customer Analytics At Scale With Segment - Episode 72

Summary

Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process, and to allow your non-engineering employees to gain access to the information they need to do their jobs, Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Your host is Tobias Macey and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Segment is and how the business got started?
    • What are some of the primary ways that your customers are using the Segment platform?
    • How have the capabilities and use cases of the Segment platform changed since it was first launched?
  • Layered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fit into the overall structure of Segment and the driving force behind their design and use?
  • What are some of the best practices for structuring custom events in a way that they can be easily integrated with downstream platforms?
    • How do you manage changes or errors in the events generated by the various sources that you support?
  • How is the Segment platform architected and how has that architecture evolved over the past few years?
  • What are some of the unique challenges that you face as a result of being a many-to-many event routing platform?
  • In addition to the various services that you integrate with for data delivery, you also support populating data warehouses. What is involved in establishing and maintaining the schema and transformations for a customer?
  • What have been some of the most interesting, unexpected, and/or challenging lessons that you have learned while building and growing the technical and business aspects of Segment?
  • What are some of the features and improvements, both technical and business, that you have planned for the future?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Deep Learning For Data Engineers - Episode 71

Summary

Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off
  • Your host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?
  • What has been your personal experience with deep learning and what set you down that path?
  • What is involved in building a data pipeline and production infrastructure for a deep learning product?
    • How does that differ from other types of analytics projects such as data warehousing or traditional ML?
  • For anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?
  • What are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?
  • What are some ways that we can use deep learning as part of the data management process?
    • How does that shift the infrastructure requirements for our platforms?
  • Cloud providers have been releasing numerous products to provide deep learning and/or GPUs as a managed platform. What are your thoughts on that layer of the build vs buy decision?
  • What is your litmus test for whether to use deep learning vs explicit ML algorithms or a basic decision tree?
    • Deep learning algorithms are often a black box in terms of how decisions are made, however regulations such as GDPR are introducing requirements to explain how a given decision gets made. How does that factor into determining what approach to take for a given project?
  • For anyone who wants to learn more about deep learning, what are some resources that you recommend?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:13
Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so you should check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of the show. Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists and managers the permissions that they need, then it's time to talk to our friends at strongDM. They've built an easy to use platform that lets you leverage your company's single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems, and go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data and everything else you need to know about modern data platforms. For even more opportunities to meet, listen and learn from your peers, you don't want to miss the Strata conference in San Francisco on March 25 and the Artificial Intelligence conference in New York City on April 15, both run by our friends at O'Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off. Your host is Tobias Macey, and today I'm interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects. So Thomas, could you start by introducing yourself?
Thomas Henson
0:02:20
Hi. So I'm Thomas Henson, a Pluralsight author, involved in the data engineering community, and I also work for Dell EMC in our unstructured data team. So I've been around data engineering and really around the Hadoop ecosystem probably for the last six years, since before Hadoop 2.0, and I've just been a part of that community and love it.
Tobias Macey
0:02:41
And do you remember how you first got involved in the area of data management?
Thomas Henson
0:02:44
Oh, yeah, hundred percent. So, you know, going through college I thought for a long time that I was going to be a DBA, so really, that's kind of what I was targeting. And so when I graduated, you know, the job market being what it is, and it doesn't matter what era, right, you're in, especially when you're getting out of college, you're having to apply for a lot of different positions. And I actually got my first role as a web developer, so totally different, right, than being a DBA, but I always kind of had that passion. And I guess I would be considered a full stack developer, so I did do some database management to some extent for applications, but nothing too ingrained like a traditional DBA. And then, you know, lo and behold, a few years later, there was a research project that came up. I didn't know at the time that it was going to be a big data project, but I knew it was going to require a lot of information and just really take me outside of my comfort zone. So I volunteered to get on that project. And it turned out we were using Elasticsearch at the time, and then we rotated into using Hadoop, so, you know, downloaded the Hortonworks sandbox, and Cloudera had one at the time, too, and that was kind of my path. And, you know, when I went to my first
0:03:53
I think Hadoop Summit, and, you know, from there, I just started looking at it, and just really saw this community and really saw my opportunity to get into data from, you know, what I looked at in my college days. So I haven't looked back since. And recently, you've started getting into the area of deep learning and experimenting with that.
Tobias Macey
0:04:12
So can you start by giving a bit of an overview of what deep learning is, for anybody who isn't familiar with that terminology?
Thomas Henson
0:04:18
Yeah, so from a data engineer's approach, I didn't really spend much time on the algorithms, just kind of focusing on some of the machine learning pieces and, you know, that portion. And I'm not saying it was like kind of a black box for data engineers, but it's not something that I really spent a lot of time on. Like, I was worried about being able to stand up our Hadoop cluster, or stand up our environment, or, you know, writing, at the time, MapReduce jobs or Spark jobs, and those kinds of pieces, and kind of left the data science to the data scientists. But, you know, I slowly started looking into, okay, well, I know which algorithms we're using, let me find out a little bit more, you know, kind of underneath the covers, about what those are, and so started kind of having that approach to looking at it. But specifically, you know, if you're looking at it from a data engineer's perspective, or even a data science perspective, the real difference and key between deep learning and machine learning is going to be the use of neural networks. So you're using neural networks to be able to, you know, go through and analyze your data, versus, you know, with machine learning, it's more of an approach where, hey, we're taking all these few different feature sets. Like, one of the famous examples is trying to identify cats on the internet. And I don't know why you want to be able to identify cats from YouTube videos, maybe it just makes for amazing YouTube videos, I don't know, but that seems to be like the first use case. And so if you think about, you know, the machine learning approach to how you're going to identify a cat from a video, you're going to program in the different features. So, like, you know, what are the ears like, does it have hair, even though there are hairless cats, but, you know, you're going to assign those, right, the whisker length and some of those other pieces. And you're going to run those through your different algorithms. So if you're using SVD, or if you're using, you know, some kind of decision tree, you're going to pick out the algorithm, and you're going to test and run that through. And that's the machine learning approach. But, you know, from a deep learning approach, what you're going to do is you're going to have this labeled data set, and you can have an unlabeled data set, but let's just keep it with labeled data sets here. And you're going to feed those images of those cats through, and you're going to be able to identify and let the neural network decide, okay, these features, the hair or the whiskers, you know, what makes the biggest difference there, and you can kind of evaluate how that looks. So it's a different approach from what we've done for machine learning. But just as a data engineer, it was just kind of fascinating to me, and I wanted to step back and take some time to really learn kind of what our data scientists, you know, our team, kind of go through to, hopefully, you know, make me a better data engineer, so I can understand algorithms and kind of go through that process.
0:07:02
So at the end of 2017, a couple of groups of us got together. So I do a podcast with some other people in the data engineering and, you know, the data analytics world. And so there were a couple of us, Aaron Banks and Brett Roberts, and we were looking at doing a Coursera course and just being able to kind of go through it. And we wanted to do the most famous one, so, like, the most famous machine learning course, with Andrew Ng, you know, taking everybody through it. I think he's taught more people on the planet about that than probably anybody else. And so we were like, okay, this is the most popular course on the planet, let's kind of go through it. And it was very hard. So, you know, we kind of looked at it as, like, oh, an online course is something we can do together, we record videos after going through it. But it really took me down more of a math path. And I mean, I guess I should have known that. So, you know, after kind of going through that, and really understanding more about machine learning, and just doing some work in my job at Dell EMC, I'm part of a group called the unstructured data solutions team. And so, you know, being a part of that group, there's a lot of things going on in the deep learning world, and I was kind of challenged by some of my co-workers and the other business units to understand more about that. And so, you know, I took what I learned in the machine learning area and kind of really applied that to what's going on from a deep learning perspective. And so I started learning, you know, more about TensorFlow and PyTorch, and what's going on from a GPU-specific basis, and just kind of going down that path. So, you know, it wasn't that I was targeting deep learning at first, I just kind of thought it would be good for me to understand, because we continually get questions, you know, as somebody who's advocating out in the data engineering community, questions around data science. And so I just thought, for me to be more well rounded, it would be good for me to be able to answer some of those questions, to have a better understanding for it, and it just kind of evolved into, hey, we need to check out what's going on from a TensorFlow perspective, and I just kind of haven't looked back for the last year. So
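To make the contrast Thomas draws above concrete, here is a minimal sketch, assuming scikit-learn and a recent TensorFlow build; the feature columns and the images/ directory layout are hypothetical. The first half hands a decision tree a table of hand-engineered features; the second hands a small network nothing but labeled images and lets it learn the features.

import numpy as np
import tensorflow as tf
from sklearn.tree import DecisionTreeClassifier

# Machine learning approach: hand-engineered features per animal
# (hypothetical columns: whisker length, ear-shape score)
features = np.array([[4.2, 0.9], [1.1, 0.2], [3.8, 0.8]])
labels = np.array([1, 0, 1])  # 1 = cat, 0 = not a cat
tree = DecisionTreeClassifier().fit(features, labels)

# Deep learning approach: labeled images in, learned features out
# (hypothetical directory with cat/ and not_cat/ subfolders)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "images/", image_size=(64, 64), batch_size=32)
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2),  # the network decides which features matter
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(train_ds, epochs=1)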
Tobias Macey
0:08:43
and particularly from the perspective of a data engineer who's working on building out the infrastructure and the data pipelines that are necessary for feeding into these different machine learning algorithms or deep learning projects, what is involved in building out that set of infrastructure and requirements to support a project that is going to be using deep learning, particularly as it compares to something that would be using a more traditional machine learning approach that requires more of the feature engineering up front, as opposed to just labeling the data sets for those deep learning algorithms?
Thomas Henson
0:09:19
So that's a good question. As you start looking at it, you know, kind of the way that I approached it with my learning, and just kind of the way that I like to describe it, is you think about, you know, my experience from the Hadoop ecosystem, right? Like, how does it differ, you know, from what we're doing in deep learning to what's going on from the Hadoop ecosystem perspective? And you think about it, in that world your data in HDFS, or what we're trying to analyze, is still, you know, somewhat structured, or semi-structured, we call it, or, you know, we call it unstructured data, but it was really still a lot of text data and other portions like that. Versus what we're doing from a deep learning approach, you know, we're talking about mostly image data, or voice recognition, just rich media, right, like, even video data. And that's really kind of one of the key portions. So with that, you know, when we talked about big data in Hadoop, we were talking about large data sets. But now, you know, on the deep learning side, we're talking about massive data sets, right? Because, I mean, how much video data does it take to, you know, create the next driverless car, right? We're still going through that and figuring that out. But I mean, you can just imagine, if you're doing any kind of simulations or anything like that, we're talking about lots of sensors and lots of data points. And so there's some challenges there. And one of the big keys that's really kind of pushed forward deep learning, and why you're seeing other projects from the traditional ecosystem, like, so there's projects like Project Hydrogen, Submarine, and even what NVIDIA is doing with RAPIDS, they're trying to get more into the GPU. And so the GPU is giving you the ability to analyze data faster, even do ETL faster. And, you know, that's really kind of accelerating it. So it does bring up challenges whenever we're talking about building out that data pipeline, and how you want to kind of progress to it. And, you know, there's not really any answers just yet to how it's all going to kind of go, because it's still somewhat fluid, right? Because, like, if we look at what we're doing, let's just take TensorFlow, for example. So, like, what you're doing when you're setting up a TensorFlow environment, you know, it might be something as simple as you're just setting up, you know, different shares, so you have some NFS mounts, right, where you can just analyze, you know, all this data, and, you know, you're still orchestrating it, and you're still going through that portion. But to build up those data pipelines, you know, you might just have one data set, right, or, you know, one large set of that data. And so I think, really, where we are, maybe in 2019 and beyond, is we're starting to look and say, hey, how can we bridge that data with what we have in our Hadoop ecosystem, right, or what we have in other data sets.
And not that I'm saying Hadoop is going to be the key to that, or even, you know, what we call the Hadoop ecosystem, but it's still kind of interesting to see how that plays out, right. Like, we're at this point now where we're taking advantage of what's going on from a GPU perspective, and now we want to do like we've done with other projects throughout the years, right, that we've seen in the past: can we marry this with other data that we have, or other decisions that we've already made? So it's real interesting. And, you know, there's, like I said, a lot of different approaches to it, and we're still kind of going down that path.
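As a rough illustration of the setup described above, here is a minimal sketch of pointing a TensorFlow input pipeline at data on a shared mount, assuming a recent TensorFlow; the NFS path and the file pattern are hypothetical.

import tensorflow as tf

# Hypothetical NFS share, mounted at the same path on every worker
files = tf.data.Dataset.list_files("/mnt/nfs/training-data/*.tfrecord")

dataset = (
    tf.data.TFRecordDataset(files)
    .shuffle(buffer_size=1000)   # shuffle serialized records
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # overlap storage I/O with GPU compute
)

for batch in dataset.take(1):
    print(batch.shape)  # (32,) serialized examples, ready to parse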
Tobias Macey
0:12:16
And I think your point, too, about the fact that deep learning is particularly applicable to these projects that are focused on rich media, as you put it, video or images or audio, it starts to look more like a content delivery pipeline than necessarily the traditional data pipeline that we're used to, where we might be working more with discrete records, or, you know, flat files on disk, or things like that, that have a lot of structured aspects to them, where there might be similarities between records that are conducive to different levels of compression or aggregation. Whereas with video in particular, and even audio, there is a lot less of that similarity from second to second within the content, but also between files, because there are so many different orientations that are possible for an image frame or anything like that. So just conceptually, it requires a much different tack as to how you're managing the information and how you're providing it to the algorithms that are actually processing it,
Thomas Henson
0:13:16
oh, hundred percent, I mean, we're talking massive amounts of storage, right, to be able to, like I said, handle video data coming in. And most of that, in its format, might be compressed to some extent, but there's not going to be any dedupe, or some kind of compression that we can do, you know, for the most part, right. Like, you know, one video file of a car driving down the road versus, you know, a different view of that same one, it's not going to dedupe well, it's not going to have that. So there are some challenges there. But one of the things is, you know, I did say, hey, when we look at it from a rich media type, traditionally, when we're talking about Spark and Hadoop, and anything in that Hadoop ecosystem, it's still kind of text-based data, and it's still the same thing here. So I just want people in the audience to understand we're still breaking the data down. We're just breaking it down into, you know, let's just say that we're doing grayscale, right, we're still breaking it down into matrices of zeros and ones, but it's a lot of zeros and ones, right, for one video, or an image, or, you know, anything from an audio perspective.
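A minimal sketch of that breakdown, assuming NumPy and Pillow; the frame filename is hypothetical. A grayscale frame really is just a matrix of numbers, one per pixel.

import numpy as np
from PIL import Image

frame = Image.open("frame_0001.png").convert("L")  # "L" = 8-bit grayscale
matrix = np.asarray(frame) / 255.0                 # pixel values scaled to [0, 1]

print(matrix.shape)  # e.g. (1080, 1920) for a single HD frame
# At 30 frames per second, one minute of video is ~1800 of these matrices,
# and two views of the same scene share almost no byte-level redundancy.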
Tobias Macey
0:14:16
And particularly for formats like video or audio, where the information in relation to the other attributes of, you know, the frame to frame is important and contextual, it makes it much more difficult to identify what are the logical points where we can split it, versus something like a Parquet file, where if an individual file starts to grow beyond a certain set of boundaries, you can just say, okay, I'm just going to split it at this record boundary. It's not necessarily possible to do that with video or audio without compromising the value that you're getting out of it.
Thomas Henson
0:14:47
Yeah, I mean, so that's for sure. Like, you're looking at it from that perspective of, you know, how you compress it, or how you break it up, and, you know, we're talking about massive amounts of data, large data sets, being able to break those into chunks. But I mean, even thinking about it from a compute perspective, when we're just talking about RAM, right. Like, a lot of times, whenever we're talking about being able to run a job, maybe it's a Spark job, or some, you know, traditional MapReduce job in your cluster, you might have a ton of RAM, right. But, you know, think about, I mean, we're talking at this scale, we're talking terabytes or petabytes. You know, I was just reading an article where they were talking about the predictions that we're at 33 zettabytes of data worldwide today, and by 2025, so less than six years away, we're going to be at 175
0:15:34
zettabytes. And so, like, I mean, it's just massive, you know, it's just crazy to think about how big the data is, and how much data we're creating from this. And I mean, it's also fun, because we're changing the way that we interact with society out there. And we can get into, you know, where we think AI is, and where we think the boundaries are, and how much of it is maybe hype or not. But I'll say that my favorite thing to kind of talk about, whenever we're talking about AI just as a concept, is really, it's just an extension of automation at this point. But it's just automation that we could do, right.
Tobias Macey
0:16:05
And in terms of the actual responsibilities of the data engineer for the data as it's being delivered to these algorithms, particularly as it compares to machine learning, where you might need to do up front feature extraction and feature identification to be able to get the most value out of the algorithm, my understanding is that with deep learning, you're more likely to just provide coarse grained labeling of the information and then rely on the deep learning neural networks to extract the useful features. So I'm wondering if you can talk a bit about how the responsibilities of the data engineer shift as you're going from machine learning into deep learning, particularly from the standpoint of feature extraction and labeling.
Thomas Henson
0:16:49
Yeah, so ETL is not going away, you know, there's still going to be ETL involved, and there's still going to be, whether we call it data wrangling or data mining, right, a lot of that. You know, I've talked to the Chief Data Science Officer at SAS, and one of the things that he was saying is, you know, we're still mostly doing supervised learning. So we're on the path of supervised learning, where we have to have these labeled training data sets, right. And so, you know, data is still king, and labeled data is still king as well, just because of that fact. You know, we do think, in the next five years or so, we might start seeing more advances from an unsupervised learning perspective, but there's still a lot of time. And I think there was a stat that was out there, and I wish I could credit who it was from, but I think 79% of a data scientist's or data engineer's job is still things outside of data engineering and data science. And part of that goes back to, you know, there's a big portion of that that's part of the data wrangling and part of the ETL that's involved. But then also, one of the things, and this is something for us as data engineers, as we shift into creating a new role called the machine learning engineer, but it's around the same type of concepts, one of the things that we'll never get out of, and probably the reason we like being a data engineer versus a data scientist, is there's still a lot of importing, making sure we have the right software packages, making sure that, you know, if we're using an NVIDIA card, this version of CUDA is going to work with the version of TensorFlow that we're trying to line up. So there's still a lot of that, too, outside of just making sure that we have the right data and making sure that we have the right data sets. And, hey, you know, if we're using some kind of storage, make sure we've allocated enough, right. Like, if we're taking data that's off of a simulation, hey, do we have a big enough footprint to hold that hundred terabytes that's going to be written and needs to be read as soon as it's written? So there's still a lot of fun things for us as data engineers to stay technical on that side. But there's new challenges with this, too.
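A minimal sketch of that version lineup check, assuming a TensorFlow 2.x build that exposes tf.sysconfig.get_build_info(): it reports the CUDA and cuDNN versions the installed build expects and whether any GPU is actually visible.

import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TensorFlow version:", tf.__version__)
print("Built against CUDA:", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))

# If the driver and CUDA runtime on the box don't line up with the build,
# this list comes back empty and training silently falls back to CPU.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus if gpus else "none")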
Tobias Macey
0:18:53
And for anybody who's in the early stages of a deep learning project, I'm curious what some of the edge cases or gotchas that they should be aware of are, and particularly ones that you've experienced yourself as you're working on these types of projects.
Thomas Henson
0:19:08
Yeah, so, you know, some of the edge cases, some of the things to start kind of looking at: it's a little bit of a different approach, like I said, if you're coming from the Hadoop ecosystem. At this point, it's a little bit more simple, right? Like I was saying, it's easy to get started, you know, from the perspective of, hey, you can set up an NFS mount and just be able to point your jobs at it. And, you know, from a TensorFlow perspective, or PyTorch, GPUs are going to be the big piece, right. So, you know, it's recommended that you install the different packages for the different GPU cards that you have. You can do it with CPU, like, if you're just trying to do a POC, if you're still trying to do some testing to validate, hey, I know the steps to kind of go through it, there are some of those different libraries that you can use, but, like I said, GPU for the most part there. And then, you know, there's a lot of things calling and going back and forth, so making sure that you're checking, you know, checking your card with the latest version of CUDA, using TensorFlow or using PyTorch. And so I would look to that. And another thing to do is, you know, start thinking about how this is going to grow. So just like we kind of joked about in the Hadoop ecosystem, you know, once people start understanding that you can do big data, some of the reasons that these projects get funded are because, like we were just talking about, AI is a hot topic right now, and four or five years ago Hadoop was the same thing, so those projects get greenlit, right, you get funding to stand up those projects. But there's a lot of attention on you, too. So there's going to be a lot of asks, right, like, oh, hey, you know, the analytics group down the hall there, they're involved in a project, oh, wow, I'd love to get them to put some AI online. So, you know, you want to understand that and understand how that's going to grow. And kind of another thing, and you're seeing this grow just from a data engineer's perspective, too, but this is, you know, on the forefront just because of where we are from a deep learning perspective: containerization is huge. So, you know, if you're a data engineer, you know, when we first started out in the Hadoop ecosystem, it was like, hey, man, it has to be on bare metal, we can't virtualize, and, you know, now we're going cloud native, Cloudera releasing, you know, different versions. So, you know, from a deep learning perspective, orchestration and management and just understanding containerization, like, if that's not something that you have, that's something that I've been trying to catch up on in the last year. So I'd definitely make sure that you're up to speed on that, because it's going to play an important part. And so, like I said, there's tools out there that will help you manage that orchestration layer. But on the back end, I mean, it's essentially containerizing, right, to be able to do your scheduling and everything, kind of like what we've seen in the YARN, Hadoop ecosystem piece.
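A minimal sketch of the CPU-for-a-POC pattern just described, assuming PyTorch: use a GPU when one is present, otherwise fall back to CPU just to validate the steps end to end.

import torch

# Fall back to CPU so a proof of concept runs anywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)

# A tiny model is enough to validate the pipeline end to end
model = torch.nn.Linear(128, 2).to(device)
batch = torch.randn(32, 128).to(device)
print(model(batch).shape)  # torch.Size([32, 2])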
Tobias Macey
0:21:45
In terms of the level of familiarity and understanding that's necessary to build out the underlying infrastructure and work effectively with the data scientists on these deep learning projects, how much knowledge of deep learning and machine learning, and of the mathematics and fundamental principles behind them, should we as data engineers have in order to continue to progress in our careers and work effectively as these types of projects become more prevalent?
Thomas Henson
0:22:16
That's a super good question, and I get it a lot: even just at the basic level, should I understand the algorithms, whether we're talking about machine learning or deep learning? How much should I be able to recommend and evaluate in, say, TensorFlow? It's the classic software engineering answer: it depends. But it really does. If you're in a small organization that's just going down this path, maybe you have a data scientist, or maybe it's more of a data analyst, then you're going to want to be able to carry some of that. I'm not saying you'll be the one recommending, "We should use a CNN here," or, on the machine learning side, "We should use PCA or decision trees," but you definitely want a little more understanding, so that when it comes to your role in the organization you can follow some of the tweaking and the thought process around it and add something to the table. In a large organization that's more mature in its analytics or deep learning journey, you're not going to have to focus as much on the underlying questions of how the math works, what the weights and biases are, and why there are so many different layers.

One thing I've found, having come from the data engineering side and understood a little about the algorithms without really focusing on them, is that the math in deep learning is actually a little more basic than in machine learning. I've heard people say you can get away with first-semester calculus for deep learning, whereas with machine learning the algorithms can be a lot more complex. A lot of it goes back to what we were talking about with how the data is broken up: deep learning is really just matrix math. It's matrix algebra, stacking all these ones and zeros, or RGB values, together. We're using a lot of basic math; it's just really big math. So that's a long answer to say it depends on your organization and where your role sits. From a career perspective I'd encourage you to be a little bit familiar with it and have a natural curiosity about it, but I wouldn't say that you have to go deep.
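To make that "it's really just matrix math" point concrete, here is a tiny sketch in plain NumPy; the shapes and values are made up purely for illustration.

```python
import numpy as np

# A 4x4 RGB image is just a block of numbers: height x width x 3 channels.
image = np.random.rand(4, 4, 3)

# Flatten it into a vector, as a simple fully connected layer would see it.
x = image.reshape(-1)              # shape: (48,)

# A "layer" is nothing more than a weight matrix and a bias vector; the
# forward pass is ordinary matrix algebra, y = Wx + b, plus a nonlinearity.
W = np.random.rand(10, x.size)     # 10 output units
b = np.random.rand(10)
y = np.maximum(0, W @ x + b)       # ReLU(Wx + b)

print(y.shape)                     # (10,) -- the same math, just much bigger, at scale
```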
0:24:50
You're not going to have to go back and get a degree, and you're not going to have to know the intricacies of everything about it. But with the algorithms you're using, and the neural networks you're implementing in your organization, I would be pretty familiar. I just wouldn't stand up and put myself forward as the one recommending which approach we take.

Tobias Macey

You mentioned earlier the possibilities of leveraging some of these deep learning capabilities in the data preparation and ETL processes. So can you talk a bit more about the different ways that we can leverage the capabilities that are promised by deep learning as part of our own work in the data management process?

Thomas Henson

Yeah, that's a good point. Earlier I keyed on the point that we use supervised learning a lot. Just to recap: supervised learning is where, going back to our cat photos, we have a lot of labeled images, this one contains a cat, that one doesn't, so we know the outcome we're looking for. Then I talked about how unsupervised learning is more on the forefront, but you can use unsupervised learning to help out with some of your ETL and your data wrangling. Unsupervised learning is where you say, "Here are a million images, I just want you to classify them," without feeding in labels, so it can help you group things. Like I said, I don't think we're getting out of ETL or data wrangling for a while, but you can use unsupervised learning to pull things together and put some kind of order on all this structured and unstructured data we have. One of the famous examples is sentiment analysis, if you've ever walked through a tutorial on Twitter data. Now think about that same idea here: we can train a neural network to look at a whole bunch of images and put them into some kind of structure. So if your job was to build these training data sets, you could use unsupervised learning to categorize them into clusters, so that instead of looking at a million pictures, maybe you're only looking at 100,000.
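A minimal sketch of that pre-grouping idea might look like the following, using scikit-learn's k-means; the data here is a random stand-in for a pile of unlabeled images (in practice you would more likely cluster embeddings from a pre-trained network rather than raw pixels).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for a large set of unlabeled images,
# flattened to one feature vector per image.
images = np.random.rand(1000, 784)

# Group the unlabeled data into rough clusters so a human labeler only
# has to review a sample from each cluster instead of every image.
kmeans = KMeans(n_clusters=10, random_state=0).fit(images)

print(kmeans.labels_[:20])  # cluster assignment for the first 20 images
```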
Tobias Macey
0:26:58
And as far as how that plays into the infrastructure and processing requirements for actually being able to execute these ETL jobs as we incorporate deep learning, what kind of impact does that have?
Thomas Henson
0:27:12
Yeah, so from a storage perspective there's a big footprint, and it helps to talk a little bit about the different environments. First there's your training environment. This is where we're building out those algorithms and training the neural network toward the outcome we want. Back to our cat identifier: we're sending millions of images through to train those models, so you have to have the storage, and you also have to have the data throughput. These GPUs are some of the most powerful chips on the planet, and specifically for doing things on premises there are requirements around just getting enough power to them; you're limited in how much power they can pull, and then there's cooling and some other requirements. So there's a lot of processing in that part of the workload.

Then we flip to: okay, we've got our trained model, now let's try it out in the real world and see if it's working. That's the inference side, what we call an inference environment. The easiest way I like to think about it, coming from an application development background, is that your training environment is kind of your test/dev, and your inference environment is kind of your production environment. That's where we send a whole bunch of new images in without doing any more training and ask: can it identify a cat or not a cat, or can it drive down this particular road, or do we need to go back and train it some more? But with that said, in application development your test/dev normally isn't your biggest footprint, whereas in deep learning that's where the majority of your data lives, because you're trying to get the most and the best data into training these algorithms, so that when you go into production and inference you've made the best decisions.
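A rough sketch of that training-versus-inference split in Keras might look like this; the model, file name, and data variables are hypothetical placeholders, and it assumes a recent TensorFlow/Keras.

```python
import tensorflow as tf

# Training environment ("test/dev"): fit on the full, large training corpus.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(train_images, train_labels, epochs=5)  # bulk of data/compute lives here

# Inference environment ("production"): export the trained model and serve
# predictions only -- no further training happens here.
model.save('cat_identifier.keras')
serving = tf.keras.models.load_model('cat_identifier.keras')
# predictions = serving.predict(new_images)        # cat / not-cat decisions
```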
Tobias Macey
0:29:34
Particularly in terms of the infrastructure layer, there have been a lot of new offerings coming out from the various cloud providers that aim to provide access to pre-trained neural networks, or managed services for executing these different deep learning algorithms and pipelining data into and out of them. So I'm wondering what your thoughts are on the build versus buy decision around deep learning, given the availability of these managed services. And particularly at the ETL layer, does it make sense to build out your own additional capacity for running these algorithms, or to just start consuming these managed services, at least as an initial step toward simplifying and enhancing your capability to provide meaningful data processing and ETL for feeding into the end product that you're actually aiming for?
Thomas Henson
0:30:10
Yeah, it's a good way to look at it. With the cloud providers, you have what I sometimes think of as a catalog: there are these different approaches and algorithms I can use and turn toward my data. Think of it as a service; it's almost a data scientist in the cloud, or a data scientist behind your browser, where you say, "Here's some data, test this out and see if it's viable for what we want to do." It's a good way to start looking at things. But at the same time, we talked about data gravity: where does the majority of your data live? If the majority of your data already exists in the cloud and you want to take part in these managed services, just understand, from a business perspective, that you're offloading some of what your data scientists or data analysts would do; you're not having to invest as much in the research of figuring out which algorithm is going to be best for your data sets. It gives you a guideline to test against. But if you have a lot of your data on premises, and you've built your own systems there, and you have your own research and your own talent in house, then building is probably going to be the best approach for you. You're going to want to build your own systems and your own models, because your data is unique, and you may not want to transfer and move your data up into some of these managed services. It goes back to the debates we've had through the years in different areas: are we going to outsource to consultants, and so on. It's really about how you want to approach it.

For small teams that maybe don't have a data scientist, there are tools, both on premises and off, that present the same kind of decision: how are we going to approach building out our models? Especially if you're just starting out, there are products like DataRobot and others that give you the ability to feed in a sample of your data and say, "Hey, these approaches might work for you; this might give you the answer you're looking for." From a deep learning perspective we're just starting to see those products and tools come out; we've had them on the machine learning side for some time, but you're starting to see deep learning integrated into a lot of tools.
So I think we'll continue to see this debate, and we'll continue to see product offerings, maybe not even in analytics tools specifically, that are starting to take advantage of deep learning.
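As one example of the "buy" side of that decision, here is a minimal sketch of calling a pre-trained managed vision service, AWS Rekognition's label detection (the other cloud providers have close equivalents); the bucket and object names are hypothetical.

```python
import boto3

# Ask a managed, pre-trained model to label an image already sitting in S3 --
# no training environment, GPUs, or model code on our side.
client = boto3.client('rekognition')

response = client.detect_labels(
    Image={'S3Object': {'Bucket': 'my-image-bucket', 'Name': 'photo.jpg'}},
    MaxLabels=10,
    MinConfidence=80,
)

for label in response['Labels']:
    print(label['Name'], label['Confidence'])  # e.g. "Cat 97.3"
```

The trade-off he describes, data gravity, applies directly here: this only makes sense when the images already live near the service.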
Tobias Macey
0:32:47
What is your personal litmus test for determining when it's useful and practical to use deep learning, as opposed to a traditional machine learning algorithm, or even just a basic decision tree, for providing a given prediction or decision on whatever the input data might happen to be?
Thomas Henson
0:33:09
Yeah, so when we look at that, traditionally what I've seen, going back to what we were talking about, is that deep learning is really good when we're dealing with image data, video data, audio files, those kinds of rich media types. For the most part, with the classic examples, "Can you predict mortgage rates, or housing prices in a certain geo?", where you already have the feature sets, machine learning works really well. It's not that you can't use deep learning for those; there are plenty of examples out there. But traditionally that's still how I see the use cases break down. All that being said, I come from the data engineering side, so if you're listening to this podcast and you're a data engineer with a data scientist in your organization, maybe you should rely on them, but be curious about it too; if their choice is different from what you'd expect, ask them why. That's my rule of thumb, but I'm not going to argue with a data scientist who wants to test some of that out, and like I said, there are plenty of examples out there. But when we're talking about traditional structured or semi-structured data, data where we already have all the features, rather than video, audio, or image files, we can usually stick with the machine learning approach.
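To illustrate the "tabular data with ready-made features" side of that rule of thumb, here is a small scikit-learn sketch on a public housing-price dataset; the model choice is just one reasonable default, not a recommendation from the episode.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Tabular data where the feature set already exists: classic ML is usually enough.
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```

For images, video, or audio, where useful features have to be learned rather than hand-built, the balance tips toward deep learning.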
Tobias Macey
0:34:38
And deep learning algorithms are often a black box in terms of identifying which features contributed to a given decision. With regulations such as GDPR in particular, but others that are either active in different jurisdictions or in the process of being formulated, requirements are being introduced to identify the different factors that played into a decision, especially when it impacts an individual. So how does that factor into your determination of which approach to take for a given project, as far as whether you use deep learning, or machine learning, or just a standard Boolean-logic-based approach?
Thomas Henson
0:35:24
Yeah, that's definitely an interesting question. We're talking about GDPR in this specific example, but think about what's going on with autonomous driving cars: at some point you've got to figure out where the car makes the decision to drive into a brick wall or to hit somebody on a bicycle, and how you can go back and prove that. It may really be the regulation, not the technology, that gates things there. It's the same thing here: can you go back and prove which weight, which bias, drove the decision from a deep learning perspective? There are ways, when you're looking at the neural networks, to use backpropagation and look back and see how we weighted one feature and how it all works. I don't know, from a legal perspective, how that plays out with GDPR, what level of proof you would need. But those are definitely challenges we're looking at, and not even just from a data engineer's or a data scientist's perspective; these are our projects, so we're all involved in this conversation. I don't think it's something we're going to solve here. And with GDPR there are so many different layers to it. From a regulation perspective, being able to say, "These data elements aren't going to leave this specific border," raises real questions: is data going to stay in the country where it originated? And how do you handle data that originated in one country when you've trained models on it and deployed the model elsewhere? You're not moving the data, but how does all of that tie in? Those are huge things that we'll see play out for years and years.
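As a rough illustration of that "look back at which inputs mattered" idea, here is a minimal gradient-based attribution sketch in TensorFlow; the model and input are hypothetical placeholders, and real explainability work typically uses more robust methods (integrated gradients, SHAP, and so on) rather than raw gradients.

```python
import tensorflow as tf

# Hypothetical tiny model and a single 4-feature input record.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
x = tf.random.uniform((1, 4))

# Record the computation so we can differentiate the prediction
# with respect to the input features.
with tf.GradientTape() as tape:
    tape.watch(x)
    prediction = model(x)

saliency = tape.gradient(prediction, x)  # d(prediction)/d(each input feature)
print(saliency.numpy())                  # larger magnitude = more influence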
Tobias Macey
0:37:20
It's a whole big ball of yarn to try and untangle. And then there are the aspects of bias, in terms of how the different features are weighted and what inherent bias the training data carries because of how it was collected or how it's represented. That's an ongoing conversation, and one that I don't know we'll ever find a complete solution to, but it's definitely useful to keep in mind as we build these different products. It's important to be thinking about it even if we don't have a perfect answer, because we shouldn't let the perfect be the enemy of the good, especially as it pertains to people's privacy, their rights, or the inherent biases in how they're represented in the software and projects that we build.
Thomas Henson
0:38:01
You know, I think that could be a multi-part, maybe even ongoing, podcast episode for you to do, right? Just peeling back the layers of GDPR, what our responsibility is, and the things to think about.
Tobias Macey
0:38:16
And for anybody who wants to learn more about deep learning from the perspective of a data engineer, or who might be interested in deploying it for their own projects, or building projects based on it, what are some of the resources that you have found useful, and that you recommend other people take a look at?
Thomas Henson
0:38:34
Yeah, so like I said, I started out with the machine learning course from Coursera, and right now I'm actually going through the deep learning boot camp, I think it's deeplearning.ai, which is an engineering-oriented course around it, with Python development and really cool hands-on work with TensorFlow. That's been very interesting. Then, from my own perspective, like I said, I'm a Pluralsight author, and one of the things I did last year and released this year was a data engineer's course on TensorFlow using something called TFLearn, which is an abstraction layer for TensorFlow. From a data engineer's perspective, think of how we went with Pig Latin: Pig Latin could take you from 140 lines of Java down to eight or ten lines of code, and it's the same idea with TFLearn. So I created a course specific to data engineers on how to get started and build your first neural network. There's a lot out there on Coursera, including some free courses, and Google has a machine learning boot camp; I think they say it's 30 days or something like that, but I was able to knock it out in about two weeks, and it offers labs, explanations, and little quizzes too. There are a lot of resources out there, it's very popular, and there's huge documentation for TensorFlow. So just get out there and start looking at it, and if you don't understand it at first, that's okay; just get in, start learning, and as you keep going through the repetition you'll start to understand.
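For a sense of what he means by the Pig Latin comparison, here is what a first neural network looks like in TFLearn; this is a generic sketch of the library's documented pattern, not material from his course, and the training data variables are hypothetical.

```python
import tflearn

# Define a small network in a handful of lines: input, two layers, training config.
net = tflearn.input_data(shape=[None, 784])                 # e.g. flattened 28x28 images
net = tflearn.fully_connected(net, 128, activation='relu')
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net, optimizer='adam',
                         loss='categorical_crossentropy')

model = tflearn.DNN(net)
# model.fit(X, Y, n_epoch=10, batch_size=64, show_metric=True)  # X, Y: your data
```

The equivalent hand-written TensorFlow 1.x graph code would run to several times this length, which is exactly the abstraction-layer appeal he describes.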
Tobias Macey
0:40:06
And are there any other aspects of deep learning from the perspective of a data engineer that we didn't discuss yet that you think we should cover before we close out the show?
Thomas Henson
0:40:15
No, like I said, I think the three biggest things I would look to from a data engineer's perspective are watching what's going on with projects like Submarine and Spark's Project Hydrogen, and then looking into what NVIDIA is doing; they have documentation and blog posts out there on it. NVIDIA RAPIDS in particular, I think, is going to have a huge impact on our day-to-day jobs as data engineers, both from the aspect of speeding up some of the ETL pipelines and in being able to plug into what's going on with TensorFlow, or PyTorch, or Caffe.
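To give a flavor of the RAPIDS point, here is a minimal sketch of GPU-accelerated ETL with cuDF, which mirrors the pandas API; it assumes a RAPIDS installation and a CUDA-capable GPU, and the file path and column names are hypothetical.

```python
import cudf  # NVIDIA RAPIDS' GPU DataFrame library

# The same read/transform/aggregate steps a pandas ETL job would run,
# but executed on the GPU.
df = cudf.read_csv('events.csv')
daily = df.groupby('event_date').agg({'value': 'sum'})

print(daily.head())
```

The appeal for data engineers is that familiar DataFrame pipelines can move to the same GPUs the deep learning frameworks already use, keeping data preparation and training on one stack.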
Tobias Macey
0:40:48
Right. And for anybody who wants to follow along with you, or get in touch, or see the work that you've been doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Thomas Henson
0:41:05
Yeah, so the biggest gap in tooling for data management is probably going to be in the ETL arena. It's how I started out, right? Like I said, I volunteered for a job, and I didn't have experience in data engineering or in the Hadoop ecosystem, so my first job was doing ETL. And we keep saying through the years, "Hey, this is something we're going to fix; we have this tool or that tool." I'm not saying it's not getting better, but I think it's one of those things that, until we can train the machines to do it for us, is always going to be something we do.
Tobias Macey
0:41:43
All right. Well, I appreciate you taking the time to join me and share your experiences working with deep learning and how it plays into your work as a data engineer. It's definitely useful and interesting to get that background and to keep an eye on the different areas of concern for people working in the industry. So I appreciate that, and I hope you enjoy the rest of your day.

Thomas Henson

You too. Thanks.