Machine Learning

Deep Learning For Data Engineers - Episode 71

Summary

Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off
  • Your host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?
  • What has been your personal experience with deep learning and what set you down that path?
  • What is involved in building a data pipeline and production infrastructure for a deep learning product?
    • How does that differ from other types of analytics projects such as data warehousing or traditional ML?
  • For anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?
  • What are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?
  • What are some ways that we can use deep learning as part of the data management process?
    • How does that shift the infrastructure requirements for our platforms?
  • Cloud providers have been releasing numerous products to provide deep learning and/or GPUs as a managed platform. What are your thoughts on that layer of the build vs buy decision?
  • What is your litmus test for whether to use deep learning vs explicit ML algorithms or a basic decision tree?
    • Deep learning algorithms are often a black box in terms of how decisions are made, however regulations such as GDPR are introducing requirements to explain how a given decision gets made. How does that factor into determining what approach to take for a given project?
  • For anyone who wants to learn more about deep learning, what are some resources that you recommend?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building Machine Learning Projects In The Enterprise - Episode 69

Summary

Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Kevin Dewalt about his experiences at Prolego, building machine learning projects for Fortune 500 companies

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • For the benefit of software engineers and team leaders who are new to machine learning, can you briefly describe what machine learning is and why is it relevant to them?
  • What is your primary mission at Prolego and how did you identify, execute on, and establish a presence in your particular market?
    • How much of your sales process is spent on educating your clients about what AI or ML are and the benefits that these technologies can provide?
  • What have you found to be the technical skills and capacity necessary for being successful in building and deploying a machine learning project?
    • When engaging with a client, what have you found to be the most common areas of technical capacity or knowledge that are needed?
  • Everyone talks about a talent shortage in machine learning. Can you suggest a recruiting or skills development process for companies which need to build out their data engineering practice?
  • What challenges will teams typically encounter when creating an efficient working relationship between data scientists and data engineers?
  • Can you briefly describe a successful project of developing a first ML model and putting it into production?
    • What is the breakdown of how much time was spent on different activities such as data wrangling, model development, and data engineering pipeline development?
    • When releasing to production, can you share the types of metrics that you track to ensure the health and proper functioning of the models?
    • What does a deployable artifact for a machine learning/deep learning application look like?
  • What basic technology stack is necessary for putting the first ML models into production?
    • How does the build vs. buy debate break down in this space and what products do you typically recommend to your clients?
  • What are the major risks associated with deploying ML models and how can a team mitigate them?
  • Suppose a software engineer wants to break into ML. What data engineering skills would you suggest they learn? How should they position themselves for the right opportunity?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Building Enterprise Big Data Systems At LEGO - Episode 66

Summary

Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper S√łgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • My understanding is that the big data group at LEGO is a fairly recent development. Can you share the story of how it got started?
    • What kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data?
    • What was the transition process like, migrating data silos into a uniformly managed platform?
  • What are the biggest data challenges that you face at LEGO?
  • What are some of the most critical sources and types of data that you are managing?
  • What are the main components of the data infrastructure that you have built to support the organizations analytical needs?
    • What are some of the technologies that you have found to be most useful?
    • Which have been the most problematic?
  • What does the team structure look like for the data services at LEGO?
    • Does that reflect in the types/numbers of systems that you support?
  • What types of testing, monitoring, and metrics do you use to ensure the health of the systems you support?
  • What have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO?
  • How have the data systems at Lego evolved over recent years as new technologies and techniques have been developed?
  • How does the global nature of the LEGO business influence the design strategies and technology choices for your platform?
  • What are you most excited for in the coming year?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Summary

Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Spark is?
    • What are some of the main use cases for Spark?
    • What are some of the problems that Spark is uniquely suited to address?
    • Who uses Spark?
  • What are the tools offered to Spark users?
  • How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
  • For someone building on top of Spark what are the main software design paradigms?
    • How does the design of an application change as you go from a local development environment to a production cluster?
  • Once your application is written, what is involved in deploying it to a production environment?
  • What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
  • What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
  • What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
  • What are the limitations of the Spark programming model?
    • What are the cases where Spark is the wrong choice?
  • What was your motivation for writing a book about Spark?
    • Who is the target audience?
  • What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
  • What advice do you have for anyone who is considering or currently using Spark?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Book Discount

  • Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

Summary

As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Deon is and your motivation for creating it?
  • Why a checklist, specifically? What’s the advantage of this over an oath, for example?
  • What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
  • What is the typical workflow for a team that is using Deon in their projects?
  • Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
    • Have you received pushback on any of the default items?
  • How does Deon simplify communication around ethics across team boundaries?
  • What are some of the most often overlooked items?
  • What are some of the most difficult ethical concerns to comply with for a typical data science project?
  • How has Deon helped you at Driven Data?
  • What are the customer facing impacts of embedding a discussion of ethics in the product development process?
  • Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
  • What are your hopes for the future of the Deon project?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

An Agile Approach To Master Data Management with Mark Marinelli - Episode 46

Summary

With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by establishing a definition of data mastering that we can work from?
    • How does the master data set get used within the overall analytical and processing systems of an organization?
  • What is the traditional workflow for creating a master data set?
    • What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
    • What are the steps that an organization can take to evolve toward an agile approach to data mastering?
  • At what scale of company or project does it makes sense to start building a master data set?
  • What are the limitations of using ML/AI to merge data sets?
  • What are the limitations of a golden master data set in practice?
    • Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
    • Are there specific problem domains that are more likely to benefit from a master data set?
  • Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
  • What storage mechanisms are typically used for managing a master data set?
    • Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that goes beyond the rest of their data infrastructure?
    • How do you manage latency issues when trying to reference the same entities from multiple disparate systems?
  • What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?
    • What suggestions do you have to help prevent such a project from being derailed?
  • What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

Summary

Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • To start, can you explain the problem space that Alegion is targeting and how you operate?
  • When is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects?
  • What are some of the biggest challenges associated with managing human input to data sets intended for machine usage?
  • For someone who is acting as human-intelligence provider as part of the workforce, what does their workflow look like?
    • What tools and processes do you have in place to ensure the accuracy of their inputs?
    • How do you prevent bad actors from contributing data that would compromise the trained model?
  • What are the limitations of crowd-sourced data labels?
    • When is it beneficial to incorporate domain experts in the process?
  • When doing data collection from various sources, how do you ensure that intellectual property rights are respected?
  • How do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers?
    • What kinds of metadata do you track and how is that recorded/transmitted?
  • Do you think that human intelligence will be a necessary piece of ML/AI forever?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

Summary

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Interview

Alan Anders from Applecart

  • What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?
  • What are the biggest technical hurdles at Applecart?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Stepan Pushkarev from Hydrosphere.io

  • What is Hydropshere.io?
  • What metrics do you track to determine when a machine learning model is not producing an appropriate output?
  • How do you determine which data points to sample for retraining the model?
  • How does the role of a machine learning engineer differ from data engineers and data scientists?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28

Summary

The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is OctopAI and what was your motivation for founding it?
  • What are some of the types of information that you classify and collect as metadata?
  • Can you talk through the architecture of your platform?
  • What are some of the challenges that are typically faced by metadata management systems?
  • What is involved in deploying your metadata collection agents?
  • Once the metadata has been collected what are some of the ways in which it can be used?
  • What mechanisms do you use to ensure that customer data is segregated?
    • How do you identify and handle sensitive information during the collection step?
  • What are some of the most challenging aspects of your technical and business platforms that you have faced?
  • What are some of the plans that you have for OctopAI going forward?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Future Data Economy with Roger Chen - Episode 21

Summary

Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an investor for data driven businesses Roger Chen is well versed in the challenges and solutions being facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • A few announcements:
    • The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
    • If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
  • Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term?
  • What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?
  • Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?
  • Many companies view their data as a strategic asset and are therefore loathe to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?
  • What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?
  • What do you view as being the privacy implications from creating and sharing these larger pools of data inventory?
  • What do you view as some of the technical challenges associated with identifying and separating shared data from those that are specific to the business model of the organization?
  • With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA