Companies

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Looker is and the problem that it is aiming to solve?
    • How do you define business intelligence?
  • How is Looker unique from other approaches to business intelligence in the enterprise?
    • How does it compare to open source platforms for BI?
  • Can you describe the technical infrastructure that supports Looker?
  • Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
  • For someone who is using Looker, what does their workflow look like?
    • How does that change for different user roles (e.g. data engineer vs sales management)?
  • What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
  • What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
    • What are the portions of the Looker architecture that you would do differently if you were to start over today?
  • What are some of the most interesting or unusual uses of Looker that you have seen?
  • What is in store for the future of Looker?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

Summary

Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
    • Where are you using notebooks and where are you not?
  • What is the technical infrastructure that you have built to support that design choice?
  • Which team was driving the effort?
    • Was it difficult to get buy in across teams?
  • How much shared code have you been able to consolidate or reuse across teams/roles?
  • Have you investigated the use of any of the other notebook platforms for similar workflows?
  • What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
  • What are some of the limitations of the notebook environment for the work that you are doing?
  • What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
  • What are some of the projects that are ongoing or planned for the future that you are most excited by?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51

Summary

One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application-oriented workloads and analytical, high-volume workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a NewSQL database built for simultaneous transactional and analytic workloads

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what MemSQL is and how the product and business first got started?
  • What are the typical use cases for customers running MemSQL?
  • What are the benefits of integrating the ingestion pipeline with the database engine?
    • What are some typical ways that the ingest capability is leveraged by customers?
  • How is MemSQL architected and how has the internal design evolved from when you first started working on it?
    • Where does it fall on the axes of the CAP theorem?
    • How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
    • Can you describe the lifecycle of a write transaction?
  • Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
    • How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
  • How much of the implementation of MemSQL is using custom built code vs. open source projects?
  • What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
  • What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
  • When is MemSQL the wrong choice for a data platform?
  • What do you have planned for the future of MemSQL?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
    • How do you define the concept of a knowledge graph?
  • What are the processes involved in constructing a knowledge graph?
  • Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
  • What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
    • How do you manage the software lifecycle for your ETL code?
    • What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
  • What are the current challenges that you are facing in building and scaling your data infrastructure?
    • How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
    • What techniques are you using to manage accuracy and consistency in the data that you ingest?
  • Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
  • What are the weak spots in your platform that you are planning to address in upcoming projects?
    • If you were to start from scratch today, what would you have done differently?
  • What are some of the most interesting or unexpected uses of your product that you have seen?
  • What is in store for the future of Enigma?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

  • Introductions
  • How did you get involved in the area of data engineering and data management?
  • What is Snowplow Analytics and what problem were you trying to solve when you started the company?
  • What is unique about customer event data from an ingestion and processing perspective?
  • Challenges with properly matching up data between sources
  • Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
    • Cleanliness/accuracy
  • What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
  • Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
    • How has that architecture evolved from when you first started?
    • What would you do differently if you were to start over today?
  • Ensuring appropriate use of enrichment sources
  • What have been some of the biggest challenges encountered while building and evolving Snowplow?
  • What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47

Summary

Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?
    • What types of data are you focused on supporting?
    • What are the challenges inherent to scaling an Elasticsearch infrastructure to large volumes of log or metric data?
  • Is there any need for an Elasticsearch cluster in addition to Chaos Search?
  • For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3?
  • What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL?
  • Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS?
  • What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster?
  • What is the system architecture that you have built to allow for querying terabytes of data in S3?
    • What are the biggest contributors to query latency and what have you done to mitigate them?
  • What are the options for access control when running queries against the data stored in S3?
  • What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen?
  • What are your plans for the future of Chaos Search?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45

Summary

There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use

Interview

  • Introduction
  • How did you get involved in the area of data security?
  • Can you start by explaining what your mission is with Enveil and how the company got started?
  • One of the core aspects of your platform is the principle of homomorphic encryption. Can you explain what that is and how you are using it?
    • What are some of the challenges associated with scaling homomorphic encryption?
    • What are some difficulties associated with working on encrypted data sets?
  • Can you describe the underlying architecture for your data platform?
    • How has that architecture evolved from when you first began building it?
  • What are some use cases that are unlocked by having a fully encrypted data platform?
  • For someone using the Enveil platform, what does their workflow look like?
  • A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?
  • What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns)
  • What do you have planned for the future of Enveil?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data security today?

Links

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is Ona and how did the company get started?
    • What are some examples of the types of customers that you work with?
  • What types of data do you support in your collection platform?
  • What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
  • Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
  • What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
  • Can you describe the flow of the data from collection through to analysis?
  • To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
    • What are the architectural considerations that you factored in when designing it?
    • What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
  • What are your plans for the future of Ona and Canopy?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

Summary

Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • To start, can you explain the problem space that Alegion is targeting and how you operate?
  • When is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects?
  • What are some of the biggest challenges associated with managing human input to data sets intended for machine usage?
  • For someone who is acting as a human-intelligence provider as part of the workforce, what does their workflow look like?
    • What tools and processes do you have in place to ensure the accuracy of their inputs?
    • How do you prevent bad actors from contributing data that would compromise the trained model?
  • What are the limitations of crowd-sourced data labels?
    • When is it beneficial to incorporate domain experts in the process?
  • When doing data collection from various sources, how do you ensure that intellectual property rights are respected?
  • How do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers?
    • What kinds of metadata do you track and how is that recorded/transmitted?
  • Do you think that human intelligence will be a necessary piece of ML/AI forever?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

Summary

Collaboration, distribution, and installation of software projects are largely solved problems, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that aims to make data as collaborative and easy to work with as code is with GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is the intended use case for Quilt and how did the project get started?
  • Can you step through a typical workflow of someone using Quilt?
    • How does that change as you go from a single user to a team of data engineers and data scientists?
  • Can you describe the elements that a data package consists of?
    • What were your criteria for the file formats that you chose?
  • How is Quilt architected and what have been the most significant changes or evolutions since you first started?
  • How is the data registry implemented?
    • What are the limitations or edge cases that you have run into?
    • What optimizations have you made to accelerate synchronization of the data to and from the repository?
  • What are the limitations in terms of data volume, format, or usage?
  • What is your goal with the business that you have built around the project?
  • What are your plans for the future of Quilt?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA