
Data Labeling That You Can Feel Good About - Episode 89

Summary

Successful machine learning and artificial intelligence projects require large volumes of data that is properly labeled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides a valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customers’ existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for Machine Learning and more

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what CloudFactory is and the story behind it?
  • What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
  • What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
  • Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
    • What protocols do you have in place to ensure data quality and identify potential sources of bias?
  • What role do humans play in the lifecycle for AI and ML projects?
  • I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
    • How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
  • Can you share some stories of cloud workers who have benefited from their experience working with your company?
  • What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
  • What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
  • What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
  • What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
    • How does that tie into your plans for CloudFactory in the medium to long term?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Scale Your Analytics On The Clickhouse Data Warehouse - Episode 88

Summary

The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.
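
To make the interactive analytics workflow a little more concrete, here is a minimal sketch of talking to ClickHouse from Python using the third-party clickhouse-driver client. The host, table, and column names are made up for illustration and assume a locally running server.

    # Minimal sketch: create a MergeTree table and run an aggregate query.
    # Assumes a local ClickHouse server and the clickhouse-driver package;
    # the table and columns are hypothetical.
    from datetime import date

    from clickhouse_driver import Client

    client = Client(host='localhost')

    client.execute('''
        CREATE TABLE IF NOT EXISTS page_views (
            event_date Date,
            user_id UInt64,
            url String,
            duration_ms UInt32
        ) ENGINE = MergeTree()
        ORDER BY (event_date, user_id)
    ''')

    client.execute(
        'INSERT INTO page_views (event_date, user_id, url, duration_ms) VALUES',
        [(date(2019, 7, 1), 42, '/home', 120)],
    )

    # Columnar storage keeps aggregate scans like this fast and interactive
    rows = client.execute(
        'SELECT url, count() AS views, avg(duration_ms) AS avg_ms '
        'FROM page_views GROUP BY url ORDER BY views DESC'
    )
    print(rows)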

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about Clickhouse, an open source, column-oriented database for fast and scalable OLAP queries

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Clickhouse is and how you each got involved with it?
    • What are the primary use cases that Clickhouse is targeting?
    • Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
  • Can you describe how Clickhouse is architected?
  • Can you talk through the lifecycle of a given record or set of records from when they first get inserted into Clickhouse, through the engine and storage layer, and then the lookup process at query time?
    • I noticed that Clickhouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for Clickhouse?
  • Aside from directly inserting a record via the client APIs can you talk through the options for loading data into Clickhouse?
    • For the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to Clickhouse?
  • What are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to Clickhouse?
  • For someone getting started with Clickhouse can you describe how they should be thinking about data modeling?
  • Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does Clickhouse fall along that spectrum?
  • How is scaling in Clickhouse implemented and what are the edge cases that users should be aware of?
    • How is data replication and consistency managed?
  • What is involved in deploying and maintaining an installation of Clickhouse?
    • I noticed that Altinity is providing a Kubernetes operator for Clickhouse. What are the opportunities and tradeoffs presented by that platform for Clickhouse?
  • What are some of the most interesting/unexpected/innovative ways that you have seen Clickhouse used?
  • What are some of the most challenging aspects of working on Clickhouse itself, and/or implementing systems on top of it?
  • What are the shortcomings of Clickhouse and how do you address them at Altinity?
  • When is Clickhouse the wrong choice?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection - Episode 87

Summary

Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.
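
The episode digs into pushing Kafka and Cassandra to high event volumes; as a purely illustrative toy (and not Instaclustr’s actual design), the sketch below consumes numeric events from a Kafka topic with the kafka-python client and flags values that fall far outside a rolling window. The broker address and topic name are hypothetical.

    # Toy anomaly check over a Kafka stream (illustrative only, not the
    # architecture described in the episode). Assumes the kafka-python
    # package, a broker on localhost, and a hypothetical 'metrics' topic
    # carrying JSON messages like {"value": 12.3}.
    import json
    import statistics
    from collections import deque

    from kafka import KafkaConsumer

    window = deque(maxlen=100)  # rolling window of recent observations

    consumer = KafkaConsumer(
        'metrics',
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    )

    for message in consumer:
        value = message.value['value']
        if len(window) >= 30:
            mean = statistics.mean(window)
            stdev = statistics.pstdev(window) or 1.0
            if abs(value - mean) > 3 * stdev:
                print(f'possible anomaly: {value} (window mean {mean:.2f})')
        window.append(value)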

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
    • What are some example cases where anomaly detection is useful or necessary?
  • Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
  • What was your selection criteria for the various components of your system design?
    • What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
      • If you were to start over today would you do any of it differently?
  • Can you talk through the algorithm that you used for detecting anomalous activity?
    • What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
  • What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
  • What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
    • How did those bottlenecks differ as you moved through different levels of scale?
  • What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
  • What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
  • How have those lessons fed back to your work at Instaclustr?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Workflow Engine For Data Engineers And Data Scientists - Episode 86

Summary

Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.
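
For a sense of what the conversation refers to, here is a minimal sketch of a flow written against the Prefect Core API that was current around this episode: plain functions become tasks, and calling them inside a Flow context builds the dependency graph that Prefect executes and retries for you. The task and flow names are made up.

    # Minimal Prefect (Core) flow: @task wraps ordinary functions and the
    # Flow context records how they are wired together. Names here are
    # hypothetical.
    from prefect import task, Flow


    @task
    def extract():
        return [1, 2, 3]


    @task
    def transform(numbers):
        return [n * 10 for n in numbers]


    @task
    def load(numbers):
        print(f'loaded {numbers}')


    with Flow('example-etl') as flow:
        data = extract()
        cleaned = transform(data)
        load(cleaned)

    if __name__ == '__main__':
        flow.run()  # Prefect tracks state and handles failures per task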

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Prefect is and your motivation for creating it?
  • What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?
  • In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?
  • How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?
  • How do you manage passing data between stages in a pipeline when they are running across distributed nodes?
  • What was your decision making process when deciding to use Dask as your supported execution engine?
    • For tasks that require specific resources or dependencies how do you approach the idea of task affinity?
  • Does Prefect support managing tasks that bridge network boundaries?
  • What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?
  • What are the limitations of the open source core as compared to the cloud offering that you are building?
  • What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?
  • What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?
  • When is Prefect the wrong choice?
  • In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?
    • What are some best practices and industry trends that you are most excited by?
  • What do you have planned for the future of the Prefect project and company?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Maintaining Your Data Lake At Scale With Spark - Episode 85

Summary

Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.
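
As a small illustration of the usage pattern discussed here, the PySpark sketch below writes a table in the delta format and then reads an earlier version back with time travel. The path and sample data are hypothetical, and the session is assumed to have been launched with the delta-core package available.

    # Illustrative Delta Lake usage with PySpark (path and data are made up;
    # the Spark session must have the delta-core package on its classpath).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('delta-example').getOrCreate()

    path = '/tmp/events_delta'

    events = spark.createDataFrame(
        [(1, 'click'), (2, 'view')], ['user_id', 'event']
    )
    events.write.format('delta').mode('overwrite').save(path)

    # Each append becomes a new, atomically committed version of the table
    more_events = spark.createDataFrame([(3, 'purchase')], ['user_id', 'event'])
    more_events.write.format('delta').mode('append').save(path)

    # Time travel: read the table as of its first version
    first_version = spark.read.format('delta').option('versionAsOf', 0).load(path)
    first_version.show()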

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Delta Lake is and the motivation for creating it?
  • What are some of the common antipatterns in data lake implementations and how does Delta Lake address them?
    • What are the benefits of a data lake over a data warehouse?
      • How has that equation changed in recent years with the availability of modern cloud data warehouses?
  • How is Delta Lake implemented and how has the design evolved since you first began working on it?
    • What assumptions did you have going into the project and how have they been challenged as it has gained users?
  • One of the compelling features is the option for enforcing data quality constraints. Can you talk through how those are defined and tested?
    • In your experience, how do you manage schema evolution when working with large volumes of data? (e.g. rewriting all of the old files, or just eliding the missing columns/populating default values, etc.)
  • Can you talk through how Delta Lake manages transactionality and data ownership? (e.g. what if you have other services interacting with the data store)
    • Are there limits in terms of the volume of data that can be managed within a single transaction?
  • How does unifying the interface for Spark to interact with batch and streaming data sets simplify the workflow for an end user?
    • The Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach?
  • What have been the most difficult/complex/challenging aspects of building Delta Lake?
  • How is the data versioning in Delta Lake implemented?
    • By keeping a copy of all iterations of a data set there is the opportunity for a great deal of additional cost. What are some options for mitigating that impact, either in Delta Lake itself or as a separate mechanism or process?
  • What are the reasons for standardizing on Parquet as the storage format?
    • What are some of the cases where that has led to greater complications?
  • In addition to the transactionality and data validation that Delta Lake provides, can you also explain how indexing is implemented and highlight the challenges of keeping them up to date?
  • When is Delta Lake the wrong choice?
    • What problems did you consciously decide not to address?
  • What is in store for the future of Delta Lake?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Managing The Machine Learning Lifecycle - Episode 84

Summary

Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.
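
As a generic illustration of one of the monitoring ideas touched on in this conversation (and not Hydrosphere’s own API), the sketch below compares a production feature distribution against the training distribution with a two-sample Kolmogorov-Smirnov test as a crude drift signal for retraining. The data is synthetic and the threshold is arbitrary.

    # Generic drift check (not Hydrosphere's API): compare training and
    # production distributions for one feature; a low p-value suggests the
    # inputs have shifted and the model may need retraining. Synthetic data.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(seed=7)
    training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
    production_feature = rng.normal(loc=0.4, scale=1.2, size=5000)  # drifted

    statistic, p_value = ks_2samp(training_feature, production_feature)
    if p_value < 0.01:
        print(f'feature drift detected (KS statistic {statistic:.3f})')
    else:
        print('no significant drift detected')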

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Hydrosphere is and share its origin story?
  • In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?
    • How does it differ from deployment and maintenance of a regular software application?
  • Can you describe how Hydrosphere is architected and how the different components of the stack fit together?
  • For someone who is using Hydrosphere in their production workflow, what would that look like?
    • What is the difference in interaction with Hydrosphere for different roles within a data team?
  • What are some of the types of metrics that you monitor to determine when and how to retrain deployed models?
    • Which metrics do you track for testing and verifying the health of the data?
  • What are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?
  • How has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?
    • How has that influenced the design and direction of Hydrosphere, both as a project and a business?
    • How has the design of Hydrosphere evolved since you first began working on it?
  • What assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?
  • What have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?
  • What do you have in store for the future of Hydrosphere?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Evolving An ETL Pipeline For Better Productivity - Episode 83

Summary

Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
  • Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
    • What are your primary sources of data and what are the targets that you are loading them into?
  • What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
    • What were your criteria for your replacement technology and how did you gather and evaluate your options?
  • Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
    • What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
    • What were the big wins?
  • What was your evaluation framework for determining whether your re-engineering was successful?
  • Now that you are using DataCoral how would you characterize the experiences of yourself and your team?
    • If you have freed up time for your engineers, how are you allocating that spare capacity?
  • What do you hope to see from DataCoral in the future?
  • What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Lineage For Your Pipelines - Episode 82

Summary

Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.
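
For readers unfamiliar with Pachyderm, a pipeline is declared as a JSON spec that names an input repo and a container to run over it; the sketch below builds a hypothetical spec in Python and writes it to a file that could be submitted with pachctl create pipeline -f. The pipeline name, image, and command are made up for illustration.

    # Sketch of the shape of a Pachyderm pipeline spec (all names here are
    # hypothetical). The resulting JSON file would be submitted with
    # `pachctl create pipeline -f wordcount.json`; Pachyderm then runs the
    # container over each datum in the input repo and tracks the provenance
    # of every output commit.
    import json

    pipeline_spec = {
        "pipeline": {"name": "wordcount"},
        "transform": {
            "image": "python:3.7",
            "cmd": ["python3", "/code/count_words.py", "/pfs/documents", "/pfs/out"],
        },
        "input": {
            "pfs": {"repo": "documents", "glob": "/*"}
        },
    }

    with open("wordcount.json", "w") as handle:
        json.dump(pipeline_spec, handle, indent=2)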

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pachyderm is and how it got started?
    • What is new in the last two years since I talked to Dan Whitenack in episode 1?
    • How have the changes and additional features in Kubernetes impacted your work on Pachyderm?
  • A recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?
  • Can you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?
    • How does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?
  • There are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?
  • Data provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?
    • What is the interface for exposing and exploring that provenance data?
  • What are some of the advanced capabilities of Pachyderm that you would like to call out?
  • With your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. How are you approaching that and what are some of the challenges you are facing?
  • What have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?
  • What do you have planned for the future of Pachyderm?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Build Your Data Analytics Like An Engineer - Episode 81

Summary

In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.
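
The interview below digs into how dbt models are essentially SELECT statements templated with Jinja; as a standalone illustration of that idea (not dbt’s actual implementation), the sketch below renders a parameterized SELECT with the jinja2 package, using a stand-in ref() function and hypothetical table names.

    # Standalone illustration of the idea behind dbt models (not dbt itself):
    # a transformation written as a SELECT statement, with Jinja resolving
    # references to upstream relations at compile time. Names are made up.
    from jinja2 import Template

    model_sql = Template('''
    select
        order_id,
        customer_id,
        sum(amount) as total_amount
    from {{ ref("raw_orders") }}
    group by order_id, customer_id
    ''')


    def ref(name):
        # Stand-in for dbt's ref(): map a model name to a schema-qualified
        # relation so the compiled SQL can run against the warehouse.
        return f'analytics.{name}'


    print(model_sql.render(ref=ref))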

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what DBT is and your motivation for creating it?
  • Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?
  • Can you talk through the workflow for someone using DBT?
  • One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?
  • The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?
    • Are these packages driven by Fishtown Analytics or the dbt community?
  • What are the limitations of modeling everything as a SELECT statement?
  • Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?
    • What are your thoughts on higher level approaches to SQL that compile down to the specific statements?
  • Can you explain how DBT is implemented and how the design has evolved since you first began working on it?
  • What are some of the features of DBT that are often overlooked which you find particularly useful?
  • What are some of the most interesting/unexpected/innovative ways that you have seen DBT used?
  • What are the additional features that the commercial version of DBT provides?
  • What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?
  • When is it the wrong choice?
  • What do you have planned for the future of DBT?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Using FoundationDB As The Bedrock For Your Distributed Systems - Episode 80

Summary

The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.
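
To make the key-value-plus-transactions model concrete, here is a minimal sketch using the FoundationDB Python bindings: a function decorated with fdb.transactional reads and writes several keys as one ACID transaction and is retried automatically on conflict. It assumes a running cluster and the foundationdb package; the keys and API version shown are illustrative.

    # Minimal FoundationDB sketch: move "funds" between two keys atomically.
    # Assumes a running cluster reachable via the default cluster file and
    # the foundationdb Python package; key names are hypothetical.
    import fdb

    fdb.api_version(610)
    db = fdb.open()


    @fdb.transactional
    def transfer(tr, source, dest, amount):
        # All reads and writes inside this function commit as one transaction
        src_balance = fdb.tuple.unpack(tr[source])[0] - amount
        dst_balance = fdb.tuple.unpack(tr[dest])[0] + amount
        tr[source] = fdb.tuple.pack((src_balance,))
        tr[dest] = fdb.tuple.pack((dst_balance,))


    db[b'acct.alice'] = fdb.tuple.pack((100,))
    db[b'acct.bob'] = fdb.tuple.pack((0,))
    transfer(db, b'acct.alice', b'acct.bob', 25)
    print(fdb.tuple.unpack(db[b'acct.alice'])[0],
          fdb.tuple.unpack(db[b'acct.bob'])[0])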

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Worl about FoundationDB, a distributed key/value store that gives you the power of ACID transactions in a NoSQL database

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you explain what FoundationDB is and how you got involved with the project?
  • What are some of the unique use cases that FoundationDB enables?
  • Can you describe how FoundationDB is architected?
    • How is the ACID compliance implemented at the cluster level?
  • What are some of the mechanisms built into FoundationDB that contribute to its fault tolerance?
    • How are conflicts managed?
  • FoundationDB has an interesting feature in the form of Layers that provide different semantics on the underlying storage. Can you describe how that is implemented and some of the interesting layers that are available?
    • Is it possible to apply different layers, such as relational and document, to the same underlying objects in storage?
  • One of the aspects of FoundationDB that is called out in the documentation and which I have heard about elsewhere is the performance that it provides. Can you describe some of the implementation mechanics of FoundationDB that allow it to provide such high throughput?
  • For someone who wants to run FoundationDB can you describe a typical deployment topology?
    • What are the scaling factors for the underlying storage and for the Layers that are operating on the cluster?
  • Once you have a cluster deployed, what are some of the edge cases that users should watch out for?
    • How are version upgrades managed in a cluster?
  • What are some of the ways that FoundationDB impacts the way that an application developer or data engineer would architect their software as compared to working with something like Postgres or MongoDB?
  • What are some of the more interesting/unusual/unexpected ways that you have seen FoundationDB used?
  • When is FoundationDB the wrong choice?
  • What is in store for the future of FoundationDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA