Ship Faster With An Opinionated Data Pipeline Framework - Episode 100

Summary

Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time on gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at QuantumBlack for customer projects, and how it can help you structure your own. It is definitely worth a listen to better understand the benefits that a standardized process can provide.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scalable, deployable, robust, and versioned data pipelines.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Kedro is and its origin story?
  • Who are the primary users of Kedro, and how does it fit into and impact the workflow of data engineers and data scientists?
    • Can you talk through a typical lifecycle for a project that is built using Kedro?
  • What are the overall features of Kedro and how do they compound to encourage best practices for data projects?
  • How does the culture and background of QuantumBlack influence the design and capabilities of Kedro?
    • What was the motivation for releasing it publicly as an open source framework?
  • What are some examples of ways that Kedro is being used within QuantumBlack and how has that experience informed the design and direction of the project?
  • Can you describe how Kedro itself is implemented and how it has evolved since you first started working on it?
  • There has been a recent trend away from end-to-end ETL frameworks and toward a decoupled model that focuses on a programming target with pluggable execution. What are the industry pressures that are driving that shift and what are your thoughts on how that will manifest in the long term?
  • How do the capabilities and focus of Kedro compare to similar projects such as Prefect and Dagster?
  • It has not yet reached a stable release. What are the aspects of Kedro that are still in flux and where are the changes most concentrated?
    • What is still missing for a stable 1.x release?
  • What are some of the most interesting/innovative/unexpected ways that you have seen Kedro used?
  • When is Kedro the wrong choice?
  • What do you have in store for the future of Kedro?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
