Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.
Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to construct its infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral for more information.
Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications
- How did you get involved in the area of data management?
- Can you start by explaining what Dagster is and the origin story for the project?
- In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for?
- Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?
- What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?
- What are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?
- For someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?
- For someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?
- Once they have something working, what is involved in deploying it?
- One of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or "solids". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?
- Another difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?
- It is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?
- Can you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?
- What are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?
- Your recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?
- You have built a new company around Dagster in the form of Elementl. How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?
- What should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?
- What is on your roadmap that you consider necessary before creating a 1.0 release?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Matei Zaharia
- DataOps Episode
- Supervised Learning
- Maxime Beauchemin
- Dagster Testing Guide
- Great Expectations