Build Maintainable And Testable Data Applications With Dagster

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

When you're ready to build your next pipeline or you want to test out the project you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform.

If you need global distribution, they've got that coverage too with worldwide data centers, including new ones in Toronto and Mumbai.

And for your machine learning workloads, they just announced dedicated CPU instances.

Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance.

Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events and take advantage of our partner discounts to save money when you register.

Your host is Tobias Macey. And today, I'm interviewing Nick Schrock about Dagster, an open source system for building modern data applications. So, Nick, can you start by introducing yourself?

Yeah. Thanks for having me, Tobias.

My name is Nick Schrock. I'm the founder of a company called Elementl, and our current project is, as you mentioned, this open source framework for building data applications, which is kind of the word we use for describing systems like ETL pipelines and ML pipelines, and I'm sure we're gonna get into that. Before Elementl and Dagster, the bulk of my career was spent at Facebook, where I worked from 2009 to 2017. I worked through most of my career on a team that I formed called product infrastructure, whose job it was to produce technology to empower our product developers and the users that they serve. And, you know, that team ended up producing some open source artifacts of note, namely React, which I had nothing to do with, but I worked next to those folks for years, and then GraphQL, which I was one of the cocreators of.

And so from that, can you explain how you first got involved in the area of data management?

Yeah. Absolutely.

So I left Facebook in February of 2017, which is actually a little over 2 years ago. You know, I took some time off, but I was thinking about what I was gonna do next. And I actually started talking to people across various industries, because I was looking for almost like a non-tech industry to work in that needed tech help. I mean, like, a legacy industry like health care or finance, those types of industries.

And as I was talking to people across various companies and organizations, I would ask them what their primary technology challenges were, and data engineering, data integration, ML pipelines, analytics, etcetera kept on coming up over and over again. I would then go to practitioners in the field and ask them, like, hey, can you show me what your workflow looks like and what your tools look like? And, you know, there are amazing components of technology in this sector, but when you look at the developer experience, or what I'll call the builder experience, because it's not just software developers, analysts and data scientists also participate in this, for someone with my background in what I fondly call the full hipster stack, meaning, like, React, GraphQL, and the associated technologies, the aesthetics and the tooling are just not of the quality that I was accustomed to.

And then you would go back and talk to these business leaders after talking to their engineers, and they would say something like, listen, our ability to transform health care, we think, is actually limited by our ability to do data processing. I remember this meeting distinctly. I was like, wait, wait, wait, you're telling me that what you think is preventing you from transforming American health care is the ability to do the moral equivalent of regularized computation on a CSV file? And they're like, yeah, that's probably it. And I was just like, this is crazy. And that kind of started me on the path of looking into this.

And given the fact that you didn't have a lot of

background context in data management and data engineering before that, I'm curious how you managed to get up to speed and get so embroiled in the overall space of data engineering and data tooling, and how you identified

where to approach the problem.

I mean, it's just pure immersion. I just started reading and consuming as much material as possible and talking to as many people as possible. I knew it was time to go back to work when I was actually on my honeymoon with my wife, and we were on a train, and I was reading the PhD thesis of Matei Zaharia, the founder of Spark. She was like, Nick, this is ridiculous. Put that paper down, and you're going back to work when we get home.

And, you know, Tobias, your podcast and podcasts like it have been utterly invaluable. Through those podcasts, I was also able to connect with like-minded people and really get their feedback and understand what they were doing. I particularly, for example, loved your episode with Chris Bergh about DataOps. I thought that was super insightful. But effectively, when you start learning something, everyone in the history of the world who knows something at some point did not know that thing. So you just put one step in front of the other, start reading every single thing out there, talking to every person that you know about it, and then just start building and experimenting with stuff.

So from all of that, you ended up creating the Dagster project. I'm wondering if you can just explain a bit about what it is and some of the early steps of getting along that path and understanding how to approach the problem.

Yeah. So, you know, a lot of this comes from my background; everything is biased and seen through the lens of your previous experiences. So I was definitely trying to think about what the design principles were that led to things like GraphQL and that I thought were applicable in this space. And I started to think a lot about why programming in data management, broadly, feels so different. What are the properties that make it so that software engineering practices end up being different in this domain than in a traditional application domain? And as I was thinking about that, one of the properties of these systems that stuck out to me is the relationship between the computation and the underlying data.

Meaning, in a traditional application, you have, let's say, a single database table, and that is manipulated in a transactional fashion, meaning there are lots of different pieces of software and entry points into the system that are mutating that table. Right? Like, this user updates this setting from this endpoint, and this other user updates this other setting from this other endpoint, all on that shared state. One of the big things that's different about this domain of computation is that typically there's a one-to-one correlation between a data asset and the computation that produced it. Meaning that if there's a data lake somewhere and there's a set of Parquet files being produced, or, to simplify, just a Parquet file, typically there has only been one logical computation that has been producing that thing throughout time. You have a function somewhere, in a very abstract sense a piece of computation like a Spark job, and it's been producing daily partitions of Parquet files over and over again. And there's a one-to-one correlation between that set of partitions in the data lake and the computation that produced it. You can actually generalize that to almost anything. All of these systems, whether you call them ETL pipelines or ML supervised learning processes or whatever, are typically just DAGs of functions that consume and produce data assets.

And what was really interesting about focusing on the computation itself is that it is actually a more essential definition of the data than the data itself in some ways. Let me give you an example. Imagine that you had a computation that said, hey, I'm a computation and I produce a sequence of tuples that have strings and ints, and imagine that you could, in a really standardized way, instruct that computation to conditionally generate either a CSV with that schema or a JSON file with that schema. In reality, it's the computation that is the source of truth there, and not the produced CSV or JSON file. So it was kind of this, like, hey, why don't we start focusing on attaching metadata and a type system and a standardized API around these broad computations instead of the data itself, and that was the fundamental insight that led to the project.
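The example of a computation that yields string-and-int tuples and can be rendered as either CSV or JSON can be sketched in plain Python. This is only an illustration under stated assumptions, not Dagster's API: the names `ROW_SCHEMA`, `produce_rows`, and `materialize`, and the sample rows, are hypothetical.

```python
import csv
import io
import json
from typing import Iterator, Tuple

# Hypothetical schema: the computation declares that it yields (str, int) tuples.
ROW_SCHEMA = (("name", str), ("count", int))


def produce_rows() -> Iterator[Tuple[str, int]]:
    """The computation itself: the single source of truth for the data."""
    yield ("apples", 3)
    yield ("pears", 5)


def materialize(fmt: str) -> str:
    """Render the same computation as CSV or JSON, checking the schema as we go."""
    columns = [name for name, _ in ROW_SCHEMA]
    rows = []
    for row in produce_rows():
        for value, (name, expected) in zip(row, ROW_SCHEMA):
            if not isinstance(value, expected):
                raise TypeError(f"{name} should be {expected.__name__}")
        rows.append(row)

    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(columns)
        writer.writerows(rows)
        return buffer.getvalue()
    if fmt == "json":
        return json.dumps([dict(zip(columns, row)) for row in rows])
    raise ValueError(f"unsupported format: {fmt}")
```

Calling `materialize("csv")` and `materialize("json")` yields two different files from one declared computation, which is why the schema and metadata are attached to the function rather than to either output.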

And in the tagline, you use the term modern data application rather than something people might be more familiar with, such as an ETL framework or a tool for building data pipelines. And I'm just wondering if you can describe your thinking in terms of what you mean when you say data application and some of the main types of use cases that Dagster is well suited for.

Totally.

So, let's frame this by talking about the term ETL. And again, I think this is part of the benefit of me coming in fresh to this about a year and a half ago and assessing, like, what is all this different terminology that people use, and why does it exist? So specifically, let's talk about that term: extract, transform, load. Its historical etymology, the traditional one, is that you have Oracle systems, with a transactional database on one side and a data warehouse on the other side. And every night, you do a one-time transformation that extracts data out of the transactional database, does some computation on it, and then loads it into the data warehouse.

But what people call ETL today looks absolutely nothing like that. It is typically multi-stage, with multiple stages of materialization. It typically passes through multiple different computational engines: the ingest might be through Kafka or a tool like Fivetran, then the data might sit in a data warehouse for a while, or maybe Spark will operate on it, and then different systems after that. So the term ETL is no longer attached to its original definition. When people say ETL today, they effectively mean any computation done in the cloud.

The reason why we're interested in capturing a new term, data application, for these is, one, I believe that ETL, data pipelines, and supervised learning processes are all in effect the same system. They are graphs of compute that consume and produce data assets. Right? Within every ML pipeline, there's an ETL pipeline. There's just one additional step that produces a model.

And the other thing, and this also comes from the origin story, is that I really view this domain as being in a similar spot to where front end engineering was about 10 years ago. If you talk to anyone in data today, they'll say something like, I spend 10 to 20% of my time actually doing my job and 80 to 90% of my time data cleaning. When I kept on hearing this from people, it started giving me these flashbacks to talking to front end engineers in, say, 2010 within Facebook, and they would say, like, I spend 80 to 90% of my time fighting the browser and 10 to 20% of my time building my app. And, you know, React really changed the world on that front. One of the things it did is it no longer thought of the front end as a sequence of scripts that are stitched together, that you touch once and never touch again. It's, like, hey, these are now really complicated pieces of software. We need a framework that respects the problem and the discipline and lives up to the inherent complexity that's in those apps. I think that data is in a similar spot, where these are no longer just, like, 5 scripts that are stitched together in a DAG that you have to run once a day. Like, those exist, but in reality, we are in a much more complicated world.

What were hitherto known as ETL pipelines are much more intermixed with the business logic of your application. Often you're doing transformations that stream data back into the app, and there's kind of a reflexive relationship between the data pipelining and the core behavior of your front end application. So these are just far more complicated things now, and I think you need to think of them as applications. Meaning, they're alive all the time. They have uptimes. There are complicated relationships within them. You have to think about it not just as a one-off script; you have to respect the problem, write tests for it, and really start to model these things in a more robust way that's amenable to human inspection, human authoring, and tooling. And so, you know, that's kind of why we're referring to those things as data applications,

because data applications are multidisciplinary

application

world

is partially caused

by the siloing

of terminology,

actually. But they're all collaborating on the same activity. And I would also argue that application engineers are starting to bleed into the data engineering life cycle as well, where previously an ETL engineer or data engineer would be responsible for pulling information from the system of record that the application uses. But as we get to more real-time needs and the requirement of incorporating data as it's being generated, the application engineer needs to be aware of the overall systems that process that data downstream, particularly with the introduction of systems such as Kafka as the centralized system of record for the entire application ecosystem, both for the end user applications and for the data applications. And so I think it makes sense to have this unified programming framework that everybody can understand and work together on, rather than having them be componentized and monolithic and fully vertically integrated.

I couldn't agree more. This is why I mentioned earlier your interview with Chris Bergh about DataOps. You're kind of, in different words, describing the DataOps vision. If the analogy is DevOps, there used to be this siloing between developers and operations, and now developers are responsible for operations to some degree, in that there's a programming model where they can program the ops. And I think we need to move to a similar world here, where you can have self-contained teams that are responsible for building the app, deploying it, and also integrating it with your data applications internally, because the people who wrote the apps know the most about the domain of their data. We shouldn't be living in a world where an application developer can wake up, willy-nilly change their data model, and break everyone else without being responsible for that.

So can you take a bit of time now to talk about how Dagster itself is architected and some of the ways that it's evolved since you first began working on it?

Totally.

So, you know, if you look at Dagster, I think someone once told me, oh, this looks like fancy Luigi. At first blush, it definitely looks like a fairly traditional ETL framework. I think what distinguishes it, and how it's architected, is that we are very focused on allowing the developer to express what the data application is doing rather than just how it is executed. If you look at something like Airflow, the primary abstractions there are operators, from which you create tasks, and then you build a dependency graph. Right? And if you open up that UI, all you see is a series of nodes and edges. Those nodes have a single string that describes what each one is, there are edges between them, and that's all the information that you have. The goal of the system is to orchestrate and ensure that those computations complete, that you can retry them, and things of that nature.

Dagster's primary focus, although we do some of that execution, as we'll get into, is enabling the developer to express at a higher level of abstraction what those computations are doing. There's a type system that comes with Dagster, and every single solid is a function. We say that every single node in the graph is actually a function that consumes something and produces something, and you should be able to express that. You should also be able to overlay types on top of that, so that you can do some data checking as things enter and exit the nodes, as well as express to tooling exactly what is going on with this thing. These things can also express how they get configured; we have strongly typed configuration tools. And then, as the computations proceed, they actually inform the enclosing runtime about what's been happening: hey, I produced this output; hey, I created a materialization that outlives the scope of the computation; I just passed this data quality test, etcetera. So our focus is much more on what you could call the application layer for data management, and that is the primary focus of our programming model.

And since Dagster itself is focused more on the programming model and isn't vertically integrated, as I mentioned before, as opposed to tools such as Airflow that people might be familiar with, I'm curious how that changes the overall workflow and interaction of the end users of the system, and what your reasoning is for decoupling the programming layer from the actual execution context.

Yeah. So, you know, the world of infrastructure is changing a lot. Let's just talk about Airflow specifically. Right? Airflow is a very vertically integrated system. It has a UI, it has an execution and cluster management aspect to it, and it also has this user-facing API, such as it is. And because it's not layered as much, they haven't been able to move as quickly. For example, Airflow still doesn't really have a coherent API layer such that you could move quickly on the front end of that system in a decoupled way. But I think what's more interesting is that the world of infrastructure is just changing a lot.

And, just to go back to the previous comment, Dagster's primary concern is what the data applications are doing rather than exactly how they're doing it. The how of what these systems do is going to change a ton over time. I think there are going to be lots of different physical orchestration engines as new cluster computing primitives come along. Just for example, there's Dask, which you can use for cluster management if you want to stay Python native, and obviously people are really interested in computational workloads on Kubernetes, but I don't think Kubernetes will be the end-all of compute infrastructure for all time. So I just think that world is moving very, very quickly, and you also want to be able to use a new software abstraction on existing legacy infrastructure.

Right? So this in some ways comes from my experience working with GraphQL. One of the things I was really pleasantly surprised by after open sourcing GraphQL was just how effectively it penetrated legacy enterprises. The reason why is that GraphQL is a pure software abstraction that you can overlay on any programming language, any runtime, any storage engine, any ORM. That meant that if your front end people wanted to use GraphQL but overlay it on top of some legacy IBM WebSphere something or other, you could actually have someone write a GraphQL server that interacts with that thing. And that was an extraordinarily powerful operating modality for an abstraction to really have a lot of impact, not just amongst the cognoscenti building greenfield apps but on an industry-wide scale.

So we approach this along the same dimension, as what we like to call a horizontal community platform, meaning that, yes, these are just DAGs of functions, and by functions I mean a coarse-grained computation, like a Spark job, a data warehouse job, or any sort of legacy process that you have in your system.
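To make "DAGs of functions" concrete, here is a minimal sketch in plain Python. It does not use Dagster; the node functions `extract`, `transform`, and `load` are hypothetical stand-ins for coarse-grained jobs, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) is assumed as the ordering mechanism.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


# Hypothetical stand-ins for coarse-grained jobs (a Spark job, a warehouse query, ...).
def extract() -> List[int]:
    return [1, 2, 3]


def transform(extract: List[int]) -> List[int]:
    return [value * 10 for value in extract]


def load(transform: List[int]) -> int:
    return sum(transform)


# Each node maps to the upstream nodes whose outputs it consumes.
FUNCTIONS: Dict[str, Callable] = {"extract": extract, "transform": transform, "load": load}
DEPENDENCIES: Dict[str, List[str]] = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}


def run_pipeline() -> Dict[str, object]:
    """Run every node after its dependencies and collect the outputs by name."""
    results: Dict[str, object] = {}
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        inputs = {dep: results[dep] for dep in DEPENDENCIES[name]}
        results[name] = FUNCTIONS[name](**inputs)
    return results
```

The loop inside `run_pipeline` is the only part that knows how nodes are executed. The function definitions and the dependency map describe what the pipeline does and could be handed to a different runner unchanged.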

You should be able to orchestrate those computations on arbitrary compute based on your needs and your requirements. But regardless of what's actually doing the compute and what's physically doing the orchestration, there's still a ton of commonality between all those things, and that's the what part of what Dagster is describing, meaning it has types, metadata, etcetera, and we have common tools that can operate over all of that. And you can actually see this trend of moving away from and unbundling vertically integrated stacks across a few domains of computing, all the way from content management systems to other systems, and I think this is part of that.

Yeah. I think having these different composable layers provides a lot more longevity for each of those layers independently. Because as you said, today, you might want to use Airflow as your actual execution context. Tomorrow, it might evolve to Spark because your scaling needs have evolved. Or maybe there's some new framework coming out that you want to be able to leverage, but you don't wanna have to rewrite all of your computation just because you're running on a different execution engine.

Yeah. And if nothing else, the other thing that I really noted coming at this industry fresh is just how heterogeneous and fractured it is. Typically you're crossing 3 or 4 technology boundaries when dealing with one of these things. And in a legacy organization where things are complicated, or maybe they've even done acquisitions and stuff, the data infrastructure heterogeneity is absolutely mind boggling. So having this single opinionated layer, where all it does is describe what's going on, in a way that can integrate with both legacy computational engines and legacy infrastructure, we think is really powerful.

And just quickly, I'm curious to what your evaluation

process was to determine

that Python was the right implementation target for Dagster and what other

language runtimes or frameworks you might have considered in the process.

So, when it comes to languages, I'm a pragmatist. And for these types of systems, where you want a wide variety of personas interacting with it successfully but still being able to build so-called real software in the data domain, I don't think there's any choice but to use Python. Python has a lot of good things going for it. One, just about everyone in data is accustomed to it. It's highly expressive, so for these kinds of metadata and metaprogramming frameworks it's extremely useful. Python also, and I think this is the reason why it's been successful in this domain, has a vast dynamic range as a programming language. Meaning that you can grab anyone who, say, is proficient enough to do something complicated in Excel, put them in a Jupyter notebook, and they can do meaningful work, but you can also build Instagram on top of Python, and that's kind of Python's superpower.

So what are the other choices in the data domain? A lot of Scala is in vogue. Scala does not have that dynamic range. You cannot plop an Excel user into a Scala program and expect them to be successful. And what other languages should I have considered? Actually, it's one of those things where I don't think I even really considered another language, because to me the choice was so obvious.

And then going into Dagster, I'm curious, what were some of the main assumptions that you had, and have those been challenged or updated as you have put Dagster in front of more end users?

Yeah. That's a great question. And, actually, it's kind of difficult to go back in time and reconstruct exactly what my thinking, and then the team's thinking, has been at every step along the way. Most recently, and I still think this is the correct architectural decision, but initially we were very focused on the use case of, hey, I'm a team, I have an Airflow cluster, and I want a higher-level programming model on top of Airflow, such that people on my team are not manually constructing Airflow DAGs; they're programmatically generating those DAGs from some other API, in our case Dagster. It was a pattern we saw over and over again. Typically, most Airflow shops of sufficient complexity have built their own layer on top of Airflow that, for whatever reason specific to their domain or their context, programmatically generates those Airflow DAGs.

So we were really focused on the incremental adoption case. But a lot of our early users came to us and said, hey, I think it's really cool that you have this Airflow integration, and it actually kind of proves that the system is interesting and generic and that we won't be locked into anything, but for right now I just really like your front end tools, and I just wanna be able to build my greenfield app on top of this in a one-click, one-stop-shop sort of way. And that's actually what we've been working on for the last couple of months: coming up with a Dagster-native, vertically integrated instantiation of a Dagster system that has a scheduler and a lightweight execution engine, along with some DevOps tools. We have a library called dagster-aws. You run dagster-aws init, and it spins up a node in AWS for you, spins up an RDS database, and you can go from hello world to scheduling a job in about 2 minutes, with a beautiful hosted web UI to monitor and productionize your apps.

So, you know, we started out with this horizontal, integrate-with-everything but don't be super opinionated approach, and moved to one where we do have one instantiation of an opinion, which is this out-of-the-box solution, but the architecture is still there to integrate it with other systems and other execution contexts. So I think we've changed our initial target market. The other thing, related to that, is that this started out as a much more vanilla ETL framework, and the insight that allowed it to eventually target different execution engines has definitely been an evolution. So that thinking has definitely changed along the way, but I would have to think about other things in order to answer that question more fully.

And then for somebody who wants to extend Dagster and either integrate it with other systems that they're running, add new capabilities to it, or implement their own scheduler logic, what are the different extension and integration points that Dagster exposes?

Sure. So we can go through those one by one. For example, say you want to use a new compute engine. Let's say you're using Spark, but you really wanna experiment with a new, not Spark successor, but similar system that does distributed computation, called Ray, for example. It's like, okay, I wanna be able to use Ray within Dagster. Well, all you would do is write one of these what we call solids that generically wraps kicking off a Ray job. All you need to do is look at the way we integrate with Spark and data warehouses today and use those as patterns to build your own solids, and you're off to the races. So wrapping existing computational frameworks is relatively straightforward, and you can cargo cult that from our open source repo.

Another example you had was, say, I want to be able to use my own scheduler for whatever I want. Well, Dagster is fully built on a GraphQL API, so the system is very pluggable. It would actually be very straightforward to implement your own scheduling logic, because all you would need to do is, based on some schedule, execute a GraphQL mutation against our hosted installation, or your hosted installation, and you'd be able to enqueue jobs to be run.

If you wanted to execute this on a new orchestration engine, we also have a pluggable API for that, and all those examples are also checked into our open source repo. Right now we have integrations with Dask and with Airflow, where effectively we've written code that allows you to take a Dagster representation of a pipeline and compile that into either an Airflow DAG or a Dask DAG. If you wanted to use another execution engine, you would just mimic that process.
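The "compile a pipeline into another engine's DAG" idea can be sketched in plain Python. This is only an illustration and does not use Dagster's API: the dict-based pipeline representation, the `compile_to_task_specs` name, and the output format are all assumptions made for the example.

```python
from graphlib import TopologicalSorter


def compile_to_task_specs(pipeline):
    """Turn {solid_name: [upstream_names]} into an ordered list of task specs.

    The output format is hypothetical; a real target engine would receive its
    own task objects here instead of plain dicts.
    """
    # TopologicalSorter expects each node mapped to its predecessors, which is
    # exactly how the pipeline dict is laid out.
    order = TopologicalSorter(pipeline).static_order()
    return [
        {"task_id": name, "upstream": sorted(pipeline[name])}
        for name in order
    ]


# A three-step pipeline expressed as "solid -> the solids it depends on".
pipeline = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}
specs = compile_to_task_specs(pipeline)
```

Each entry in `specs` carries a task id and its upstream ids, which is roughly the information any scheduler needs to rebuild the same graph in its own terms.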

So the system is designed for pluggability through and through.

And another component of the environment that somebody might want to integrate Dagster with is their metadata engine, to keep track of data provenance and identify what transformations are happening. I'm curious what would be required for somebody to extract all of the task metadata to integrate into that system?

Yeah, so that's a great question. The system is definitely designed with that in mind, meaning that whenever you execute a solid, you know what the input and output types are. But in addition to that, those solids can also communicate that, hey, I have created this thing we call a materialization that will outlive the scope of the computation. You can subscribe to those events via a GraphQL subscription, or you can just consume them with our Python API.

What that allows is that a tool consuming that stream of events has an enormous amount of context about what's going on. It knows when the thing was executed. It might know what container it was executed in. It knows what configuration file was used, meaning the Dagster configuration file that was used to kick it off. And then it gets runtime information about the materialization, and it's a totally user-pluggable, structured metadata system. So definitely on our roadmap is for us to build our own metastore on top of this, but it's meant to be very pluggable: you could just write a generic facility which consumes these events, and every time a materialization is consumed, you would be able to persist in a metadata store enough state to have full lineage and provenance on that produced materialization.
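A generic facility of that kind might look roughly like the following plain-Python sketch. It does not use Dagster's event API: the event dictionaries, their field names (`event_type`, `run_id`, `solid`, `path`, `config_file`), and the dict standing in for a metadata store are all hypothetical.

```python
def record_materializations(events, store):
    """Persist lineage for every materialization event into `store` (a dict)."""
    for event in events:
        if event.get("event_type") != "materialization":
            continue  # ignore step starts, logs, and other event kinds
        # Key the record by the asset's location so lookups start from the data.
        store[event["path"]] = {
            "run_id": event["run_id"],
            "solid": event["solid"],
            "timestamp": event["timestamp"],
            "config_file": event.get("config_file"),
        }
    return store


# A made-up event stream: one bookkeeping event and one materialization.
events = [
    {"event_type": "step_start", "run_id": "r1", "solid": "build_table"},
    {
        "event_type": "materialization",
        "run_id": "r1",
        "solid": "build_table",
        "timestamp": "2019-11-01T00:00:00Z",
        "path": "s3://example-bucket/tables/users.parquet",
        "config_file": "prod.yaml",
    },
]
metastore = record_materializations(events, {})
```

Because each record carries the run, the producing solid, and the configuration used, a downstream tool can answer "which run and which computation produced this file" without re-running anything.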

So we don't have anything out of the box to support that right now, but it would actually be pretty straightforward to integrate that with an existing metastore, and we are just really excited about that direction. So if anyone wants to do that, please come talk to us, because we love working with people who like to build on top of this.

And for somebody who is interested in getting started with Dagster and writing their own data flow or data application, can you talk through the overall workflow for somebody to define all of the different computation points and integrate them, and then deploy it to production and make sure that the execution contexts are configured properly?

Yeah. So I'll just go through that quickly. What do you start with? Well, you pip install Dagster. It is just a Python module. To say hello world, you would have a Python file, and you would write what we call a pipeline, which is one of these DAGs, and then a solid, which is effectively a function that defines a computation. So you write a function. That function is a total black box: you can invoke pandas, a data warehouse job, a PySpark job, whatever you want. Then we have this kind of elegant DSL for stringing those solids together into a DAG.

Once you do that, you can launch and debug it in a unit testing environment, obviously, but also using our development tool called Dagit. So just locally on your machine, without deploying anything, you can run Dagit, visualize the DAG, inspect it, and configure an execution of it. We have this beautiful autocompleting config type system. You can then execute that locally and verify that things happen. The fact that we've architected the system to be executable in different contexts means that it's also executable on your local machine for testability and whatnot. And we also have abstractions that help users isolate their environment from their business logic, because this is just critical for getting testing going. Okay. So now you have that working.

As for deploying it, with our new release you can actually deploy that in a very straightforward fashion using the DevOps tools that come with this. Once you have that pipeline written, you would effectively type dagster-aws init, and it would provision an instance and install the required requirements (you need to have a requirements.txt locally), and then you're up and running in the AWS environment in your VPC. We also have a Python API for defining schedules, which is just a light wrapper around cron. So you can go from writing this to also deploying it very quickly.

If you further want to customize it, we have this new abstraction that we call an instance, which you can think of as an installation. When you init that AWS environment or your local environment, you can say, hey, instead of just doing a single-threaded, single-process executor, we want to execute this thing on top of Dask, for example. So you would configure your instance, which is just a YAML file in a well-known spot, to use something like Dask instead of our native toy executor. So it's definitely pluggable on multiple dimensions.

And then one of the things that you've commented on, and that stands out about Dagster, is the concept of strong contracts that it enforces between the different solids or computation nodes. I'm wondering why you feel that those contracts specifically are necessary, and some of the benefits that they provide during the full life cycle of building and maintaining a data application?

So this is what struck me about a lot of these systems: the amount of implicit contracting that was in these systems and how frequently those contracts went unexpressed. Again, contrast this with Airflow. If you look at their documentation, they say that if you feel compelled to share data between your tasks, you should consider merging them. And then someone actually wrote a system to pass data between tasks, called XCom, but that is generally not used that much, and I believe even the creators of it have said that it was kind of a swing and a miss.

But the thing is that Airflow tasks are passing data between one another implicitly. If you have task A which comes before task B, presumably the dependency exists only because A has changed the state of the world in such a way that it needs to happen before B. And if you want that to be testable, it has to have parameters and you have to pass data between them. To me this wasn't some massive realization. I think everyone understands that there are data dependencies between these things. It's just a question of whether you express them or not in the system.

I think it is critically important to express them for any number of reasons, from human understandability, meaning you can actually inspect this thing in a tool and understand what the computation is doing, to ensuring, or guiding your users, to write these things in a testable manner. Because if you can't pass data between tasks, there's no way you can test those tasks. I just think that all these data applications are DAGs of functions that produce and consume data assets, and that they should be testable. You should be able to execute arbitrary subsets of them, and in order to do that you need them to be parameterized, and some of the parameters need to come from outputs of the previous task, which means they need to be functions.

And then there are also really interesting operational properties that come out of expressing your data dependencies. It's a fundamental layer on which you could build, say, fully incremental computation, and have the system understand how it should memoize produced data, among other aspects. Max of Airflow fame, Maxime Beauchemin, who's also been on your show, has written a couple of blog posts about this, about so-called functional data engineering, which have been influential in my thinking. So I just think it's the right way to build these systems on any number of dimensions, and I think you can get a lot of value by expressing those data dependencies, those parameters, in your computational graph.
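The point about parameterized tasks and executing arbitrary subsets can be made concrete with a small plain-Python sketch. It is not Dagster's API: the `DAG` dict, the `run` helper, and the three toy functions are invented for illustration.

```python
def extract():
    return [3, 1, 2]


def transform(rows):
    return sorted(rows)


def load(rows):
    return {"row_count": len(rows), "first": rows[0]}


# Each node declares its function and the upstream nodes whose outputs it takes.
DAG = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run(dag, targets, provided=None):
    """Compute only `targets`; upstream values come from the DAG or `provided`."""
    results = dict(provided or {})

    def compute(name):
        if name not in results:
            fn, upstream = dag[name]
            results[name] = fn(*[compute(dep) for dep in upstream])
        return results[name]

    return {target: compute(target) for target in targets}


full_run = run(DAG, ["load"])
# Test `transform` in isolation by handing it a fake upstream output.
subset_run = run(DAG, ["transform"], provided={"extract": [9, 8]})
```

Because every dependency is an explicit function argument, `transform` can be exercised without running `extract` at all, which is what makes the subset execution and the unit test possible.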

And you mentioned testing a few times in there, and I've got a couple of questions along those lines. One is how Dagster itself facilitates the overall process of testing, some of the challenges that exist for testing data applications, and how you approach it. But I'm also curious how you approach defining the type system for Dagster to encapsulate some of the complex elements that you need to pass between things, such as database connections, to identify that there was some change in a record set, or an S3 connection, to capture the fact that some Parquet files were dumped into a particular bucket.

Okay. Well, I feel like those are 2 questions we could fill up an entire podcast on, but I will do my best.

So you asked me how Dagster approaches testing, and this is a huge and important subject. Not how Dagster does it, but testing in the data domain in general. Everyone acknowledges that it's really difficult, and it's one of the things that I really noted when I was first learning about this. In terms of things that make this domain different from application programming: the same developer operating within a traditional application would be writing lots of unit tests. You take that same human being, move them to writing one of these systems, and all of a sudden they're not writing tests anymore, because it's a fundamentally different domain and it's harder to do testing in. So when I think about testing here, I think about 3 different layers. Let's go through them. One is unit testing, another is integration testing, and then what we'll call pipeline or production testing, and each of them has its own issues in this environment.

So unit testing this stuff is hard, and a big reason why is that these systems typically have dependencies on external pieces of infrastructure which are effectively impossible, or very difficult, to mock out. This is one of the reasons why, in Dagster, we flow a context object throughout your entire computation. The goal is that anywhere you would otherwise just grab some global resource, like a database connection with hard-coded values, or a Spark context, or whatever, you instead attach those same exact objects to our context object. That allows the runtime to control the creation of that context and therefore, with the same API, to control the environment that the user is operating within. What I mean by that is, instead of calling some global function like get_conn, as in get connection, you would instead say context.resources.connection. What that allows you to do is, based on how you configure your computation in a specific instantiation, swap in a different version of that connection, so that you can test this stuff in a unit testing context without changing the business logic.

But the thing about the data domain is that you can't capture as much in unit testing as you probably could in application development, because of this external dependency stuff. You can still do a lot in the unit testing environment, though. You can make sure that a refactoring worked, for example if you renamed a function, or test that your configurations are actually being parsed correctly. There are all sorts of changes that can be covered there, and I think it's a critical part of the process in the CI/CD pipeline. Okay. Next, integration testing.

So this is more like, hey, we can't mock out our Spark cluster, because mocking out a Spark cluster would be an entire company; that's an enormously complicated piece of software itself. But what we do want to be able to do is easily parameterize the computation so that, in the integration test environment, we actually spin up a very tiny, test-only Spark cluster and ensure that we can run on a subsample of the data and still have something happen in a verifiable way. For that integration testing layer, what I'll emphasize is our built-in configuration system. In order to make these pipelines testable, they typically become extraordinarily complex functions with tons of parameters that configure both how they're interacting with their environment and where they're getting their data from. So we fully embrace that, and we built this configuration management layer that makes it easier to manage complex configurations. One of the goals of that is to enable better integration testing, so that in your CI/CD pipeline or locally you can have different instantiations of config that will do full or partial integration tests of your pipeline. That way you can slice and dice your pipeline into whatever subset you want and then execute it within different environments.

And then the last component of testing in these data pipelines, which we also have full support for within Dagster, is the notion of pipeline or production tests. My thinking in this area is deeply influenced by Abe Gong, the creator of Great Expectations, who uses the term pipeline test to describe this. What it says is that one of the differences between this and traditional programming is that in data pipelines or applications, you do not have control over your input. Instead, you're typically ingesting data from a data source you don't control, and that means that if the data changes, some assumption about it changes. Think about it: you look at some CSV file, and you write some code that parses it. There are a bunch of implicit assumptions in the code that you wrote in order to load it correctly. If, in the next day's data dump, some assumption about that has changed, and it's not part of a formal contract, your code will break.

So what expectations, or data quality tests, are is the notion that, instead of having these implicit contracts between your incoming data and your computation, which only end up getting expressed when things break, why don't you front-load that and say, hey, in order for this computation to work, the incoming data has to conform to this data quality test. Meaning, I expect the third column to be named Foo, and for at minimum 1% of them to be null, and the rest of them to be integers greater than 0. And because you cannot control the ingest, the only way you can test that and know that it happened is at production time.
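A data quality test along the lines of the Foo example could be sketched as below. This is plain Python rather than the Great Expectations or Dagster API, the function and parameter names are hypothetical, and the null condition is read here as a cap on the share of nulls, which is an interpretation of the spoken example rather than something the speaker spells out.

```python
def check_foo_column(header, rows, max_null_fraction=0.01):
    """Return (passed, problems) for the third column of a parsed table.

    Assumes every row has at least three cells; `max_null_fraction` is an
    assumed reading of the "1% null" condition as an upper bound.
    """
    if len(header) < 3 or header[2] != "Foo":
        return False, ["third column must be named Foo"]

    problems = []
    values = [row[2] for row in rows]
    nulls = sum(1 for value in values if value is None)
    if values and nulls / len(values) > max_null_fraction:
        problems.append(f"{nulls} of {len(values)} values are null")

    # Every non-null value must be a positive integer.
    bad = [v for v in values if v is not None and not (isinstance(v, int) and v > 0)]
    if bad:
        problems.append(f"{len(bad)} values are not integers greater than 0")
    return not problems, problems


# 200 rows, one null (0.5%), the rest positive integers.
good_rows = [[i, "x", i + 1] for i in range(199)] + [[199, "x", None]]
good_result = check_foo_column(["id", "name", "Foo"], good_rows)
renamed_result = check_foo_column(["id", "name", "Bar"], good_rows)
negative_result = check_foo_column(["id", "name", "Foo"], [[0, "x", -5]])
```

Running a check like this on each incoming batch, before the computation consumes it, turns the implicit assumption into something that fails loudly with a reason.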

This is much more like a manufacturing process than a traditional software process, meaning that the data is the raw material. You're getting it shipped from some place, but before it goes into the machine you need to do tests on that raw material to ensure that it conforms to the requirements of the machine, or you might just end up breaking it. So through our abstractions, and with tooling built on those abstractions, we try to guide the developer and help them execute all 3 of those layers of testing, which are all necessary for a well-functioning system. I believe your other question was about our type system and the things that you pass between the solids, correct?

Correct.

All right. So the type system of Dagster is a good way to transition from those production tests. When we went into this, the question was: because of that property I was talking about, where you typically don't control the ingest, and because of the vast heterogeneity of systems that are used to process this data, what can the type system actually do in one of these things that claims to span programming languages and span different computational systems? We came to the conclusion that, for now, the simplest and most flexible thing for a type to say in Dagster is this: when you say, hey, I have a solid and it has an input of type Foo, all that type Foo says at its core is that you provided a function, and when a value is about to be passed into that solid, it needs to pass this test. So the core capability of a type in the Dagster system is just a function that takes an in-memory value, does an arbitrary computation, and then returns true or false, plus some metadata about what happened. So it's a totally dynamic, flexible, and gradual typing system that allows users to customize their own types and do whatever they need to do in order to pass that type check.
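The "a type is just a check function that returns true or false plus metadata" idea can be sketched in a few lines of plain Python. None of these names come from Dagster: `TypeCheck`, `CheckedType`, `NonEmptyRows`, and `call_solid` are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class TypeCheck:
    """Outcome of a check: pass/fail plus free-form metadata."""
    success: bool
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class CheckedType:
    """A named type whose only behavior is a user-supplied check function."""
    name: str
    check: Callable[[Any], TypeCheck]


def non_empty_rows(value):
    is_list = isinstance(value, list)
    return TypeCheck(
        success=is_list and len(value) > 0,
        metadata={"row_count": len(value) if is_list else None},
    )


NonEmptyRows = CheckedType("NonEmptyRows", non_empty_rows)


def call_solid(fn, input_type, value):
    """Run the type check just before the value is handed to the computation."""
    result = input_type.check(value)
    if not result.success:
        raise TypeError(f"value failed {input_type.name} check: {result.metadata}")
    return fn(value)


row_count = call_solid(len, NonEmptyRows, [{"id": 1}, {"id": 2}])
```

Since the check is an arbitrary function over an in-memory value, the same mechanism works whether the value is a list, a DataFrame, or a handle to a remote table, and the metadata it returns can be surfaced by tooling.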

The type system and the inputs and outputs are all about the data that is flowing through the system. The other things you mentioned, though, were database connections, S3 connections, things of that nature. Those we model on a different axis or dimension of the system that we call resources. I was mentioning that context object that we flow through the entire system. A database connection or S3 connection is something that you would attach to that context, and where the vision really goes here is that we want to have an entire ecosystem of those resources, so that people are thinking in terms of even higher levels of abstraction.

So let's take S3. What are people typically using S3 for? Well, maybe all you're doing is saying, hey, in previous parts of this computation I'm producing a file, and I just need to stash it somewhere and have it saved so that later down the line a solid can take it and do some further processing on it. If you think about that abstraction, I think we call it file cache. We have a file cache abstraction that comes with Dagster, and there's a local file system implementation of it and also an S3 implementation of it. So you can do that operation of stashing files somewhere in order to perform your business logic, but locally you can just say, hey, I'm operating in local mode, use the local version of that file cache resource, and then in production, where it's operating in a cluster, Kubernetes environment, you give me the S3 or GCS version of that same exact abstraction. So things like S3 connections, database connections, and the things stacked on top of them, we model as resources, because they're not business logic concerns, they're operational concerns.

So our goal is to have the type system and the data quality tests be about the data, the meaning of the data that's flowing through the system, and have the context and the resources aspect be about more environmental or operational concerns.
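The resource-swapping pattern described here, where business logic reaches for `context.resources.<something>` and the environment decides what is behind it, can be sketched with the standard library alone. This is not Dagster's resource API: the two cache classes, `make_context`, and `stash_report` are hypothetical, and an S3-backed cache would simply be a third class with the same `write` and `read` methods.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


class InMemoryFileCache:
    """Test double: keeps stashed files in a dict."""

    def __init__(self):
        self._files = {}

    def write(self, key, data):
        self._files[key] = data

    def read(self, key):
        return self._files[key]


class LocalFileCache:
    """Local-mode implementation: stashes files under a directory."""

    def __init__(self, root):
        self._root = Path(root)

    def write(self, key, data):
        (self._root / key).write_bytes(data)

    def read(self, key):
        return (self._root / key).read_bytes()


def make_context(file_cache):
    # Mirrors the `context.resources.<name>` access pattern from the talk.
    return SimpleNamespace(resources=SimpleNamespace(file_cache=file_cache))


def stash_report(context, rows):
    """Business logic: serialize rows and stash them; no storage details here."""
    payload = "\n".join(",".join(map(str, row)) for row in rows).encode("utf-8")
    context.resources.file_cache.write("report.csv", payload)
    return "report.csv"


# Unit-test style run against the in-memory cache.
test_context = make_context(InMemoryFileCache())
key = stash_report(test_context, [(1, "a"), (2, "b")])
in_memory_bytes = test_context.resources.file_cache.read(key)

# Same business logic, local file system behind the same interface.
with tempfile.TemporaryDirectory() as tmp:
    local_context = make_context(LocalFileCache(tmp))
    stash_report(local_context, [(1, "a"), (2, "b")])
    local_bytes = local_context.resources.file_cache.read("report.csv")
```

`stash_report` never changes between the two runs; only the object attached to the context does, which is the separation of business logic from operational concerns that the passage argues for.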

And on top of all your work on Dagster itself, you have created a company to be sort of the backstop for it, in the form of Elementl. I'm wondering how you're approaching the overall sustainability and governance of Dagster, what your path to sustainability and success for the business happens to be, and how they relate to each other.

Yeah, that's a great question, and I think about this stuff a lot. The open source governance stuff is actually top of mind for me, because about a year and a half ago Lee, Dan, and I, the GraphQL co-creators, spun GraphQL out of Facebook and started the GraphQL Foundation, which is now run in concert with the Linux Foundation, and that's been a really interesting experience. And then there's also been a lot of, for lack of a better term, hoopla around open source sustainability across many dimensions: should there be new licensing regimes, what's the relationship with the cloud vendors, what is proper governance?

So my belief is that I like to have a pretty clear wall of separation between an open source ecosystem and any commercial entities stacked on top of it or associated with it. So I deliberately chose the name Elementl to be different from Dagster, and my hope is that the relationship between Dagster and Elementl will be structurally similar to the relationship between Git and GitHub. Meaning that Dagster will be an open source project that will forever and always be free. It's not the type of thing where we just flip feature flags and have enterprise features for it. It'll be a self-contained, governable open source project with very well-defined properties and very well-defined boundaries, such that in the future we can have a neutral governance model that will work well.

Simultaneously, though, we are also trying to build a sustainable and healthy business, and that's where Elementl comes in. The reason I like the GitHub and Git analogy is that GitHub is a product. They chose to make a bunch of it closed source. It's hosted, there's a login, you have users that do stuff, and people are happy paying for it. It leverages the success of Git, but it has its own product dynamics and whatnot. It's not just pure hosted Git. And then there's this cool dynamic where GitHub made it easier to host Git, which increased the adoption of Git and made Git more the obvious winner, which then increased the popularity of GitHub, so there's this reflexive relationship.

This is what we eventually want to do with Dagster and the product that will become elemental.cloud, or whatever we end up calling it. Elementl will eventually be a product that leverages the success of Dagster, meaning that if your team has adopted Dagster as a productivity tool, it will be natural, compelling, and in everyone's best interest to adopt Elementl as your data management tool that leverages the adoption of that abstraction. I think everyone's incentives are aligned if you do that well, and you can clearly communicate to your users that they're not going to be hoodwinked: if you're just using this as a pure productivity tool, that's totally fine with us, and godspeed. It's our job to build data management tooling that leverages that, such that the enterprises that employ the developers who use Dagster feel really good about having a commercial relationship.

Before you are comfortable cutting a 1.0 release?

Oh, the ever-present question of a 1.0 release. In terms of the future roadmap, I certainly think that you will see us, one, prioritize integrations with specific parts of the ecosystem based on user feedback. After this 0.6.0 release, I imagine that the tools will look more compelling to people, in which case the feedback will be like, hey, I understand that you have this Airflow integration. I'm really interested in using this other tool that I see my friend using, but I can't move my company off of Airflow in one shot. What can I use as a value add over Airflow? So we anticipate maturing our integrations with other technologies, but that will be based on user demand.

I think the other thing is that you'll see us building more and more tooling off these higher-order layers of the computation, being able to say, hey, I did this data quality test, I produced this materialization. You can name any number of things you can do based on that, like a metastore, anomaly detection, data quality dashboards, all sorts of other stuff. But I think for the next 1 to 2 months it's going to be more of a meat-and-potatoes type of time where, based on feedback, ergonomic issues, and operational issues that come up, we will be evolving the programming model and documentation and getting back to basics.

In terms of a 1.0 release, to me this is mostly about communicating expectations to the users: hey, this is an API we're going to stand behind for years, and really commit to backwards compatibility in a really, really serious way, and we're super confident that this is the base API layer for the future of the system. We still have a few iterations to go before that. We're not going to be breaking people willy-nilly on this stuff, but I suspect that, based on user feedback and how this stuff gets used organically, we will be changing some APIs and maybe even taking the system in different directions. So to me the whole 1.0 question is mostly about external communication and about expectations for future users, and it's more of a qualitative judgment than anything else.

Are there any other aspects of Dagster, or your work at Elementl, or your thoughts on the space of data applications, that we didn't discuss yet that you'd like to cover before we close out the show?

I guess one thing I'll say is that I think most of these systems, and this goes back to the question of what's the deal with this new terminology, is that they overspecialize. There are lots of people who build ML experimentation frameworks, for example, that are totally and wholly separate from their data engineering practice, and all these things end up having to coexist within the same data app anyway. So I think a lot of these tools are overly specialized. One thing I'm really excited about in terms of tooling that we'll be able to build is that it will be very straightforward to build an ML experimentation framework over Dagster, because you can use an API to enqueue new jobs with different configuration parameters, which is what you need to do in order to, say, do a hyperparameter search or things of that nature. You should be able to use effectively just a lightly specialized tool over the same ecosystem to do ML experimentation, rather than use an entirely different domain of computation.
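Enqueuing the same pipeline under many configurations is essentially a loop over a parameter grid. The sketch below shows that shape in plain Python; it does not call Dagster's API, and `config_grid`, `enqueue_search`, the `fake_enqueue` stand-in, and the pipeline and parameter names are all hypothetical.

```python
import itertools


def config_grid(space):
    """Yield one config dict per combination of the listed parameter values."""
    keys = sorted(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def enqueue_search(enqueue, pipeline_name, space):
    """Enqueue one run per config; `enqueue` is whatever submits a job."""
    return [enqueue(pipeline_name, config) for config in config_grid(space)]


# Stand-in for a real job submission call; it just records what was asked for.
queued = []


def fake_enqueue(pipeline_name, config):
    queued.append((pipeline_name, config))
    return len(queued)  # pretend run id


run_ids = enqueue_search(
    fake_enqueue,
    "train_model",
    {"learning_rate": [0.1, 0.01], "max_depth": [2, 4, 8]},
)
```

In a real setup `fake_enqueue` would be replaced by a call to whatever API submits runs, and the pipeline itself would stay unchanged across all of the runs.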

So, yeah, we very deeply believe in this multidisciplinary aspect of it. One other integration that we didn't really talk about is that Dagster has a first-class integration with Papermill, which I believe you've done an episode on. What that system does is allow you to turn a notebook into a function, a Jupyter notebook into a coarse-grained function effectively, and then we in turn allow you to easily wrap that within a solid.

So I guess what I'd like to emphasize is the multidisciplinary aspect of this. It's a way for people to describe and package computations that are actually encoded in different systems, but express them in a similar way and wrap them in a similar metadata system. In the same vein, we have a prototype-quality integration with dbt as well, where you can have a dbt project which is authored by an analyst or a data engineer, wrap that as a solid, and then execute it within the context of one of our pipelines, and that solid will communicate, hey, this dbt invocation produced these 3 tables and these 2 views, etcetera, etcetera. So, yeah, I think we need this sort of unification layer, and that's what we're trying to do.

Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Yeah. I mean, I guess it's pretty self-serving, but it would be an issue if I was working on something and then thought the answer was totally different from that. From what I see, the gap in the ecosystem is somewhat Dagster-shaped, I'll say. Meaning that I don't think the gap is the next good cluster manager, or just the right drag-and-drop ETL framework that abstracts away the programmer or something. This is a software engineering discipline. So I guess I'll answer the question like this: the biggest gap is in tools that, instead of trying to abstract away the programmer, try to upskill people who don't consider themselves programmers to participate in the software engineering process, and to really treat these things seriously as applications and not as one-off scripts or something that you just want to drag and drop once and be done with. This is one of the reasons why I'm such a huge fan of dbt, because I think one of the things they've been able to do is take people who don't conceptualize themselves as software engineers, analysts essentially, and through a really nice product allow those analysts to participate in a more industrial-strength software engineering process. I think that direction is super exciting, and we're trying to do that and to enable those types of tools with Dagster.

Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Dagster. It's a tool that I've been keeping a close eye on for a while now, and I look forward to using it more heavily in my own work. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day.

Thanks, Tobias. Thanks for having me.

Listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
