Summary
In this episode of the Data Engineering Podcast we welcome back Nick Schrock, CTO and founder of Dagster Labs, to discuss the evolving landscape of data engineering in the age of AI. As AI begins to impact data platforms and the role of data engineers, Nick shares his insights on how it will ultimately enhance productivity and expand software engineering's scope. He delves into the current state of AI adoption, the importance of maintaining core data engineering principles, and the need for human oversight when leveraging AI tools effectively. Nick also introduces Dagster's new components feature, designed to modularize and standardize data transformation processes, making it easier for teams to collaborate and integrate AI into their workflows. Join in to explore the future of data engineering, the potential for AI to abstract away complexity, and the importance of open standards in preventing walled gardens in the tech industry.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
- Your host is Tobias Macey and today I'm interviewing Nick Schrock about lowering the barrier to entry for data platform consumers
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving your summary of the impact that the tidal wave of AI has had on data platforms and data teams?
- For anyone who hasn't heard of Dagster, can you give a quick summary of the project?
- What are the notable changes in the Dagster project in the past year?
- What are the ecosystem pressures that have shaped the ways that you think about the features and trajectory of Dagster as a project/product/community?
- In your recent release you introduced "components", which is a substantial change in how you enable teams to collaborate on data problems. What was the motivating factor in that work and how does it change the ways that organizations engage with their data?
- tension between being flexible and extensible vs. opinionated and constrained
- increased dependency on orchestration with LLM use cases
- reducing the barrier to contribution for data platform/pipelines
- bringing application engineers into the mix
- challenges of meeting users/teams where they are (languages, platform investments, etc.)
- What are the most interesting, innovative, or unexpected ways that you have seen teams applying the Components pattern?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the latest iterations of Dagster?
- When is Dagster the wrong choice?
- What do you have planned for the future of Dagster?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Dagster+ Episode
- Dagster Components Slide Deck
- The Rise Of Medium Code
- Lakehouse Architecture
- Iceberg
- Dagster Components
- Pydantic Models
- Kubernetes
- Dagster Pipes
- Ruby on Rails
- dbt
- Sling
- Fivetran
- Temporal
- MCP == Model Context Protocol
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Coresignal's multi source enriched cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Coresignal's self-service platform for free today.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds.
And with collaborative data contracts, engineers and business can finally agree on what done looks like so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings and less back and forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a 1,000 plus dollar custom mechanical keyboard.
Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week, which starts on June 9. Your host is Tobias Macey, and today I'm welcoming back Nick Schrock to talk about lowering the barrier to entry for data platform consumers and the impact that the current era of AI has had on data engineering. And so, Nick, for folks who haven't heard you before in your numerous past appearances, can you give a quick introduction?
[00:02:21] Nick Schrock:
Yeah. Tobias, thanks for having me back. It's always a pleasure. So my name is Nick Schrock. I'm the CTO and founder of Dagster Labs. We're the company behind Dagster, which is a data orchestration framework. And, yeah, kinda my background is I cut my teeth at Facebook engineering. And the project I was most known for prior to Dagster was GraphQL, which I initially created, and we open sourced it. Well, first, we built it inside Facebook, and then we open sourced it. And it became a broadly used technology. But then I moved on to data engineering and data platforms, and I've been working on Dagster for quite a while now.
[00:02:58] Tobias Macey:
And so we don't need to get too much into the usual flow because we've covered that in past episodes. But I think that given the current state of the industry and all of the hype around AI, I'm wondering if you can just start by giving your summary of the impact that the overall adoption and growth of AI and automation and agents has had on data platforms and data teams?
[00:03:26] Nick Schrock:
Yeah. That's an interesting question. I guess, just to set a framework for this, I think there are varying degrees of how people view AI. I like to joke around and call it, like, there's, like, AI boomers and AI doomers. And AI boomers are like, oh, we've seen this before, blah blah blah. And AI doomers are like, this is the end of all human labor. It's going to zero. We should just bomb the data centers, literally. And I would call myself squarely in the middle. I think it's gonna be incredibly disruptive. I think it's gonna be as important as, say, the transition from the industrial to the information age or the pre-industrial to industrial age.
But I think that it will be a massive productivity boost to lots of people, including engineers. And far from software engineers going away, I actually think it will expand the number of people writing software and will make them more leveraged. So, you know, in terms of its impact on software engineering in this industry, I'm very far from being a doomer. I think it will be a renaissance in software engineering, and it's super exciting, but it will fundamentally change the practice. In terms of its impact on the data platform space specifically, I think in reality, in the day to day lives of practitioners working in data platforms, it's kind of like there's been an earthquake.
There's a tsunami out there, but it hasn't really hit shore yet. And what I mean by that is that I think lots of people are using AI tools to write software in an accelerated manner. I think lots of people are starting to work on AI projects at their various organizations, you know, especially use cases involving structured data. And I think some of their tools outside of their code editor do have AI features, but I don't think it has fundamentally changed their world yet. So I think everyone's kind of waiting for it and, you know, figuring out how to adjust to this new future.
But I think we're at the beginning of the inflection point, so it's kind of this odd state for most people where their day to day hasn't changed that much, but they know it's going to in some time horizon. I don't know if that resonates with you. But
[00:05:58] Tobias Macey:
Yeah. It absolutely does. And I was actually recently giving a presentation for another company to their engineering team because they wanted to get my thoughts on the future of data engineering and the impact of AI. And I think that there's definitely a lot of new work to be done, but the fundamentals of the work don't change. There's definitely a lot of change in terms of the specifics of the tooling, but in principle, the job stays the same. The way that I distilled it for the presentation was that the role of data engineers is to turn raw information into knowledge and enable the business to make use of that knowledge to either make better products, power the applications that they run, make better decisions, etcetera.
And using data to power these different AI systems maybe takes a different shape because you're bringing in a vector index versus just a star schema, or maybe you're able to unlock more value out of the unstructured data that you've been storing since Hadoop hit the scene. But, ultimately, the core fundamentals of the job are the same, where you find data that can be used for something. You run it through some sort of transformation. You get it into a manner where you imbue context and semantics into the data beyond just the raw bits, and then you feed that into some downstream system, whether that's business intelligence or an LLM or a web dashboard or whatever might be the case. So the work is fundamentally the same. It's just the shape of it is evolving a little bit, and the speed is probably increasing. And I think there's another interesting aspect of it, where the way that I draw the distinction is that you're either building for AI, where you're using data to feed it into an LLM, or you're building with AI, where you're actually using the LLM to generate that transformation and help you iterate faster and find anomalies, etcetera.
[00:07:56] Nick Schrock:
Yeah. That's right. That makes sense. You know? An analogy I like to use is that when accountants first saw the spreadsheet, they were probably like, it's over. Right? And when people saw calculators, they were like, oh, no one's gonna need to know math anymore. That's fundamentally not true. You need to know the principles in order to evaluate and use the tools. I do think that, you know, data engineering especially is so critical, and evaluating correctness requires so much business context and localized context, that it will, again, fundamentally change the practice, but there will need to be a human who has deep understanding of technical systems and business context in order to make these things work.
[00:08:41] Tobias Macey:
I think too that's interesting because, with the injection of these LLMs and AIs and Copilots into things like software engineering, or even, like, Microsoft Copilot getting embedded into various office suites and Gemini trying to hook into all of the Google products, it highlights and accentuates the work that data engineers have been doing in the background for so long, because everybody's trying to use these AIs. And, ultimately, it works in some cases, but as soon as you try to go outside of the bounds of what the system was specifically built for, you hit trouble. For instance, if you're a software engineer and you're iterating with Copilot in your editor, and then all of a sudden you say, oh, I actually need to build something that touches on some other data system that isn't directly embedded in my application, all of a sudden you have a problem because the LLM doesn't know anything about it, and then you have to do your own little bit of data engineering to pull the information into that context to enable the AI to do what you want it to do. So it's just kind of bringing more of that work into everybody else's day to day that up until then was just, oh, hey. I need this data, so I'm gonna go throw it over to the data team and ask them to do it for me.
[00:09:53] Nick Schrock:
Yep. Makes sense.
[00:09:56] Tobias Macey:
And so digging now into what you're building at Dagster, again, we've talked about the fundamentals of it and some of its evolution in previous episodes. But for people who maybe listened to the last episode, which was, I think, maybe a year or so ago, I'm wondering if you can give a bit of an overview of what has changed, some of the new stresses that the evolution of the data ecosystem and the impact of AI have had on the ways that you think about data orchestration and the role of Dagster in the overall stack.
[00:10:29] Nick Schrock:
Yeah. So what's happened the last year, so I'll do it from a very, like, Dagster-centric way since that's the universe that I'm in. What we see is people doing much more advanced things with the orchestration system, wanting to get much deeper observability into their systems. And then also, you know, we see more and more centralized data platform teams and data engineers who are building frameworks for less technical stakeholders. And that fundamentally changes their job from building data pipelines directly to dynamically building data pipelines based on what some other stakeholder wants them to do. And that kinda, like, led to the product developments that we're doing today.
You know, I think the other thing that we've seen is that, in a predictable fashion, what I'll call the data hyperscalers, Snowflake and Databricks, are beginning to attempt to consolidate as much as possible and building tools in every single vertical. And so that is quite interesting. The counterforce to that is the full embrace of open table formats, which is a big trend, and sort of the standardization of the term lakehouse to describe data stacks that are built over these open table formats. So I think that's a huge megatrend as well that we're seeing. Iceberg is kind of similar to AI in that it's kind of this, like, tsunami that is coming. But, you know, adoption is still, I would call, modest. But it will come, and it's pretty exciting.
[00:12:21] Tobias Macey:
And in terms of Dagster itself, I know that in one of your recent releases, you introduced this concept of components, where you're focusing on trying to modularize and standardize the different elements of the transformation flow and allow people to have reusable and more quickly instantiated data assets based on particular concepts and guardrails, which is a fairly notable change to the way that the framework has worked up until now, where if you wanted to do anything, you needed to dig into some Python code and figure out how it all wires together. And I'm wondering, for teams who are using Dagster, either up until now or in particular people who are newly onboarding onto Dagster, how that changes the overall collaboration patterns for people who are consuming data or working with data. I'm thinking in terms of data analysts, analytics engineers, but also application engineers, and how that changes the work to be done for these data platform teams and people who are closer to the infrastructure and the technical details of the data pipelines.
[00:13:35] Nick Schrock:
Yeah. Lot to unpack there. You know, components has been in preview for a couple months, and we'll be releasing it to release candidate in July. So we've been working with select design partners to work on it. I guess I'll start with the trends we saw among both our own users and also data platform teams that were using other orchestrators. And we kinda, like, saw a bunch of patterns and converged on a single project, which we think addresses a bunch of the issues. I guess I mentioned it in the last question, but, like, tons of people are dynamically building data pipelines, meaning that they're not directly just authoring tasks. They're not just directly building operators in Airflow. They're not just writing the asset functions in Dagster. They are working at a higher level abstraction and rolling their own systems to programmatically generate those things based on higher level APIs that they present to their users. Okay. There's that. Many of them who are doing that are doing it with a config-driven or some sort of front end. Right? YAML, JSON, even, you know, persisting it in a database, you know, all sorts of stuff. With that generation, they also programmatically generate metadata and apply policy across their data platform.
And, you know, I think the other thing is that the data orchestrators are all introducing more concepts. So tasks, assets, we have asset checks. There's metadata concepts, sensors. You know, there's, like, a whole bunch of individualized abstractions. Usually, when someone's interacting with the orchestrator, they often are integrating with a specific technology, and they don't wanna think in terms of those lower level things. For example, Dagster, when it integrates with dbt, ingests the entire model graph and surfaces each model as a software-defined asset, which is kinda what makes our dbt integration best in the business. It's code that generates those things. It programmatically scrapes the model graph in dbt and code-generates stuff. And the job to be done for the orchestrator is, like, integrate with the project. And then lastly, so there's a bunch of stuff, I know, but lastly, they wanna be able to bring in more stakeholders with friendlier interfaces and ideally have AI-friendly codegen targets. So we saw all those trends, programmatic generation of pipelines, config-driven pipelines, the desire for a high level abstraction, AI-native codegen, and components is what came out of that, which is kind of the project we released. And I think the way to think about it generally is that it provides an integrated way with the framework to, in a principled way, programmatically generate definitions. Right? And the killer use case for that typically is a YAML front end that you can present to your stakeholders.
But I also wanna emphasize, there are lots of people who don't wanna program in YAML, and trust me, I understand completely. I like types and Turing-complete languages and all sorts of stuff. So it's not just YAML. It's also a lightweight Python API on top of that. But in effect, you kind of separate metadata from the underlying complicated code. And that metadata can be expressed in YAML, right, or in very lightweight Python. Right? In the end, it's like Pydantic models. So you can program against that if you prefer it. But in that way, we have a native way to programmatically build definitions in the framework. And for the Pythonistas out there, it makes it so you defer definition generation until after the Python import process is complete.
And I cannot emphasize how important that is to build reliable systems that dynamically generate these things. If you're using Airflow or Dagster today, when you programmatically generate the definitions, it's happening at Python import time. And if you're talking to databases or doing something computationally expensive, or if you wanna unit test that thing, you do not want that to happen, actually. So kind of the core thing here is that components is a composability abstraction that allows you to dynamically, and in a deferred way, load up definitions, and by definitions, I mean the structure of the data pipelines. But the killer use case, and I was kind of talking in highfalutin terms there, the killer use case that people understand, and what's meeting them where they are, is providing an integrated, tool-rich, self-documenting YAML DSL with a pluggable back end for your users.
And it really is a lovely interface between data platform engineers and their stakeholders.
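For illustration, the pattern Nick is describing, a schema-checked YAML front end whose definitions are built only when the framework asks for them, might look roughly like this minimal sketch. The names here are hypothetical and are not Dagster's actual components API; they just show the shape of the idea.

```python
# Minimal sketch of a YAML "front end" validated by a Pydantic model, with
# definition construction deferred until something explicitly asks for it
# (rather than running as a side effect of Python import).
# All names are hypothetical, not Dagster's actual components API.
import yaml
from pydantic import BaseModel


class IngestParams(BaseModel):
    """The schema behind the YAML that a stakeholder edits."""
    table: str
    source_url: str
    schedule: str = "@daily"


SPEC = """
type: my_platform.IngestComponent
attributes:
  table: raw.orders
  source_url: https://example.com/orders.csv
  schedule: "@hourly"
"""


def build_definitions(spec_text: str) -> IngestParams:
    # Called lazily by the platform, so validation (and anything expensive
    # behind it) never runs at import time and can be unit tested directly.
    attributes = yaml.safe_load(spec_text)["attributes"]
    return IngestParams(**attributes)  # fails fast with a clear schema error


if __name__ == "__main__":
    print(build_definitions(SPEC))
```

Because the stakeholder-facing metadata lives in a file with a declared schema, a CI job can validate every spec in milliseconds without importing any of the heavy pipeline code, which is the fast feedback loop Nick mentions below.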
[00:18:21] Tobias Macey:
So as you're describing that and the philosophy around it, it puts me in mind a lot of things like the separation of concerns that Kubernetes is focused on, where you have the infrastructure and compute layer that the DevOps and the infrastructure engineers are responsible for and then the API and the the user space layer that people who are building applications that they want to deploy are integrating with. And so it gives a shared infrastructure with a clear delineation between responsibilities for people to be able to build on top of, which has then enabled a massive ecosystem of other capabilities built across both of those dividing lines.
Another thing that comes to mind is the, the the very declarative infrastructure as code community that is built up around things like Terraform and Pulumi of you have cloud providers. All the cloud providers as you're deploying these resources have state that needs to be maintained and tracked. And, similarly, in data engineering, you're building complicated resources that interact with each other in dynamic ways that are all dependent on state that you need to be able to understand and maintain and operate across. So I think that what you're building with components brings a lot of those same ideas into the space of actually building these data pipelines where you can have that interface boundary between the platform team and the consumers thereof, in a similar way as what we've done in the kind of cloud native ecosystem.
[00:19:52] Nick Schrock:
I couldn't have said it better myself, Tobias. That was great. Our data engineer who runs our own platform, which is fairly substantial, actually, now that we're a more mature SaaS business, the way he expressed it, he's like, finally, I have a front end for the data platform, which I think is kind of what you were saying in a much more simplified form, where it's like, I manage all this state, but there's just this clear abstraction layer and there's a front end for it. And even when you're working by yourself, it's very useful to have this abstraction so you can kind of switch your brain when you're working on stuff. And then there's also, like, a very concrete advantage to slapping a bunch of metadata in, like, a separate spot that is either Python with, like, no dependencies or YAML, which can be loaded dynamically: you can, like, do syntax checks very, very quickly.
So it can speed up developer loops and CI a lot. If you're moving a lot of activity into, like, the YAML or metadata space, the feedback loop's super fast too. So there's kind of interesting product implications as well.
[00:21:02] Tobias Macey:
And then the topic that we, I think, have to touch on, because every time somebody says, oh, I've got this great abstraction layer, it's gonna make everything easier, you don't have to worry about it, which puts me in mind a lot of the various cycles of low code or no code tooling, is that everybody says, great, it'll be so much easier for you to build these things, I've worried about all the complex stuff for you. And that works well to begin with. And then you have to start doing things that are specific to your problem domain, addressing edge cases, and then you start bumping up against the capabilities of the system, and you end up having to just drop down to a lower and more complex level to be able to actually get the work done. And so I'm wondering how you've thought about that balance of making things very easy to use, opinionated, and constrained versus maintaining the flexibility and adaptability that's necessary for such a complex domain.
[00:21:55] Nick Schrock:
Yeah. And I think this comes from having a lot of experience. Where things go wrong is where people think they can eliminate more complexity than they actually can with a framework. And it imposes too many constraints on itself, and it's not sufficiently customizable. The reality is that every business is complicated and specific, and everyone is in their own context. And so you can't know the complexity of all those things. What you can do is provide tools and abstractions and infrastructure so that platform engineers and engineers in general can subdivide that complexity into consumable, understandable parts.
And then the other thing a system can do is provide cross-cutting complexity reduction that is domain neutral. So I think if you understand those two things and do that, you get this right balance of having a thing that actually reduces the essential complexity of a program as well as allows you to scale the program well. So, for example, in, like, components, we built all this tooling around the YAML front end. So there's, like, really nice error messages. You know, you can run it in CI. There's a CLI interface to it, all this stuff. That is just complexity reduction that cuts across all domains. Right? It's basically useful for anyone who's using that technology.
Then there's the other stuff about, like, how do you make it so that, you know, people can kind of take the complexity of their world and put it into a containable chunk. And that's why this is a Python-native system with porous borders between YAML and Python, where the data platform engineers are empowered to build custom components, right, that have a structured front end. They can package up this complexity, have it be self-documenting, have a nice YAML front end for it. But we're not pretending like the job of building that custom component is not going to be difficult.
And it's not difficult because we're making it difficult. It's difficult because it is difficult. There are just problems in your world that we cannot know and that are complicated. And you are a smart person, and we wanna get out of your way when you're doing that. But we want you to be able to capture that, like, complexity in a nice consumable chunk and present it to your stakeholders. So I think it's just about knowing what you can assume control over and support, and still providing that flexibility to the user of the framework so that it's adaptable to their own needs.
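As a concrete illustration of that containable chunk, a custom component written by a platform engineer might take a shape like the sketch below. The class and field names are invented for this example and are not Dagster's actual API; the point is the small, typed, self-documenting surface wrapped around the hard parts.

```python
# Hypothetical custom component: the platform engineer owns the messy,
# org-specific logic (auth, retries, cluster settings); stakeholders only
# ever see and edit the three fields on the params model.
from pydantic import BaseModel


class NotebookJobParams(BaseModel):
    """What a stakeholder fills out, in YAML or lightweight Python."""
    notebook_url: str
    cluster_size: str = "small"
    timeout_minutes: int = 60


class NotebookJobComponent:
    def __init__(self, params: NotebookJobParams):
        self.params = params

    def build_defs(self):
        # All the genuinely difficult, domain-specific work lives here,
        # written once by the platform team and hidden from consumers.
        raise NotImplementedError
```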
[00:24:50] Tobias Macey:
I think another interesting element of where we are in the industry, both data and otherwise, is that the introduction of generative AI and the capabilities that that engenders has brought the use of data more fully into the space of application engineering, where up until now, the application had its own data that it cared about and maintained, that data engineers would then extract and rip out of context and then have to rebuild that context for various business use cases. And now we've come full circle, where the data across the organization is now getting fed back into the application context via these LLMs and things like RAG systems and fine tuning and needing to be able to do things like manage semantic memory for the LLMs, etcetera.
And so that means that application engineers need to be more aware of the data that exists within the organization to be able to power those use cases and more empowered to be able to actually operate on that data to address the needs of the application in the ways that these LLMs are using that context. And I'm wondering how you're seeing that and the work that you're doing with components play out in terms of bringing application engineers more into the space of operating across organizational data and the interaction patterns that they have with data platform teams, data analysts, business stakeholders, etcetera.
[00:26:18] Nick Schrock:
One way I think about it is that in the AI era, and this was becoming more and more true as more data platform assets were being incorporated into production app logic, but this is just gonna supercharge it, the phrase is, like, data engineering is becoming software engineering, and at the same time, software engineering is also partially becoming data engineering. Because you need to do some data engineering on your application data in order to correctly feed it into things that feed back into your application. So I think there's two things in Dagster that help with that. One is that we have developed this protocol called Pipes, which allows you to invoke Dagster-native compute in external programming languages in a super lightweight way.
So we have Pipes clients for TypeScript, Rust, Java, and I think a couple other languages. I know some users have done, like, some in C#. But, effectively, that allows a user to write code in their native language, and then we provide lightweight APIs to stream metadata back to us. And we also launch that process. Well, we actually don't launch it. We, in a pluggable way, can inject context into that process so that they can get, like, what partition is being materialized or any other sort of config. So we kind of have a back end protocol, so you can write data processing logic that needs to be in the data platform in the language of your choice. And then second of all, components is sort of the front end for that, meaning that when you're in orchestrator land and you need to connect the business logic you wrote in some other programming language to where it needs to execute in the orchestrator and the metadata around that, you can use components in order to kind of set that up.
And the goal of components is so that someone who's sort of, like, external to the data platform can sort of wander in and do what they need to do without learning a complicated framework. They just kinda, like, you know, see where their teammate put a similar thing. Maybe they copy and paste the file, or they know the scaffolding command that scaffolds up the same thing, and then they can just, like, edit some configuration. There's a type ahead in the editor, and there's embedded documentation. They can verify it, and then they can go on their merry way. It's kinda like two-sided. We want Dagster to be a multilingual ecosystem, be able to have a Dagster-native experience while having a very lightweight mechanism for doing that in other programming languages, and then have a very easy way for a stakeholder to come in and incorporate and integrate their compute into the data platform.
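The external-process side of Pipes is deliberately tiny. A sketch of what an instrumented script can look like in Python follows; the same shape applies to the TypeScript, Rust, and Java clients Nick mentions. This is based on the dagster-pipes package, but treat the exact calls as approximate and check the current docs before relying on them.

```python
# External process instrumented with Dagster Pipes: the real business logic
# runs wherever it runs; this thin layer streams logs and structured
# metadata back to the orchestrator that launched it.
from dagster_pipes import open_dagster_pipes

with open_dagster_pipes() as context:
    # Context injected by the orchestrator, e.g. which partition to build.
    context.log.info(f"materializing partition {context.partition_key}")

    row_count = 42  # stand-in for the actual transformation work

    # Report results back; they show up as metadata on the asset in Dagster.
    context.report_asset_materialization(metadata={"row_count": row_count})
```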
[00:29:06] Tobias Macey:
One of the interesting things that I'm dealing with in my own usage of Dagster right now is that we have built up a set of pipelines, asset definitions, etcetera, that are running in production. They all do what we need them to do. But as the use of AI moves more into that application layer, the application engineers need to be able to operate on data and be able to fetch and transform data in their own work. And so I'm stuck with figuring out, okay, well, how do I onboard people into the data platform more easily? And so one of the objectives is to do what you said and turn the existing repo of Dagster code into more of a set of platform capabilities that maybe get published as Python packages, or what have you, or these components, so that the application engineers can actually write their asset logic next to the code that they care about in, you know, their Django app or whatever it might be, so that there are no repository boundaries that they have to cross to get their work done, and so that they don't have to have any boundaries in terms of a hand off to another teammate just to close the loop on the thing that they care about. And so I'm wondering how you're seeing this introduction of components and the constructs that you've built up in Dagster up till now enable situations like that, where you have that core capability of one team manages the data ingest, manages the definition of these assets, and then being able to hand off those assets to another team, particularly when you're in the case of, oh, I'm running on my laptop, and I need to be able to make sure that the pipeline does what I want it to do so that I can make sure that my feature works on this data the way that I presume, without having to replicate all of the data across multiple different environment boundaries.
[00:30:58] Nick Schrock:
Yeah. There's a lot in that. You know, I think at a very basic level, there's embedded documentation capabilities in Dagster where, you know, you can have long form descriptions that are Markdown formatted that then appear on kind of the home page for the asset in Dagster, and a team can use that to establish a norm. They're like, hey, if you visit this home page, there's, like, a little code snippet to know how to read it, or a link to the right tool to read it. Dagster itself doesn't kind of enforce where or how you store anything. That's one of the other, you know, it's an example of a problem where we're like, hey, the world is complicated. We'll provide some built in integrations for stuff, but in the end, probably, you have to control what's going on. You know, in terms of, I guess, you kinda spoke to two things, to repeat what you were saying. One is kinda like, I am an application engineer, and I wanna access the underlying dataset, like, literally the table in Snowflake or something. Right?
In that way, we are much more of just a nexus of metadata and documentation that you can point your users to. And it can make it very smooth for you, the platform engineer, to add information to that, because you can just, like, add stuff to your source, add stuff into your repo, and then it gets exposed in a very accessible tool. So that's cool. Then there's the notion of the application engineer actually interacting with the Dagster platform and adding stuff to it, maybe in a different repo. And right now, certainly, in that scenario that I talked about, the way that we would envision it is that even with components, they would still have to go into a repo and submit a PR and go through a process. What's on our road map, however, is in-app editing of these component YAML specifications, or the front end, if you will. And we did a hackathon where we prototyped it, and it's, like, super exciting, because in the end, from the user's perspective, it feels like in-app editing. But in the background, it's actually submitting a PR on your behalf and triggering CI.
You know? But from the user's perspective, it'll be, like, green, and then they can just hit save, pretty much, and it'll submit the change to the platform. And we think that is super exciting. The other thing that this enables users to do, and we might go in this direction as well, because people do it already, is they have config systems that they expose to users in their native repo, and then they set it up so that Dagster programmatically fetches those configs from elsewhere and dynamically constructs the pipelines. That's, like, another approach that I think is, like, more advanced, but other teams are doing it today.
And, likely, we will support that as well in the future.
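That fetch-configs-and-build pattern reduces to an asset factory. A rough sketch under stated assumptions follows: the config source and spec fields are invented for illustration, while the dagster decorators themselves are the library's public API.

```python
# Sketch: stakeholders maintain small config dicts in their own repo or
# service; the data platform fetches them and generates one asset apiece.
import dagster as dg


def make_asset(spec: dict) -> dg.AssetsDefinition:
    # Factory function so each generated asset captures its own spec.
    @dg.asset(name=spec["name"], group_name=spec.get("team", "default"))
    def _asset(context: dg.AssetExecutionContext) -> None:
        context.log.info(f"building {spec['name']} from {spec['source']}")

    return _asset


# In practice these would be fetched from another repo, bucket, or API.
specs = [
    {"name": "orders_clean", "team": "growth", "source": "s3://bucket/orders"},
    {"name": "daily_revenue", "team": "finance", "source": "warehouse.orders"},
]

defs = dg.Definitions(assets=[make_asset(spec) for spec in specs])
```

As Nick notes earlier, the caveat with doing this in a plain Python module is that it runs at import time; components exist partly to give this generation step a deferred, testable home.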
[00:34:03] Tobias Macey:
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
And then the other interesting work that you've done is, alongside the work on components, you've introduced this new dg CLI. You've introduced new ways of thinking about the structuring of your Dagster projects, and I'm wondering how that has changed the patterns that you've seen teams build with Dagster, and how that simplifies the work of bootstrapping new capabilities or new data assets within the overall platform implementation?
[00:35:11] Nick Schrock:
So I think what dg really provides, it's just shorthand for the CLI, is an opinionated project layout sort of inspired by the Ruby on Rails style. You scaffold things. You don't have to manually import them. You can enforce conventions because you can customize the scaffolding. So it creates the config file associated with the integration in the same exact spot in every single place where you instantiate it, and it has some schema prefilled or something. It sounds simple, but it actually really simplifies things. It just reduces decision fatigue a lot. You know, concretely for Dagster users, the user friction we wanted to solve with this as well is what I call the import circus, where, you know, you split all your stuff among a bunch of modules.
In order to construct the Dagster definitions object at the root of your project, you often have to do a bajillion imports, or else we have these facilities which kind of, like, dynamically load symbols from another Python module. And that was just a pain in the butt, and it made it really hard to reorganize a project. We instead just manage that for you. So with the new project layout, it's just way more elegant, both because there's less code importing stuff around and, very importantly, because it makes it dramatically easier to reorganize a project. Because, like, as you onboard new teams and stuff, you just wanna be able to move stuff around. And if you can make that seamless, that is great. It also makes it so that changes are far more localized, because only the file that moves gets changed, not, like, all the bajillion places that import it. That's like a trivial way of describing it, but it has a bunch of side effects. One of the hard things about building one of these projects is, like, how do I organize it? Do I subdivide the repo between teams?
Do I do it by the Dagster abstraction, or do I do it by some other dimension? Like, it's a multidimensional problem. What we found is that it's much, much more elegant, actually, to subdivide the project at the technology level, meaning that you have your dbt stuff here, your Sling stuff here, your Fivetran stuff here, your Kubernetes ad hoc jobs in this folder, because it allows you to localize all the technology specific complexity in a specific subfolder and reuse it there. And then the people who are dealing with other parts of the platform don't see it or think about it or anything. And, typically, when you're writing code in the data platform context, you're usually doing it in the context of a single technology.
Right? You're, like, going in and, like, I'm changing the dbt models or I'm changing this ingest. And the only time when the cross technology stuff matters a lot is when you're doing integration testing, reviewing it in the UI. So you organize the code by technology, but then you allow other cross cutting views in the UI or, like, in the output of the CLI tool. So I think that was an interesting insight. That is not obvious when you kind of first start building one of these platforms.
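The technology-subdivided layout Nick describes ends up looking something like this. The tree is illustrative only, not the exact structure the dg CLI scaffolds:

```
my_platform/
  defs/
    dbt/        # all dbt-specific complexity and wiring lives here
    sling/      # ingestion specs
    fivetran/
    k8s_jobs/   # ad hoc Kubernetes jobs
```

Each subfolder holds everything specific to that technology, so someone changing a dbt model never has to look at the ingestion folders, while the UI still presents the cross-cutting lineage view.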
[00:38:38] Tobias Macey:
And to that point of slicing along the technology boundaries, I think that also lines up fairly well with the data asset constructs, because any individual asset is largely going to be owned within a technology boundary. So a dbt model, a table that is generated from an Airbyte or a Fivetran ingest, the S3 object that gets generated from some external process. One of the other things that I've been building a lot around is the different resources and custom IO managers, and I'm wondering how that factors into the ways that dg and the project scaffolding think about the breakdown of the system, where maybe I have an IO manager specifically for handling file based objects in either object storage or local disk, or I have a resource that has a base module that is an OAuth client, but then different implementations of that for different APIs, and just how that gets used across these different submodules within the dg project scaffolds.
[00:39:48] Nick Schrock:
Yeah. Right now, if you don't use the project layout, or you don't use these kind of more advanced APIs, which are a bit hidden, you generally end up with a global dictionary of resources at the top of your project, and we wanted to get rid of that. So what dg allows you to do, which aligns very much with this organize-by-technology thing, is place the resources that are, like, relevant to the other things in that directory right next to them. And then you have one spot in your project that's like, okay, here's all the stuff that deals with this technology. And that's been, like, a big cognitive load reduction as well. But because it's hierarchical, it also allows you to put a resource at the right spot in the hierarchy. Because maybe you have some advanced resource that, like, talks to two technologies or something. Well, you put it at the parent folder, because it makes sense for it to be there.
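To ground that with the OAuth example from the question, the base client can be declared once and subclassed per API, living beside the folder that uses it. ConfigurableResource is Dagster's public resource base class, but the file placement and class names here are just one hypothetical arrangement:

```python
# defs/rest_apis/resources.py: a resource declared beside the technology
# that needs it, rather than in one global dictionary at the project root.
import dagster as dg


class OAuthAPIResource(dg.ConfigurableResource):
    """Base OAuth client; per-API subclasses point at different endpoints."""

    base_url: str
    client_id: str
    client_secret: str

    def fetch(self, path: str) -> dict:
        # Token acquisition, refresh, and retries would live here, written
        # once and inherited by every API-specific subclass.
        raise NotImplementedError


class CRMResource(OAuthAPIResource):
    """One concrete API built on the shared OAuth plumbing."""
```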
[00:40:50] Tobias Macey:
And then continuing on from the earlier conversation of the impact of AI, and the work that you're doing with components and dg to bring more opinionated constructs and more scoping to the problem, is the impact that generative AI capabilities have had on the actual creation and maintenance of code and systems. And I'm wondering how the work that you're doing in Dagster is designed to align with the needs of these AI systems for managing context, managing input, and enabling data engineers and application engineers to generate more of this automatically without necessarily having to have as deep domain knowledge of either Dagster or the specifics of the underlying technologies.
[00:41:44] Nick Schrock:
Yeah. 100%. I mean, we could probably talk for a few hours on this, so I'll try to keep it brief. But I guess, first of all, to set context on this, I have a pretty specific view of how you make a framework optimized for AI codegen. Because I think without proper constraints, AI is a hallucinating demon that is a technical debt superspreader. It can be a serious problem, and it's very easy when you're doing AI codegen to end up in a spot where you do not know how it's working. If there's a bug, the AI can't debug it, you can't debug it, and you're just in this unstable place, and you basically have to throw it away and start over. So I actually think it's very important to structure these systems to allow for the code to be ephemeral and disposable if things go wrong. That's one of the reasons why components was designed the way it was: there's, like, only one spot in the project that gets edited. And then if something goes wrong, you'd be like, okay. And because the cost of creation is so low, like, regenerating something isn't that big a deal. So I think these frameworks and AIs will have this reflexive relationship as they go along, and frameworks will have to be LLM native.
You know, I wrote this article last summer, actually, which influenced a bunch of my thinking, or kind of explained it, that led to components. And I call it the rise of what I'll call medium code. Meaning, it's not low code, it's not, like, no-code clicky stuff, but it's not full software engineering. You're writing a Turing-complete language or a complex declarative language like SQL, but you're doing it in a highly constrained way. And you're usually doing it in some coarse-grained container of code.
In dbt, that's a model. In Jupyter notebooks, that's a cell. And it limits the amount of context a human has to have in order to work properly. But it turns out one of the most interesting things about AI is that what's good for the human is good for the machine and vice versa. Like, you want obvious APIs. You want to limit the amount of context someone has to hold in their head, or literally the number of tokens that are in context, in order to do stuff well. But the key thing as well is that the code that gets generated needs to be precisely interpretable by both a machine and a human. Because if you have to debug stuff, or you have to bring someone in to help debug stuff, it needs to be something with deterministic behavior that is understandable.
So I won't go through the entire article because that's a whole thing. But, basically, I think that the right AI codegen targets have coarse-grained containers of code. They code to some high level framework or DSL that's precisely interpretable, but still part of the software development life cycle, because that is absolutely essential. You need to have guardrails to check to make sure the generated code is correct. You need guardrails for the AI slop. So that's how I consider how you need to think about designing frameworks for the AI native era.
And the components are designed with all this in mind: a high level framework with built in documentation that's customizable by the user. I think documentation will be viewed increasingly as a store of context. So documentation needs to change, meaning that the purpose of it is to provide context to the LLM. And so that's one of the reasons why we, like, focus so much on built in docs for components and allowing custom component authors to inject the domain specific context right there in your code and be able to allow the LLM to scrape it. Yeah. And then high quality error messages are critical to provide feedback.
But I think it's, like, just very important: the goal of building a good AI native framework is to dramatically accelerate the work of the software engineer, not to abstract them away. And I don't design like that just because I like software engineers and I don't want them to go away. That is true. So maybe there's some subconscious, you know, psychology going on there. But more importantly, I think in essence, it is correct. The reason why AI is so exciting is that, if done right, we can abstract away so much of the drudgery, so much of the toil of software engineering, and focus much more exclusively on what you uniquely have judgment on. So to me, the most exciting thing about AI is the ability to abstract away enormous swaths of incidental complexity that we didn't think was possible.
But in order to do that in a way that's effective and safe and actually leads to higher quality systems, the framework designers absolutely need to optimize for it.
[00:46:53] Tobias Macey:
And, also, the other element of working with these AIs effectively is that it removes us from, to your point, the drudgery of dealing with boilerplate, dealing with very narrowly scoped problems, and moves us up to thinking about what is the actual system level requirement and how do I get there so that the LLM can focus on those very narrow domains to be able to stick them together in a way that is composable.
[00:47:20] Nick Schrock:
Yeah. Exactly. It's all very exciting. You know? I hope that people can approach it from an abundance mindset rather than a fear based mindset. But I think it's more that it's a radical change, even if it's gonna be for the better. And that is stressful and anxiety inducing, and I totally get that. Like, in my own development, I am probably not as AI native as, like, I need to be. You know? And, you know, at this point, I'm an old man, so I need to really work on maintaining that brain plasticity to learn new stuff. So I get it, but it's also very exciting.
[00:47:56] Tobias Macey:
I take umbrage at that because I think we're the same age.
[00:48:01] Nick Schrock:
At this point in my life, I wake up very early, and my son always asks me, Dad, are you up because you're an old man? I'm like, yes. My son is six. He's very charming.
[00:48:13] Tobias Macey:
And so as you have been exposing these new capabilities, working with some of the early adopters of dg, the scaffolding, and the components interfaces, what are some of the most interesting or innovative or unexpected ways that you've seen those capabilities applied?
[00:48:29] Nick Schrock:
Yeah. Like I said, we're going with a fairly limited set of design partners. But even among that set, there's been a bunch of great stuff happening. One of our users is onboarding his Databricks-using stakeholders, data scientists who work in hosted notebooks in the Databricks environment. And, you know, the first thing he did was write his own custom component for that, and there's a REST API for running one of these things. Right? So you wrap a custom component around that. You can basically tell your data scientist, like, hey, if you wanna schedule it within the context of the data platform, copy and paste the notebook URL, put it in this YAML file, like, fill out these things. You're good to go. But then he realized the power of the customizable config system, and he started kind of putting all the DevOps stuff in there too. So configuring memory, how the integration with Datadog should work in the context of this thing, like, what metadata to put everywhere and stuff. So it ended up being this, like, kind of single spot where this intrepid data scientist integrating his notebook into the production data platform can control a bunch of different parameters. So I thought that was really cool. Another user went kinda, I would call it, hog wild on the number of custom components he built, and it was all sorts of crazy legacy systems, Talend and I think even Informatica, all this stuff. And so that was really heartening, that someone was able to churn out so many custom components so early in the life cycle of the system. And then another one that comes to mind is that the moment one of our users saw the new project layout and its kind of hierarchical nature, he also saw this capability we have where, kind of, like, in the YAML file at any point in the hierarchy, you can post-process all the definitions above it to, like, apply a common tag or apply the same metadata. And that allowed a really nice separation of responsibility, where the data platform engineer could programmatically apply governance information across the entire system very smoothly, because it kind of changes the way that you can abstract things. Because you basically tell your stakeholder, just put this tag on this asset. Okay? And then later down the line, the platform engineer can process that tag and decide how to interpret it and, like, produce all sorts of derivative metadata, like who's the owner and what team is it on, and, like, this piece of metadata indicates this policy, and I'm gonna write some other piece of code that queries that and makes decisions on it. So the fact that, you know, you show the capability, and then instantly the user is like, wow, I can use this to programmatically control all my governance in a smooth way. Like, that was really cool to see.
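That last pattern, where a stakeholder sets one simple tag and the platform derives the rest, boils down to a small post-processing function applied over every definition. The sketch below is plain Python with an invented policy table, not Dagster's YAML post-processing syntax, but it shows the division of labor:

```python
# Illustrative governance post-processing: stakeholders only set `domain`;
# the platform engineer derives owner, retention, and policy in one place.
POLICY_TABLE = {
    "finance": {"owner": "data-platform-finance", "retention": "7y", "pii": "true"},
    "growth": {"owner": "data-platform-growth", "retention": "90d", "pii": "false"},
}


def derive_governance_metadata(tags: dict[str, str]) -> dict[str, str]:
    """Expand a single stakeholder-supplied tag into full governance metadata."""
    domain = tags.get("domain", "unknown")
    derived = POLICY_TABLE.get(domain, {"owner": "unassigned", "retention": "30d"})
    return {**tags, **derived}


# A stakeholder wrote one line of YAML; the platform fills in the rest.
print(derive_governance_metadata({"domain": "finance"}))
```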
[00:51:22] Tobias Macey:
And then a natural outgrowth too of these components is that it gives you a reproducible and packageable abstraction that you could feasibly build a community library around, of here are all the different components that people have published for their use cases, so that somebody who is new to Dagster can come in and just pick and choose whatever it is that they want to LEGO brick their system together and get up and running.
[00:51:50] Nick Schrock:
That is the idea, and we didn't even talk about this beforehand, but you're setting me up perfectly. You know, one of the things we're building into this is the integrations marketplace, where we wanna be able to index integrations from our own monorepo, the community repo, as well as internally built components. So at, you know, an at-scale customer, we want the centralized data platform team to be able to publish components into a searchable index that has all sorts of metadata and, like, embedded docs and, like, copy-and-pastable code snippets for, like, how to install this thing. So, yeah, we really wanna kick off an ecosystem effect for these things.
[00:52:30] Tobias Macey:
And as you have been building these component capabilities, working with teams, helping them come to grips with how to accelerate their work with AI, how to build for AI with Dagster, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:52:50] Nick Schrock:
That's a good question. I think it is very important to balance innovation with change management, and that can be very challenging. You have to be careful to introduce things incrementally, to allow an incremental process that provides value at every step along the way. When you're adding new capabilities and you're asking users to make any sort of code change, you have to provide value instantaneously at every step and then communicate about it properly. And that can be a challenge. So there's that. And then the desire to incorporate stakeholders, and just how universal the need is to incorporate people into the data engineering process, continues to astound, actually. Data is so instrumental to so many people's jobs, and the person it's instrumental to is often the only person who understands it. Just allowing them to participate directly in the process reduces so much context switching and so many painful collaborative processes that there ends up being a huge organic demand for that sort of thing. So I think that's also been a lesson learned here.
[00:54:13] Tobias Macey:
And so as teams are trying to tackle their data challenges and ramping up on AI use cases, what are the cases where Dagster is the wrong choice?
[00:54:24] Nick Schrock:
Well, we're fundamentally a batch processing system, you know, one that can get to semi-real-time. But if you need microsecond latency, we're not the right tool for that. Likewise, highly dynamic computations that require loops and can't be structured as a DAG: systems like Temporal are more appropriate there. They make different trade-offs. We provide much more structure and constraints, built-in lineage, all this stuff. They have a much more complex state machine, but it is more generic, more imperative, and more flexible. So there is that use case as well. Those are the two that come to mind.
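As a rough illustration of the trade-off Nick describes, here is a minimal sketch of a DAG-structured pipeline using Dagster's public asset API; the asset names and logic are invented for illustration:

```python
# A minimal DAG-structured pipeline in Dagster.
# Asset names and logic are hypothetical.
from dagster import Definitions, asset


@asset
def raw_events() -> list[dict]:
    # In a real pipeline this would ingest from an external system.
    return [{"user": "a", "value": 1}, {"user": "b", "value": 2}]


@asset
def daily_summary(raw_events: list[dict]) -> dict:
    # Depends on raw_events via its parameter name; Dagster wires the edge.
    return {"total": sum(event["value"] for event in raw_events)}


defs = Definitions(assets=[raw_events, daily_summary])
```

A workflow that had to loop until some condition held, or branch on intermediate results at runtime, would not fit this declarative shape as naturally, which is where a more imperative engine like Temporal earns its keep.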
[00:55:07] Tobias Macey:
And as you continue to build and iterate on Dagster and on components, and as you continue to keep pace with the demands of the AI age that we're in, what are some of the things you have planned for the near to medium term of Dagster that you're excited to dig into?
[00:55:24] Nick Schrock:
Well, like I said, obviously I'm working with components, and I'm very excited to continue the journey of expanding that ecosystem. And this in-app editing is going to be amazing. It's in-app editing that I can get behind, because it still ends up checking the changes into source control and running tests on them. That provides us nice layering, so it can make stakeholders happy, and it can make the engineers happy. Most importantly, it can get good outcomes. So we have an entire long road map on the components front for that. In terms of other things I'm super excited about: Dagster is in a unique spot because we have meta-information on the integrations.
We have metadata on the actual definitions defined in code. We have operational metadata. We have all sorts of very interesting metadata. And I think we can evolve Dagster not just into a useful operational tool, but into a generalized context store for all sorts of tools, including our own, across the platform. When we were discussing this components stuff, it was like, okay, yes, we're going to design an abstraction that's good for AI-native codegen, but why else do we have the right to win here? And we have the right to win, in our opinion, because we have a unique view of all the context in the system: across every tool, across the way that you define your pipelines in code, all sorts of stuff. So I'm very excited not just to have AI accelerate the authoring of data pipelines, but to have Dagster's contextual information power AI use cases of all sorts. And I think we're in a great place to do that. I jokingly refer to it as the mother of all MCP servers, because we can aggregate the MCPs of all our integrations and ingest tons of information. Our dbt integration, for example, ingests the full model code. So we can provide that context directly in the same API where we provide the Python definitions of completely different technologies, as well as information about when this asset last failed, as well as information about what things are upstream of it. So we have a great opportunity to be a really compelling context store for AI tools operating on the data platform and across the business.
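To make the context-store idea concrete, here is a hypothetical sketch of the kind of consolidated payload such a service might hand an AI tool for a single asset. The function, field names, and shape are invented for illustration and are not a real Dagster or MCP API:

```python
# Hypothetical: the kind of consolidated context an orchestrator could
# serve to an AI tool for one asset. Field names are illustrative only.
def get_asset_context(asset_key: str) -> dict:
    return {
        "asset_key": asset_key,
        "definition": "Python or dbt source for the asset",  # code-level context
        "dbt_model_sql": "select * from {{ ref('raw_events') }}",  # ingested model code
        "upstream": ["raw_events"],  # lineage context
        "last_materialization": {"status": "FAILED", "at": "2025-06-01T06:00:00Z"},
        "owner": "data-platform-team",  # governance metadata
    }


# An LLM agent could load this into its prompt before attempting a fix.
context = get_asset_context("daily_summary")
```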
[00:57:43] Tobias Macey:
Are there any other aspects of the age of AI, the impact on engineering practices, the work that you're doing at Dagster, or any other related topics that we didn't discuss yet that you'd like to cover before we close out the show?
[00:57:52] Nick Schrock:
Well, there are any number of topics I have opinions on, but I feel like we've been talking for a while, and I don't want to overwhelm anyone. So I think we can wrap it up there. I guess it's a lot of change, but it is very, very exciting. I'm often a skeptic of such things, but these AIs are really doing incredible things I wouldn't have believed were possible even a few years ago. So it's pretty cool to see.
[00:58:23] Tobias Macey:
Yeah. I was skeptical at the outset as well, but I have been reasonably impressed in recent months with some of the capabilities and the ways that it can accelerate work to be done. So definitely
[00:58:37] Nick Schrock:
Claude is really good at codegen now, and I find that ChatGPT, with the o3 model, is really incredible for doing research on the Internet. And I also love how it now shows you what it's doing: hey, I'm fetching from here, I'm fetching from there. That transparency actually builds a lot of trust, so you can kind of check that it's doing things correctly. So, yeah, the use cases are pretty incredible.
[00:59:02] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your contact information to the show notes. And for the final question, I'd like to get your perspective on what you currently see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:18] Nick Schrock:
Ugh. You really had to ask me that. Well, it's hard coming on as a vendor and being asked that, because of course the thing I'm working on is the most important thing. I guess, to me, the missing thing is that all of the hyperscalers, and all the data hyperscalers, are trying to build walled gardens. We're maybe going back to a world where there are Databricks developers and Snowflake developers, and more importantly, there are Databricks companies and Snowflake companies and AWS companies and GCP companies. I think that is a bad state of the world for engineers, because you want people to be able to move flexibly between different companies and have their skills be portable, and we should still be striving for a world of open standards. I don't know if Dagster is the right tool. I mean, I think we can play a part, but there are other parts of the ecosystem that need to step up too. But I hope we live in a world that's less vertically integrated and more horizontally integrated.
And anyone who can help out with that, by building standards, building open source tools, making that story better: that is great. Because I think a world of five walled gardens is kind of a sad one.
[01:00:29] Tobias Macey:
Yeah. Absolutely. Well, thank you very much for taking the time today to join me as usual, and for all the great work that you're doing on Dagster and making it easier for folks to adapt to the changing ecosystem. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It means a lot. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story.
Yeah. Like I said, we're going with a fairly limited set of design partners. But even among that set, there's been a bunch of great stuff happening. One of our users is onboarding his Databricks kind of using stakeholders, data scientists who work in notebooks hosted notebooks in the Databricks environment. And he you know, the first thing he did was write his own custom component to and there's a REST API for running one of these things. Right? So you wrap a custom component around that. You can basically tell your data scientist, like, hey. If you wanna schedule it within the context of data platform, copy and paste the notebook URL, put in this YAML file, like, fill out these things. You're good to go. But then he realized the power of the customizable config system, and he started kind of putting all the DevOps stuff in there too. So configuring memory, how the integration with Datadog should work in the context of this thing, like, what metadata to put everywhere and stuff. So it ended up being this, like, kind of single spot where this intrepid data scientist integrating his notebook into the production data platform can control a bunch of different parameters of stuff. So I thought that was really cool. Another user went kinda I would call it hog wild on the number of custom components he built, and he was all sorts of crazy legacy systems and talent and I think even inform all this stuff. And so that was really heartening that someone was able to kinda churn out so many custom components as early in the life cycle of the system. And then another one that comes to mind is that the moment that one of our users saw the new project layout and it's kind of hierarchical nature, he also saw this capability we have. We're kind of, like, in the YAML file at any point in the hierarchy. You can kind of post process all the definitions that were above it to, like, apply a common tag or apply the same metadata. And that allowed a really nice separation of responsibility where the data platform engineer could programmatically apply governance information across the entire system very smoothly because it kind of changes the way that you can abstract things. Because you basically you tell your stakeholder, just put this tag on this asset. Okay? And then later down the line, the platform engineer can process that tag and then decide how to interpret it and, like, produce all sorts of derivative metadata and, like, who's the owner and what team is it on and, like, this piece of metadata indicates this policy, and I'm gonna write some other piece of code that queries that and makes decisions on it. So the fact that, like, the you know, you show the capability, and then instantly, the user is like, wow. I can use this to programmatically control all my governance in a smooth way. Like, that was really cool to see.
[00:51:22] Tobias Macey:
And then a natural outgrowth of these components is that it gives you a reproducible and packageable abstraction that you could feasibly build a community library around: here are all the different components that people have published for their use cases, so that somebody who is new to Dagster can come in and just pick and choose, à la carte, whatever it is that they want to LEGO-brick their system together and get up and running.
[00:51:50] Nick Schrock:
That is the idea, and we didn't even talk about this beforehand, but you're setting me up perfectly. One of the things we're building into this is an integrations marketplace, where we want to be able to index integrations from our own monorepo and the community repo, as well as internally built components. So at an at-scale customer, we want the centralized data platform team to be able to publish components into a searchable index that has all sorts of metadata, embedded docs, and copy-and-pasteable snippets for things like how to install the component. So, yeah, we really wanna kick off an ecosystem effect for these things.
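As a sketch of what one entry in such a searchable index might carry, here is a hypothetical data shape; the field names, the component name, and the install command are all illustrative assumptions, not Dagster's actual marketplace schema.

```python
# Hypothetical shape for one entry in a searchable component index.
# Field names and values are illustrative, not a real Dagster schema.
from dataclasses import dataclass, field

@dataclass
class ComponentIndexEntry:
    name: str                        # e.g. "databricks_notebook"
    source: str                      # "monorepo" | "community" | "internal"
    description: str                 # embedded docs shown in the marketplace
    install_snippet: str             # copy-and-pasteable install command
    tags: list[str] = field(default_factory=list)

entry = ComponentIndexEntry(
    name="databricks_notebook",
    source="internal",
    description="Schedule a hosted Databricks notebook as a Dagster asset.",
    install_snippet="uv add my-org-dagster-components",
    tags=["databricks", "notebooks"],
)
```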
[00:52:30] Tobias Macey:
And as you have been building these component capabilities, working with teams, helping them come to grips with how to accelerate their work with AI, how to build for AI with Dagster, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:52:50] Nick Schrock:
That's a good question. I think it is very important to balance innovation with change management, and that can be very challenging. You have to be very careful to introduce stuff incrementally, to allow an incremental process that provides value at every step along the way. When you're adding new capabilities and you're asking users to do any sort of code change, you have to provide value instantaneously at every step and then communicate about it properly. And that can be a challenge. So there's that. And then the desire to incorporate stakeholders, just how universal the need is to incorporate people into the data engineering process, continues to astound, actually. Data is so instrumental to so many people's jobs, and the person it's instrumental to is often the only person who understands it. Just allowing them to participate directly in the process removes so much painful context switching and collaborative overhead that there ends up being huge organic demand for that sort of thing. So I think that's also been a lesson learned here.
[00:54:13] Tobias Macey:
And so as teams are trying to tackle their data challenges and ramping up on AI use cases, what are the cases where Dagster is the wrong choice?
[00:54:24] Nick Schrock:
Well, we're fundamentally a batch processing system that can get to semi-real-time. But if you need microsecond latency, we're not the right tool for that. And for highly dynamic computations that require loops and can't be structured into a DAG, systems like Temporal are more appropriate. They make different trade-offs: we provide much more structure and constraints, built-in lineage, all this stuff, whereas they have a much more complex state machine that is more generic, more imperative, and more flexible. So there is that use case as well. Those are the two that come to mind.
[00:55:07] Tobias Macey:
And as you continue to build and iterate on Dagster and on components, and as you continue to keep pace with the demands of the AI age that we're in, what are some of the things you have planned for the near to medium term of Dagster that you're excited to dig into?
[00:55:24] Nick Schrock:
Well, like I said, obviously I'm working on components, and I'm very excited to continue the journey of expanding that ecosystem. And this in-app editing is gonna be amazing. It's in-app editing that I can get behind, because it actually still checks the stuff into source control and runs tests on it. That provides us nice layering, so it can make stakeholders happy, and it can make the engineers happy. Most importantly, it can get good outcomes. So we have an entire long roadmap on the components front. In terms of other things I'm super excited about, Dagster is in a unique spot because we have meta-information on the integrations.
We have metadata on the actual definitions defined in code. We have operational metadata. We have all sorts of very interesting metadata. And I think we can evolve Dagster not just into a useful operational tool, but into a generalized context store for all sorts of tools, including our own, across the platform. When we were discussing this components stuff, it was like, okay, yes, we're going to design an abstraction that's good for AI-native code generation, but why else do we have the right to win here? And we have the right to win, in our opinion, because we have a unique view of all the context in the system: across every tool, across the way that you define your pipelines in code, all sorts of stuff. So I'm very excited to not just have AI accelerate the authoring of data pipelines, but to have Dagster's contextual information power AI use cases of all sorts. And I think we're in a great place to do that. I jokingly refer to it as the mother of all MCP servers, because we can aggregate the MCPs of all our integrations and ingest tons of information. Like, in our dbt integration, we ingest the full model code, right? So we can provide that context directly in the same API where we can provide the Python definitions of completely different technologies, as well as information about when this last failed, as well as information about what things are upstream of this. So we have a great opportunity to be a really compelling context store for AI tools operating on the data platform and across the business.
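As a rough illustration of that "mother of all MCP servers" idea, here is a hedged sketch of what assembling such a context payload might look like. Every loader function below is a hypothetical stand-in with fabricated example values; none of this is an actual Dagster API.

```python
# A hypothetical sketch of the "context store" idea: assemble everything
# the orchestrator knows about an asset (definition code, operational
# history, lineage) into one payload an AI tool can consume.
# All loader functions are stand-ins, not real Dagster APIs.
from typing import Any

def load_definition_metadata(asset_key: str) -> dict[str, Any]:
    """Stand-in: e.g. the dbt model source ingested by a dbt integration."""
    return {"model_sql": "select * from raw.orders", "owner": "team:revenue"}

def load_operational_metadata(asset_key: str) -> dict[str, Any]:
    """Stand-in: e.g. recent run history and failures."""
    return {"last_failure": "2025-06-01T04:12:00Z", "avg_runtime_s": 92}

def load_lineage(asset_key: str) -> list[str]:
    """Stand-in: keys of upstream assets."""
    return ["raw/orders"]

def build_ai_context(asset_key: str) -> dict[str, Any]:
    """One payload combining code-level, operational, and lineage context."""
    return {
        "asset": asset_key,
        "definition": load_definition_metadata(asset_key),
        "operations": load_operational_metadata(asset_key),
        "upstream": load_lineage(asset_key),
    }

print(build_ai_context("analytics/orders"))
```

The point of the sketch is the aggregation: each source of metadata already exists somewhere in the platform, and the value comes from serving them through a single interface.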
[00:57:43] Tobias Macey:
Are there any other aspects of the age of AI, the impact on engineering practices, the work that you're doing at Dagster, or any other related topics that we didn't discuss yet that you'd like to cover before we close out the show?
[00:57:52] Nick Schrock:
Well, there are any number of topics I have opinions on, but I feel like we've been talking for a while, and I don't wanna overwhelm anyone. So I think we can wrap it up there. It's a lot of change, but it is very, very exciting. I'm often a skeptic of such things, but these AIs are really doing incredible things I wouldn't have believed were possible even a few years ago. So it's pretty cool to see.
[00:58:23] Tobias Macey:
Yeah. I was skeptical at the outset as well, but I have been reasonably impressed in recent months with some of the capabilities and the ways that it can accelerate work to be done. So definitely
[00:58:37] Nick Schrock:
Claude is really good at codegen now, and I find that ChatGPT, with the o3 model, is really incredible for doing research on the Internet. And I also love how it now shows you what it's doing: hey, I'm fetching from here, I'm fetching from there. That transparency actually builds a lot of trust, so you can kind of tell that it's doing stuff correctly. So, yeah, the use cases are pretty incredible.
[00:59:02] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your contact information to the show notes. And for the final question, I'd like to get your perspective on what you currently see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:18] Nick Schrock:
Ugh. You really had to ask me that. Well, it's hard coming on as a vendor and answering that, because of course the thing I'm working on is the most important thing. I guess, to me, the missing thing is that all of the hyperscalers and all the data hyperscalers are trying to build walled gardens. We're maybe going back to a world where there are Databricks developers and Snowflake developers, and more importantly, there are Databricks companies and Snowflake companies and AWS companies and GCP companies. I think that is a bad state of the world for engineers, because you want people to be able to move flexibly between different companies and have their skills be portable, and we should really still be striving for a world of open standards. So, you know, I don't know if Dagster is the right tool. I mean, I think we can play a part, but there are other parts of the ecosystem that need to step up too. But I hope we live in a world that's less vertically integrated and more horizontally integrated.
Anyone who can help out with that by building standards, building open source tools, making that story better, that is great. Because a world of five walled gardens is kind of a sad one.
[01:00:29] Tobias Macey:
Yeah. Absolutely. Well, thank you very much for taking the time today to join me as usual, and for all the great work that you're doing on Dagster and on making it easier for folks to adapt to the changing ecosystem. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It means a lot. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Nick Schrock and Dagster Labs
Impact of AI on Data Platforms
Dagster's Evolution and New Features
Components and Modularization in Dagster
AI's Influence on Data and Application Engineering
Project Structuring with the dg CLI
AI Code Generation and Medium Code
Lessons from Implementing Components
Future Directions for Dagster