Summary
In this episode of the Data Engineering Podcast we welcome back Nick Schrock, CTO and founder of Dagster Labs, to discuss the evolving landscape of data engineering in the age of AI. As AI begins to impact data platforms and the role of data engineers, Nick shares his insights on how it will ultimately enhance productivity and expand software engineering's scope. He delves into the current state of AI adoption, the importance of maintaining core data engineering principles, and the need for human oversight when leveraging AI tools effectively. Nick also introduces Dagster's new components feature, designed to modularize and standardize data transformation processes, making it easier for teams to collaborate and integrate AI into their workflows. Join in to explore the future of data engineering, the potential for AI to abstract away complexity, and the importance of open standards in preventing walled gardens in the tech industry.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
- Your host is Tobias Macey and today I'm interviewing Nick Schrock about lowering the barrier to entry for data platform consumers
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving your summary of the impact that the tidal wave of AI has had on data platforms and data teams?
- For anyone who hasn't heard of Dagster, can you give a quick summary of the project?
- What are the notable changes in the Dagster project in the past year?
- What are the ecosystem pressures that have shaped the ways that you think about the features and trajectory of Dagster as a project/product/community?
- In your recent release you introduced "components", which is a substantial change in how you enable teams to collaborate on data problems. What was the motivating factor in that work and how does it change the ways that organizations engage with their data?
- tension between being flexible and extensible vs. opinionated and constrained
- increased dependency on orchestration with LLM use cases
- reducing the barrier to contribution for data platform/pipelines
- bringing application engineers into the mix
- challenges of meeting users/teams where they are (languages, platform investments, etc.)
- What are the most interesting, innovative, or unexpected ways that you have seen teams applying the Components pattern?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the latest iterations of Dagster?
- When is Dagster the wrong choice?
- What do you have planned for the future of Dagster?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Dagster+ Episode
- Dagster Components Slide Deck
- The Rise Of Medium Code
- Lakehouse Architecture
- Iceberg
- Dagster Components
- Pydantic Models
- Kubernetes
- Dagster Pipes
- Ruby on Rails
- dbt
- Sling
- Fivetran
- Temporal
- MCP == Model Context Protocol
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Coresignal's multi source enriched cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Coresignal's self-service platform for free today.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds.
And with collaborative data contracts, engineers and business can finally agree on what done looks like so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings and less back and forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a 1,000 plus dollar custom mechanical keyboard.
Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week, which starts on June 9. Your host is Tobias Macey, and today I'm welcoming back Nick Schrock to talk about lowering the barrier to entry for data platform consumers and the impact that the current era of AI has had on data engineering. And so, Nick, for folks who haven't heard you before in your numerous past appearances, can you give a quick introduction?
[00:02:21] Nick Schrock:
Yeah. Tobias, thanks for having me back. It's always a pleasure. So my name is Nick Schrock. I'm the CTO and founder of Dagster Labs. We're the company behind Dagster, which is a data orchestration framework. And, yeah, kinda my background is I cut my teeth at Facebook engineering. And the project I was most known for prior to Dagster was GraphQL, which I initially created, and we open sourced it. Well, first, we built it inside Facebook, and then we open sourced it. And it became a broadly used technology. But then I moved on to data engineering and data platforms, and I've been working on Dagster for quite a while now.
[00:02:58] Tobias Macey:
And so we don't need to get too much into the usual flow because we've covered that in past episodes. But I think that given the current state of the industry and all of the hype around AI, I'm wondering if you can just start by giving your summary of the impact that the overall adoption and growth of AI and automation and agents has had on data platforms and data teams?
[00:03:26] Nick Schrock:
Yeah. That's an interesting question. I guess, just to set a framework for this, I think there are varying degrees of how people view AI. I like to joke around and call it, like, there's, like, AI boomers and AI doomers. And AI boomers are like, oh, we've seen this before, blah blah blah. And AI doomers are like, this is the end of all human labor. It's going to zero. We should just bomb the data centers, literally. And I would call myself squarely in the middle. I think it's gonna be incredibly disruptive. I think it's gonna be as important as, say, the transition from the industrial to the information age or the pre-industrial to industrial age.
But I think that it will be a massive productivity boost to lots of people, including engineers. And far from software engineers going away, I actually think it will expand the number of people writing software and will make them more leveraged. So, you know, in terms of its impact on software engineering in this industry, I'm very far from being a doomer. I think it will be a renaissance in software engineering, and it's super exciting, but it will fundamentally change the practice. In terms of its impact on the data platform space specifically, I think in reality, in the day to day lives of practitioners working in data platforms, it's kind of like there's been an earthquake.
There's a tsunami out there, but it hasn't really hit shore yet. And what I mean by that is that I think lots of people are using AI tools to write software in an accelerated manner. I think lots of people are starting to work on AI projects at their various organizations, you know, especially use cases involving structured data. And I think some of their tools outside of their code editor do have AI features, but I don't think it has fundamentally changed their world yet. So I think everyone's kind of waiting for it and, you know, figuring out how to adjust to this new future.
But I think we're at the beginning of the inflection point, so it's kind of this odd state for most people where their day to day hasn't changed that much, but they know it's going to in some time horizon. I don't know if that resonates with you. But
[00:05:58] Tobias Macey:
Yeah. It absolutely does. And I was actually recently giving a presentation for another company to their engineering team because they wanted to get my thoughts on the future of data engineering and the impact of AI. And I think that there's definitely a lot of new work to be done, but the fundamentals of the work don't change. There's definitely a lot of change in terms of the specifics of the tooling, but in principle, the job stays the same. The way that I distilled it for the presentation was that the role of data engineers is to turn raw information into knowledge and enable the business to make use of that knowledge to either make better products, power the applications that they run, make better decisions, etcetera.
And using data to power these different AI systems maybe takes a different shape because you're bringing in a vector index versus just a star schema, or maybe you're able to unlock more value out of the unstructured data that you've been storing since Hadoop hit the scene. But, ultimately, the core fundamentals of the job are the same, where you find data that can be used for something. You run it through some sort of transformation. You get it into a manner where you imbue context and semantics into the data beyond just the raw bits, and then you feed that into some downstream system, whether that's business intelligence or an LLM or a web dashboard or whatever might be the case. So the work is fundamentally the same. It's just the shape of it is evolving a little bit, and the speed is probably increasing. And I think there's another interesting aspect of it, where the way that I draw the distinction is that you're either building for AI, where you're using data to feed it into an LLM, or you're building with AI, where you're actually using the LLM to generate that transformation and help you iterate faster and find anomalies, etcetera.
[00:07:56] Nick Schrock:
Yeah. That's right. That makes sense. You know? An analogy I like to use is that when accountants first saw the spreadsheet, they were probably like, it's over. Right? And when people saw calculators, they were like, oh, no one's gonna need to know math anymore. That's fundamentally not true. You need to know the principles in order to evaluate and use the tools. I do think that, you know, data engineering especially is so critical, and evaluating correctness requires so much business context and localized context, that it will, again, fundamentally change the practice, but there will need to be a human who has deep understanding of technical systems and business context in order to make these things work.
[00:08:41] Tobias Macey:
I think too that's interesting because, with the injection of these LLMs and AIs and Copilots into things like software engineering, or even, like, Microsoft Copilot getting embedded into various office suites and Gemini trying to hook into all of the Google products, it highlights and accentuates the work that data engineers have been doing in the background for so long, because everybody's trying to use these AIs. And, ultimately, it works in some cases, but as soon as you try to go outside of the bounds of what the system was specifically built for, you hit trouble. For instance, if you're a software engineer and you're iterating with Copilot in your editor, and then all of a sudden you say, oh, I actually need to build something that touches on some other data system that isn't directly embedded in my application, all of a sudden you have a problem because the LLM doesn't know anything about it, and then you have to do your own little bit of data engineering to pull the information into that context to enable the AI to do what you want it to do. So it's just kind of bringing more of that work into everybody else's day to day that up until then was just, oh, hey. I need this data, so I'm gonna go throw it over to the data team and ask them to do it for me.
[00:09:53] Nick Schrock:
Yep. Makes sense.
[00:09:56] Tobias Macey:
And so digging now into what you're building at Dagster, again, we've talked about the fundamentals of it and some of its evolution in previous episodes. But for people who maybe listened to the last episode, which was, I think, maybe a year or so ago, I'm wondering if you can give a bit of an overview of what has changed, some of the new stresses that the evolution of the data ecosystem and the impact of AI have had on the ways that you think about data orchestration and the role of Dagster in the overall stack.
[00:10:29] Nick Schrock:
Yeah. So what's happened the last year, so I'll do it from a very, like, Dagster-centric way since that's the universe that I'm in. What we see is people doing much more advanced things with the orchestration system, wanting to get much deeper observability into their systems. And then also, you know, we see more and more centralized data platform teams and data engineers who are building frameworks for less technical stakeholders. And that fundamentally changes their job from building data pipelines directly to dynamically building data pipelines based on what some other stakeholder wants them to do. And that kinda, like, led to the product developments that we're doing today.
You know, I think the other thing that we've seen is that, in a predictable fashion, what I'll call the data hyperscalers, Snowflake and Databricks, are beginning to attempt to consolidate as much as possible and building tools in every single vertical. And so that is quite interesting. The counterforce to that is the full embrace of open table formats, which is a big trend, and sort of the standardization of the term lakehouse to describe data stacks that are built over these open table formats. So I think that's a huge megatrend as well that we're seeing. Iceberg is kind of similar to AI in that it's kind of this, like, tsunami that is coming. But, you know, adoption is still, I would call, modest. But it will come, and it's pretty exciting.
[00:12:21] Tobias Macey:
And in terms of Dagster itself, I know that in one of your recent releases, you introduced this concept of components, where you're focusing on trying to modularize and standardize the different elements of the transformation flow and allow people to have reusable and more quickly instantiated data assets based on particular concepts and guardrails, which is a fairly notable change to the way that the framework has worked up until now, where if you wanted to do anything, you needed to dig into some Python code and figure out how it all wires together. And I'm wondering, for teams who are using Dagster, either up until now or in particular people who are newly onboarding onto Dagster, how that changes the overall collaboration patterns for people who are consuming data or working with data. I'm thinking in terms of data analysts, analytics engineers, but also application engineers, and how that changes the work to be done for these data platform teams and people who are closer to the infrastructure and the technical details of the data pipelines.
[00:13:35] Nick Schrock:
Yeah. Lot to unpack there. You know, components has been in preview for a couple months, and we'll be releasing it to release candidate in July. So we've been working with select design partners to work on it. I guess I'll start with the trends we saw among both our own users and also data platform teams that were using other orchestrators. And we kinda, like, saw a bunch of patterns and converged on a single project, which we think addresses a bunch of the issues. I guess I mentioned it in the last question, but, like, tons of people are dynamically building data pipelines, meaning that they're not directly just authoring tasks. They're not just directly building operators in Airflow. They're not just writing the asset functions in Dagster. They are working at a higher level abstraction and rolling their own systems to programmatically generate those things based on higher level APIs that they present to their users. Okay. There's that. Many of them who are doing that are doing it with a config-driven or some sort of front end. Right? YAML, JSON, even, you know, persisting it in a database, you know, all sorts of stuff. With that generation, they also programmatically generate metadata and apply policy across their data platform.
And, you know, I think the other thing is that the data orchestrators are all introducing more concepts. So tasks, assets, we have asset checks. There's metadata concepts, sensors. You know, there's, like, a whole bunch of individualized abstractions. Usually, when someone's interacting with the orchestrator, they often are integrating with a specific technology, and they don't wanna think in terms of those lower level things. For example, Dagster, when it integrates with dbt, ingests the entire model graph and surfaces each model as a software-defined asset, which is kinda what makes our dbt integration best in the business. It's code that generates those things. It programmatically scrapes the model graph in dbt and code-generates stuff. And the job to be done for the orchestrator is, like, integrate with the project. And then lastly, so there's a bunch of stuff, I know, but lastly, they wanna be able to bring in more stakeholders with friendlier interfaces and ideally have AI-friendly codegen targets. So we saw all those trends, programmatic generation of pipelines, config-driven pipelines, the desire for a high level abstraction, AI-native codegen, and components is what came out of that, which is kind of the project we released. And I think the way to think about it generally is that it provides an integrated way with the framework to, in a principled way, programmatically generate definitions. Right? And the killer use case for that typically is a YAML front end that you can present to your stakeholders.
But I also wanna emphasize, there are lots of people who don't wanna program in YAML, and trust me, I understand completely. I like types and Turing-complete languages and all sorts of stuff. So it's not just YAML. It's also a lightweight Python API on top of that. But in effect, you kind of separate metadata from the underlying complicated code. And that metadata can be expressed in YAML, right, or in very lightweight Python. Right? In the end, it's like Pydantic models. So you can program against that if you prefer it. But in that way, we have a native way to programmatically build definitions in the framework. And for the Pythonistas out there, it makes it so you defer definition generation until after the Python import process is complete.
And I cannot emphasize how important that is to build reliable systems that dynamically generate these things. If you're using Airflow or Dagster today, when you programmatically generate the definitions, it's happening at Python import time. And if you're talking to databases or doing something computationally expensive, or if you wanna unit test that thing, you do not want that to happen, actually. So kind of the core thing here is that components is a composability abstraction that allows you to dynamically, and in a deferred way, load up definitions, and by definitions, I mean the structure of the data pipelines. But the killer use case, and I was kind of talking in highfalutin terms there, the killer use case that people understand, and what's meeting them where they are, is providing an integrated, tool-rich, self-documenting YAML DSL with a pluggable back end for your users.
And it really is a lovely interface between data platform engineers and their stakeholders.
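For illustration, the pattern Nick is describing, a schema-checked YAML front end whose definitions are built only when the framework asks for them, might look roughly like this minimal sketch. The names here are hypothetical and are not Dagster's actual components API; they just show the shape of the idea.

```python
# Minimal sketch of a YAML "front end" validated by a Pydantic model, with
# definition construction deferred until something explicitly asks for it
# (rather than running as a side effect of Python import).
# All names are hypothetical, not Dagster's actual components API.
import yaml
from pydantic import BaseModel


class IngestParams(BaseModel):
    """The schema behind the YAML that a stakeholder edits."""
    table: str
    source_url: str
    schedule: str = "@daily"


SPEC = """
type: my_platform.IngestComponent
attributes:
  table: raw.orders
  source_url: https://example.com/orders.csv
  schedule: "@hourly"
"""


def build_definitions(spec_text: str) -> IngestParams:
    # Called lazily by the platform, so validation (and anything expensive
    # behind it) never runs at import time and can be unit tested directly.
    attributes = yaml.safe_load(spec_text)["attributes"]
    return IngestParams(**attributes)  # fails fast with a clear schema error


if __name__ == "__main__":
    print(build_definitions(SPEC))
```

Because the stakeholder-facing metadata lives in a file with a declared schema, a CI job can validate every spec in milliseconds without importing any of the heavy pipeline code, which is the fast feedback loop Nick mentions below.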
[00:18:21] Tobias Macey:
So as you're describing that and the philosophy around it, it puts me in mind a lot of things like the separation of concerns that Kubernetes is focused on, where you have the infrastructure and compute layer that the DevOps and the infrastructure engineers are responsible for and then the API and the the user space layer that people who are building applications that they want to deploy are integrating with. And so it gives a shared infrastructure with a clear delineation between responsibilities for people to be able to build on top of, which has then enabled a massive ecosystem of other capabilities built across both of those dividing lines.
Another thing that comes to mind is the, the the very declarative infrastructure as code community that is built up around things like Terraform and Pulumi of you have cloud providers. All the cloud providers as you're deploying these resources have state that needs to be maintained and tracked. And, similarly, in data engineering, you're building complicated resources that interact with each other in dynamic ways that are all dependent on state that you need to be able to understand and maintain and operate across. So I think that what you're building with components brings a lot of those same ideas into the space of actually building these data pipelines where you can have that interface boundary between the platform team and the consumers thereof, in a similar way as what we've done in the kind of cloud native ecosystem.
[00:19:52] Nick Schrock:
I couldn't have said it better myself, Tobias. That was great. Our data engineer who runs our own platform, which is fairly substantial, actually, now that we're a more mature SaaS business, the way he expressed it, he's like, finally, I have a front end for the data platform, which I think is kind of what you were saying in a much more simplified form, where it's like, I manage all this state, but there's just this clear abstraction layer and there's a front end for it. And even when you're working by yourself, it's very useful to have this abstraction so you can kind of switch your brain when you're working on stuff. And then there's also, like, a very concrete advantage to slapping a bunch of metadata in, like, a separate spot that is either Python with, like, no dependencies or YAML, which can be loaded dynamically: you can, like, do syntax checks very, very quickly.
So it can speed up developer loops and CI a lot. If you're moving a lot of activity into, like, the YAML or metadata space, the feedback loop's super fast too. So there's kind of interesting product implications as well.
[00:21:02] Tobias Macey:
And then the topic that we, I think, have to touch on, because every time somebody says, oh, I've got this great abstraction layer, it's gonna make everything easier, you don't have to worry about it, which puts me in mind a lot of the various cycles of low code or no code tooling, is that everybody says, great, it'll be so much easier for you to build these things, I've worried about all the complex stuff for you. And that works well to begin with. And then you have to start doing things that are specific to your problem domain, addressing edge cases, and then you start bumping up against the capabilities of the system, and you end up having to just drop down to a lower and more complex level to be able to actually get the work done. And so I'm wondering how you've thought about that balance of making things very easy to use, opinionated, and constrained versus maintaining the flexibility and adaptability that's necessary for such a complex domain.
[00:21:55] Nick Schrock:
Yeah. And I think this comes from having a lot of experience. Where things go wrong is where people think they can eliminate more complexity than they actually can with a framework. And it imposes too many constraints on itself, and it's not sufficiently customizable. The reality is that every business is complicated and specific, and everyone is in their own context. And so you can't know the complexity of all those things. What you can do is provide tools and abstractions and infrastructure so that platform engineers and engineers in general can subdivide that complexity into consumable, understandable parts.
And then the other thing a system can do is provide cross-cutting complexity reduction that is domain neutral. So I think if you understand those two things and do that, you get this right balance of having a thing that actually reduces the essential complexity of a program as well as allows you to scale the program well. So, for example, in, like, components, we built all this tooling around the YAML front end. So there's, like, really nice error messages. You know, you can run it in CI. There's a CLI interface to it, all this stuff. That is just complexity reduction that cuts across all domains. Right? It's basically useful for anyone who's using that technology.
Then there's the other stuff about, like, how do you make it so that, you know, people can kind of take the complexity of their world and put it into a containable chunk. And that's why this is a Python-native system with porous borders between YAML and Python, where the data platform engineers are empowered to build custom components, right, that have a structured front end. They can package up this complexity, have it be self-documenting, have a nice YAML front end for it. But we're not pretending like the job of building that custom component is not going to be difficult.
And it's not difficult because we're making it difficult. It's difficult because it is difficult. There are just problems in your world that we cannot know and that are complicated. And you are a smart person, and we wanna get out of your way when you're doing that. But we want you to be able to capture that, like, complexity in a nice consumable chunk and present it to your stakeholders. So I think it's just about knowing what you can assume control over and support, and still providing that flexibility to the user of the framework so that it's adaptable to their own needs.
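As a concrete illustration of that containable chunk, a custom component written by a platform engineer might take a shape like the sketch below. The class and field names are invented for this example and are not Dagster's actual API; the point is the small, typed, self-documenting surface wrapped around the hard parts.

```python
# Hypothetical custom component: the platform engineer owns the messy,
# org-specific logic (auth, retries, cluster settings); stakeholders only
# ever see and edit the three fields on the params model.
from pydantic import BaseModel


class NotebookJobParams(BaseModel):
    """What a stakeholder fills out, in YAML or lightweight Python."""
    notebook_url: str
    cluster_size: str = "small"
    timeout_minutes: int = 60


class NotebookJobComponent:
    def __init__(self, params: NotebookJobParams):
        self.params = params

    def build_defs(self):
        # All the genuinely difficult, domain-specific work lives here,
        # written once by the platform team and hidden from consumers.
        raise NotImplementedError
```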
[00:24:50] Tobias Macey:
I think another interesting element of where we are in the industry, both data and otherwise, is that the introduction of generative AI and the capabilities that that engenders has brought the use of data more fully into the space of application engineering, where up until now, the application had its own data that it cared about and maintained, that data engineers would then extract and rip out of context and then have to rebuild that context for various business use cases. And now we've come full circle, where the data across the organization is now getting fed back into the application context via these LLMs and things like RAG systems and fine tuning and needing to be able to do things like manage semantic memory for the LLMs, etcetera.
And so that means that application engineers need to be more aware of the data that exists within the organization to be able to power those use cases and more empowered to be able to actually operate on that data to address the needs of the application in the ways that these LLMs are using that context. And I'm wondering how you're seeing that and the work that you're doing with components play out in terms of bringing application engineers more into the space of operating across organizational data and the interaction patterns that they have with data platform teams, data analysts, business stakeholders, etcetera.
[00:26:18] Nick Schrock:
One way I think about it is that in the AI era, and this was becoming more and more true as more data platform assets were being incorporated into production app logic, but this is just gonna supercharge it, the phrase is, like, data engineering is becoming software engineering, and at the same time, software engineering is also partially becoming data engineering. Because you need to do some data engineering on your application data in order to correctly feed it into things that feed back into your application. So I think there's two things in Dagster that help with that. One is that we have developed this protocol called Pipes, which allows you to invoke Dagster-native compute in external programming languages in a super lightweight way.
So we have Pipes clients for TypeScript, Rust, Java, and I think a couple other languages. I know some users have done, like, some in C#. But, effectively, that allows a user to write code in their native language, and then we provide lightweight APIs to stream metadata back to us. And we also launch that process. Well, we actually don't launch it. We, in a pluggable way, can inject context into that process so that they can get, like, what partition is being materialized or any other sort of config. So we kind of have a back end protocol, so you can write data processing logic that needs to be in the data platform in the language of your choice. And then second of all, components is sort of the front end for that, meaning that when you're in orchestrator land and you need to connect the business logic you wrote in some other programming language to where it needs to execute in the orchestrator and the metadata around that, you can use components in order to kind of set that up.
And the goal of components is so that someone who's sort of, like, external to the data platform can sort of wander in and do what they need to do without learning a complicated framework. They just kinda, like, you know, see where their teammate put a similar thing. Maybe they copy and paste the file, or they know the scaffolding command that scaffolds up the same thing, and then they can just, like, edit some configuration. There's a type ahead in the editor, and there's embedded documentation. They can verify it, and then they can go on their merry way. It's kinda like two-sided. We want Dagster to be a multilingual ecosystem, be able to have a Dagster-native experience while having a very lightweight mechanism for doing that in other programming languages, and then have a very easy way for a stakeholder to come in and incorporate and integrate their compute into the data platform.
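The external-process side of Pipes is deliberately tiny. A sketch of what an instrumented script can look like in Python follows; the same shape applies to the TypeScript, Rust, and Java clients Nick mentions. This is based on the dagster-pipes package, but treat the exact calls as approximate and check the current docs before relying on them.

```python
# External process instrumented with Dagster Pipes: the real business logic
# runs wherever it runs; this thin layer streams logs and structured
# metadata back to the orchestrator that launched it.
from dagster_pipes import open_dagster_pipes

with open_dagster_pipes() as context:
    # Context injected by the orchestrator, e.g. which partition to build.
    context.log.info(f"materializing partition {context.partition_key}")

    row_count = 42  # stand-in for the actual transformation work

    # Report results back; they show up as metadata on the asset in Dagster.
    context.report_asset_materialization(metadata={"row_count": row_count})
```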
[00:29:06] Tobias Macey:
One of the interesting things that I'm dealing with in my own usage of Dagster right now is that we have built up a set of pipelines, asset definitions, etcetera, that are running in production. They all do what we need them to do. But as the use of AI moves more into that application layer, the application engineers need to be able to operate on data and be able to fetch and transform data in their own work. And so I'm stuck with figuring out, okay, well, how do I onboard people into the data platform more easily? And so one of the objectives is to do what you said and turn the existing repo of Dagster code into more of a set of platform capabilities that maybe get published as Python packages, or what have you, or these components, so that the application engineers can actually write their asset logic next to the code that they care about in, you know, their Django app or whatever it might be, so that there are no repository boundaries that they have to cross to get their work done, and so that they don't have to have any boundaries in terms of a hand off to another teammate just to close the loop on the thing that they care about. And so I'm wondering how you're seeing this introduction of components and the constructs that you've built up in Dagster up till now enable situations like that, where you have that core capability of one team manages the data ingest, manages the definition of these assets, and then being able to hand off those assets to another team, particularly when you're in the case of, oh, I'm running on my laptop, and I need to be able to make sure that the pipeline does what I want it to do so that I can make sure that my feature works on this data the way that I presume, without having to replicate all of the data across multiple different environment boundaries.
[00:30:58] Nick Schrock:
Yeah. There's a lot in that. You know, I think at a very basic level, there's embedded documentation capabilities in Dagster where, you know, you can have long form descriptions that are Markdown formatted that then appear on kind of the home page for the asset in Dagster, and a team can use that to establish a norm. They're like, hey, if you visit this home page, there's, like, a little code snippet to know how to read it, or a link to the right tool to read it. Dagster itself doesn't kind of enforce where or how you store anything. That's one of the other, you know, it's an example of a problem where we're like, hey, the world is complicated. We'll provide some built in integrations for stuff, but in the end, probably, you have to control what's going on. You know, in terms of, I guess, you kinda spoke to two things, to repeat what you were saying. One is kinda like, I am an application engineer, and I wanna access the underlying dataset, like, literally the table in Snowflake or something. Right?
In that way, we are much more of just a nexus of metadata and documentation that you can point your users to. And it can make it very smooth for you, the platform engineer, to add information to that, because you can just, like, add stuff to your source, add stuff into your repo, and then it gets exposed in a very accessible tool. So that's cool. Then there's the notion of the application engineer actually interacting with the Dagster platform and adding stuff to it, maybe in a different repo. And right now, certainly, in that scenario that I talked about, the way that we would envision it is that even with components, they would still have to go into a repo and submit a PR and go through a process. What's on our road map, however, is in-app editing of these component YAML specifications, or the front end, if you will. And we did a hackathon where we prototyped it, and it's, like, super exciting, because in the end, from the user's perspective, it feels like in-app editing. But in the background, it's actually submitting a PR on your behalf and triggering CI.
You know? But from the user's perspective, it'll be, like, green, and then they can just hit save, pretty much, and it'll submit the change to the platform. And we think that is super exciting. The other thing that this enables users to do, and we might go in this direction as well, because people do it already, is they have config systems that they expose to users in their native repo, and then they set it up so that Dagster programmatically fetches those configs from elsewhere and dynamically constructs the pipelines. That's, like, another approach that I think is, like, more advanced, but other teams are doing it today.
And, likely, we will support that as well in the future.
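That fetch-configs-and-build pattern reduces to an asset factory. A rough sketch under stated assumptions follows: the config source and spec fields are invented for illustration, while the dagster decorators themselves are the library's public API.

```python
# Sketch: stakeholders maintain small config dicts in their own repo or
# service; the data platform fetches them and generates one asset apiece.
import dagster as dg


def make_asset(spec: dict) -> dg.AssetsDefinition:
    # Factory function so each generated asset captures its own spec.
    @dg.asset(name=spec["name"], group_name=spec.get("team", "default"))
    def _asset(context: dg.AssetExecutionContext) -> None:
        context.log.info(f"building {spec['name']} from {spec['source']}")

    return _asset


# In practice these would be fetched from another repo, bucket, or API.
specs = [
    {"name": "orders_clean", "team": "growth", "source": "s3://bucket/orders"},
    {"name": "daily_revenue", "team": "finance", "source": "warehouse.orders"},
]

defs = dg.Definitions(assets=[make_asset(spec) for spec in specs])
```

As Nick notes earlier, the caveat with doing this in a plain Python module is that it runs at import time; components exist partly to give this generation step a deferred, testable home.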
[00:34:03] Tobias Macey:
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
And then the other interesting work that you've done is, alongside the work on components, you've introduced this new dg CLI. You've introduced new ways of thinking about the structuring of your Dagster projects, and I'm wondering how that has changed the patterns that you've seen teams build with Dagster, and how that simplifies the work of bootstrapping new capabilities or new data assets within the overall platform implementation?
[00:35:11] Nick Schrock:
So I think what dg really provides, it's just shorthand for the CLI, is an opinionated project layout sort of inspired by the Ruby on Rails style. You scaffold things. You don't have to manually import them. You can enforce conventions because you can customize the scaffolding. So it creates the config file associated with the integration in the same exact spot in every single place where you instantiate it, and it has some schema prefilled or something. It sounds simple, but it actually really simplifies things. It just reduces decision fatigue a lot. You know, concretely for Dagster users, the user friction we wanted to solve with this as well is what I call the import circus, where, you know, you split all your stuff among a bunch of modules.
In order to construct the Dagster definitions object at the root of your project, you often have to do a bajillion imports, or else we have these facilities which kind of, like, dynamically load symbols from another Python module. And that was just a pain in the butt, and it made it really hard to reorganize a project. We instead just manage that for you. So with the new project layout, it's just way more elegant, both because there's less code importing stuff around and, very importantly, because it makes it dramatically easier to reorganize a project. Because, like, as you onboard new teams and stuff, you just wanna be able to move stuff around. And if you can make that seamless, that is great. It also makes it so that changes are far more localized, because only the file that moves gets changed, not, like, all the bajillion places that import it. That's like a trivial way of describing it, but it has a bunch of side effects. One of the hard things about building one of these projects is, like, how do I organize it? Do I subdivide the repo between teams?
Do I do it by the Dagster abstraction, or do I do it by some other dimension? Like, it's a multidimensional problem. What we found is that it's much, much more elegant, actually, to subdivide the project at the technology level, meaning that you have your dbt stuff here, your Sling stuff here, your Fivetran stuff here, your Kubernetes ad hoc jobs in this folder, because it allows you to localize all the technology specific complexity in a specific subfolder and reuse it there. And then the people who are dealing with other parts of the platform don't see it or think about it or anything. And, typically, when you're writing code in the data platform context, you're usually doing it in the context of a single technology.
Right? You're, like, going in and, like, I'm changing the dbt models or I'm changing this ingest. And the only time when the cross technology stuff matters a lot is when you're doing integration testing, reviewing it in the UI. So you organize the code by technology, but then you allow other cross cutting views in the UI or, like, in the output of the CLI tool. So I think that was an interesting insight. That is not obvious when you kind of first start building one of these platforms.
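The technology-subdivided layout Nick describes ends up looking something like this. The tree is illustrative only, not the exact structure the dg CLI scaffolds:

```
my_platform/
  defs/
    dbt/        # all dbt-specific complexity and wiring lives here
    sling/      # ingestion specs
    fivetran/
    k8s_jobs/   # ad hoc Kubernetes jobs
```

Each subfolder holds everything specific to that technology, so someone changing a dbt model never has to look at the ingestion folders, while the UI still presents the cross-cutting lineage view.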
[00:38:38] Tobias Macey:
And to that point of slicing along the technology boundaries, I think that also lines up fairly well with the data asset constructs, because any individual asset is largely going to be owned within a technology boundary. So a dbt model, a table that is generated from an Airbyte or a Fivetran ingest, the S3 object that gets generated from some external process. One of the other things that I've been building a lot around is the different resources and custom IO managers, and I'm wondering how that factors into the ways that dg and the project scaffolding think about the breakdown of the system, where maybe I have an IO manager specifically for handling file based objects in either object storage or local disk, or I have a resource that has a base module that is an OAuth client, but then different implementations of that for different APIs, and just how that gets used across these different submodules within the dg project scaffolds.
[00:39:48] Nick Schrock:
Yeah. Right now, if you don't use the project layout, or you don't use these kind of more advanced APIs, which are a bit hidden, you generally end up with a global dictionary of resources at the top of your project, and we wanted to get rid of that. So what dg allows you to do, which aligns very much with this organize-by-technology thing, is place the resources that are, like, relevant to the other things in that directory right next to them. And then you have one spot in your project that's like, okay, here's all the stuff that deals with this technology. And that's been, like, a big cognitive load reduction as well. But because it's hierarchical, it also allows you to put a resource at the right spot in the hierarchy. Because maybe you have some advanced resource that, like, talks to two technologies or something. Well, you put it at the parent folder, because it makes sense for it to be there.
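To ground that with the OAuth example from the question, the base client can be declared once and subclassed per API, living beside the folder that uses it. ConfigurableResource is Dagster's public resource base class, but the file placement and class names here are just one hypothetical arrangement:

```python
# defs/rest_apis/resources.py: a resource declared beside the technology
# that needs it, rather than in one global dictionary at the project root.
import dagster as dg


class OAuthAPIResource(dg.ConfigurableResource):
    """Base OAuth client; per-API subclasses point at different endpoints."""

    base_url: str
    client_id: str
    client_secret: str

    def fetch(self, path: str) -> dict:
        # Token acquisition, refresh, and retries would live here, written
        # once and inherited by every API-specific subclass.
        raise NotImplementedError


class CRMResource(OAuthAPIResource):
    """One concrete API built on the shared OAuth plumbing."""
```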
[00:40:50] Tobias Macey:
And then continuing on from the earlier conversation of the impact of AI, and the work that you're doing with components and dg to bring more opinionated constructs and more scoping to the problem, is the impact that generative AI capabilities have had on the actual creation and maintenance of code and systems. And I'm wondering how the work that you're doing in Dagster is designed to align with the needs of these AI systems for managing context, managing input, and enabling data engineers and application engineers to generate more of this automatically without necessarily having to have as deep domain knowledge of either Dagster or the specifics of the underlying technologies.
[00:41:44] Nick Schrock:
Yeah. 100%. I mean, we could probably talk for a few hours on this, so I'll try to keep it brief. But I guess, first of all, to set context on this, I have a pretty specific view of how you make a framework optimized for AI codegen. Because I think without proper constraints, AI is a hallucinating demon that is a technical debt superspreader. It can be a serious problem, and it's very easy when you're doing AI codegen to end up in a spot where you do not know how it's working. If there's a bug, the AI can't debug it, you can't debug it, and you're just in this unstable place, and you basically have to throw it away and start over. So I actually think it's very important to structure these systems to allow for the code to be ephemeral and disposable if things go wrong. That's one of the reasons why components was designed the way it was: there's, like, only one spot in the project that gets edited. And then if something goes wrong, you'd be like, okay. And because the cost of creation is so low, like, regenerating something isn't that big a deal. So I think these frameworks and AIs will have this reflexive relationship as they go along, and frameworks will have to be LLM native.
You know, I wrote this article last summer, actually, which influenced a bunch of my thinking, or kind of explained it, that led to components. And I call it the rise of what I'll call medium code. Meaning, it's not low code, it's not, like, no-code clicky stuff, but it's not full software engineering. You're writing a Turing-complete language or a complex declarative language like SQL, but you're doing it in a highly constrained way. And you're usually doing it in some coarse-grained container of code.
In dbt, that's a model. In Jupyter notebooks, that's a cell. And it limits the amount of context a human has to have in order to work properly. But it turns out one of the most interesting things about AI is that what's good for the human is good for the machine and vice versa. Like, you want obvious APIs. You want to limit the amount of context someone has to hold in their head, or literally the number of tokens that are in context, in order to do stuff well. But the key thing as well is that the code that gets generated needs to be precisely interpretable by both a machine and a human. Because if you have to debug stuff, or you have to bring someone in to help debug stuff, it needs to be something with deterministic behavior that is understandable.
So I won't go through the entire article because that's a whole thing. But, basically, I think that the right AI codegen targets have coarse-grained containers of code. They code to some high level framework or DSL that's precisely interpretable, but still part of the software development life cycle, because that is absolutely essential. You need to have guardrails to check to make sure the generated code is correct. You need guardrails for the AI slop. So that's how I consider how you need to think about designing frameworks for the AI native era.
And the components are designed with all this in mind: a high level framework with built in documentation that's customizable by the user. I think documentation will be viewed increasingly as a store of context. So documentation needs to change, meaning that the purpose of it is to provide context to the LLM. And so that's one of the reasons why we, like, focus so much on built in docs for components and allowing custom component authors to inject the domain specific context right there in your code and be able to allow the LLM to scrape it. Yeah. And then high quality error messages are critical to provide feedback.
But I think it's, like, just very important: the goal of building a good AI native framework is to dramatically accelerate the work of the software engineer, not to abstract them away. And I don't design like that just because I like software engineers and I don't want them to go away. That is true. So maybe there's some subconscious, you know, psychology going on there. But more importantly, I think in essence, it is correct. The reason why AI is so exciting is that, if done right, we can abstract away so much of the drudgery, so much of the toil of software engineering, and focus much more exclusively on what you uniquely have judgment on. So to me, the most exciting thing about AI is the ability to abstract away enormous swaths of incidental complexity that we didn't think was possible.
But in order to do that in a way that's effective and safe and actually leads to higher quality systems, the framework designers absolutely need to optimize for it.
[00:46:53] Tobias Macey:
And, also, the other element of working with these AIs effectively is that it removes us from, to your point, the drudgery of dealing with boilerplate, dealing with very narrowly scoped problems, and moves us up to thinking about what is the actual system level requirement and how do I get there so that the LLM can focus on those very narrow domains to be able to stick them together in a way that is composable.
[00:47:20] Nick Schrock:
Yeah. Exactly. It's all very exciting. You know? I hope that people can approach it from an abundance mindset rather than a fear based mindset. But I think it's more that it's a radical change, even if it's gonna be for the better. And that is stressful and anxiety inducing, and I totally get that. Like, in my own development, I am probably not as AI native as, like, I need to be. You know? And, you know, at this point, I'm an old man, so I need to really work on maintaining that brain plasticity to learn new stuff. So I get it, but it's also very exciting.
[00:47:56] Tobias Macey:
I take umbrage at that because I think we're the same age.
[00:48:01] Nick Schrock:
At this point in my life, I wake up very early, and my son always asks me, Dad, are you up because you're an old man? I'm like, yes. My son is six. He's very charming.
[00:48:13] Tobias Macey:
And so as you have been exposing these new capabilities, working with some of the early adopters of dg, the scaffolding, and the components interfaces, what are some of the most interesting or innovative or unexpected ways that you've seen those capabilities applied?
[00:48:29] Nick Schrock:
Yeah. Like I said, we're going with a fairly limited set of design partners. But even among that set, there's been a bunch of great stuff happening. One of our users is onboarding his Databricks-using stakeholders, data scientists who work in hosted notebooks in the Databricks environment. And, you know, the first thing he did was write his own custom component for that, and there's a REST API for running one of these things. Right? So you wrap a custom component around that. You can basically tell your data scientist, like, hey, if you wanna schedule it within the context of the data platform, copy and paste the notebook URL, put it in this YAML file, like, fill out these things. You're good to go. But then he realized the power of the customizable config system, and he started kind of putting all the DevOps stuff in there too. So configuring memory, how the integration with Datadog should work in the context of this thing, like, what metadata to put everywhere and stuff. So it ended up being this, like, kind of single spot where this intrepid data scientist integrating his notebook into the production data platform can control a bunch of different parameters. So I thought that was really cool. Another user went kinda, I would call it, hog wild on the number of custom components he built, and it was all sorts of crazy legacy systems, Talend and I think even Informatica, all this stuff. And so that was really heartening, that someone was able to churn out so many custom components so early in the life cycle of the system. And then another one that comes to mind is that the moment one of our users saw the new project layout and its kind of hierarchical nature, he also saw this capability we have where, kind of, like, in the YAML file at any point in the hierarchy, you can post-process all the definitions above it to, like, apply a common tag or apply the same metadata. And that allowed a really nice separation of responsibility, where the data platform engineer could programmatically apply governance information across the entire system very smoothly, because it kind of changes the way that you can abstract things. Because you basically tell your stakeholder, just put this tag on this asset. Okay? And then later down the line, the platform engineer can process that tag and decide how to interpret it and, like, produce all sorts of derivative metadata, like who's the owner and what team is it on, and, like, this piece of metadata indicates this policy, and I'm gonna write some other piece of code that queries that and makes decisions on it. So the fact that, you know, you show the capability, and then instantly the user is like, wow, I can use this to programmatically control all my governance in a smooth way. Like, that was really cool to see.
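That last pattern, where a stakeholder sets one simple tag and the platform derives the rest, boils down to a small post-processing function applied over every definition. The sketch below is plain Python with an invented policy table, not Dagster's YAML post-processing syntax, but it shows the division of labor:

```python
# Illustrative governance post-processing: stakeholders only set `domain`;
# the platform engineer derives owner, retention, and policy in one place.
POLICY_TABLE = {
    "finance": {"owner": "data-platform-finance", "retention": "7y", "pii": "true"},
    "growth": {"owner": "data-platform-growth", "retention": "90d", "pii": "false"},
}


def derive_governance_metadata(tags: dict[str, str]) -> dict[str, str]:
    """Expand a single stakeholder-supplied tag into full governance metadata."""
    domain = tags.get("domain", "unknown")
    derived = POLICY_TABLE.get(domain, {"owner": "unassigned", "retention": "30d"})
    return {**tags, **derived}


# A stakeholder wrote one line of YAML; the platform fills in the rest.
print(derive_governance_metadata({"domain": "finance"}))
```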
[00:51:22] Tobias Macey:
And then a natural outgrowth too of these components is that it gives you a reproducible and packageable abstraction that you could feasibly build a community library around, of here are all the different components that people have published for their use cases, so that somebody who is new to Dagster can come in and just pick and choose whatever it is that they want to LEGO brick their system together and get up and running.
[00:51:50] Nick Schrock:
That is the idea, and we didn't even talk about this beforehand, but you're setting me up perfectly. You know, one of the things we're building into this is the integrations marketplace, where we wanna be able to index integrations from our own monorepo, the community repo, as well as internally built components. So at, you know, an at-scale customer, we want the centralized data platform team to be able to publish components into a searchable index that has all sorts of metadata and, like, embedded docs and, like, copy-and-pastable code snippets for, like, how to install this thing. So, yeah, we really wanna kick off an ecosystem effect for these things.
[00:52:30] Tobias Macey:
And as you have been building these component capabilities, working with teams, helping them come to grips with how to accelerate their work with AI, how to build for AI with Dagster, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:52:50] Nick Schrock:
That's a good question. I think it is very important to balance innovation with change management, and that can be very challenging. You have to be careful to introduce things incrementally, to allow an incremental process that provides value at every step along the way. When you're adding new capabilities and you're asking users to make any sort of code change, you have to provide value instantaneously at every step and then communicate about it properly. And that can be a challenge. So there's that. And then the desire to incorporate stakeholders, and just how universal the need is to incorporate people into the data engineering process, continues to astound, actually. Data is so instrumental to so many people's jobs, and the person it's instrumental to is often the only person who understands it. Just allowing them to participate directly in the process reduces so much context switching and so many painful collaborative processes that there ends up being a huge organic demand for that sort of thing. So I think that's also been a lesson learned here.
[00:54:13] Tobias Macey:
And so as teams are trying to tackle their data challenges and ramping up on AI use cases, what are the cases where Dagster is the wrong choice?
[00:54:24] Nick Schrock:
Well, we're fundamentally a batch processing system, you know, one that can get to semi-real-time. But if you need microsecond latency, we're not the right tool for that. Likewise, highly dynamic computations that require loops and can't be structured as a DAG: systems like Temporal are more appropriate there. They make different trade-offs. We provide much more structure and constraints, built-in lineage, all this stuff. They have a much more complex state machine, but it is more generic, more imperative, and more flexible. So there is that use case as well. Those are the two that come to mind.
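As a rough illustration of the trade-off Nick describes, here is a minimal sketch of a DAG-structured pipeline using Dagster's public asset API; the asset names and logic are invented for illustration:

```python
# A minimal DAG-structured pipeline in Dagster.
# Asset names and logic are hypothetical.
from dagster import Definitions, asset


@asset
def raw_events() -> list[dict]:
    # In a real pipeline this would ingest from an external system.
    return [{"user": "a", "value": 1}, {"user": "b", "value": 2}]


@asset
def daily_summary(raw_events: list[dict]) -> dict:
    # Depends on raw_events via its parameter name; Dagster wires the edge.
    return {"total": sum(event["value"] for event in raw_events)}


defs = Definitions(assets=[raw_events, daily_summary])
```

A workflow that had to loop until some condition held, or branch on intermediate results at runtime, would not fit this declarative shape as naturally, which is where a more imperative engine like Temporal earns its keep.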
[00:55:07] Tobias Macey:
And as you continue to build and iterate on Dagster and on components, and as you continue to keep pace with the demands of the AI age that we're in, what are some of the things you have planned for the near to medium term of Dagster that you're excited to dig into?
[00:55:24] Nick Schrock:
Well, like I said, obviously I'm working with components, and I'm very excited to continue the journey of expanding that ecosystem. And this in-app editing is going to be amazing. It's in-app editing that I can get behind, because it still ends up checking the changes into source control and running tests on them. That provides us nice layering, so it can make stakeholders happy, and it can make the engineers happy. Most importantly, it can get good outcomes. So we have an entire long road map on the components front for that. In terms of other things I'm super excited about: Dagster is in a unique spot because we have meta-information on the integrations.
We have metadata on the actual definitions defined in code. We have operational metadata. We have all sorts of very interesting metadata. And I think we can evolve Dagster not just into a useful operational tool, but into a generalized context store for all sorts of tools, including our own, across the platform. When we were discussing this components stuff, it was like, okay, yes, we're going to design an abstraction that's good for AI-native codegen, but why else do we have the right to win here? And we have the right to win, in our opinion, because we have a unique view of all the context in the system: across every tool, across the way that you define your pipelines in code, all sorts of stuff. So I'm very excited not just to have AI accelerate the authoring of data pipelines, but to have Dagster's contextual information power AI use cases of all sorts. And I think we're in a great place to do that. I jokingly refer to it as the mother of all MCP servers, because we can aggregate the MCPs of all our integrations and ingest tons of information. Our dbt integration, for example, ingests the full model code. So we can provide that context directly in the same API where we provide the Python definitions of completely different technologies, as well as information about when this asset last failed, as well as information about what things are upstream of it. So we have a great opportunity to be a really compelling context store for AI tools operating on the data platform and across the business.
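To make the context-store idea concrete, here is a hypothetical sketch of the kind of consolidated payload such a service might hand an AI tool for a single asset. The function, field names, and shape are invented for illustration and are not a real Dagster or MCP API:

```python
# Hypothetical: the kind of consolidated context an orchestrator could
# serve to an AI tool for one asset. Field names are illustrative only.
def get_asset_context(asset_key: str) -> dict:
    return {
        "asset_key": asset_key,
        "definition": "Python or dbt source for the asset",  # code-level context
        "dbt_model_sql": "select * from {{ ref('raw_events') }}",  # ingested model code
        "upstream": ["raw_events"],  # lineage context
        "last_materialization": {"status": "FAILED", "at": "2025-06-01T06:00:00Z"},
        "owner": "data-platform-team",  # governance metadata
    }


# An LLM agent could load this into its prompt before attempting a fix.
context = get_asset_context("daily_summary")
```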
[00:57:43] Tobias Macey:
Are there any other aspects of the age of AI, the impact on engineering practices, the work that you're doing at Dagster, or any other related topics that we didn't discuss yet that you'd like to cover before we close out the show?
[00:57:52] Nick Schrock:
Well, there are any number of topics I have opinions on, but I feel like we've been talking for a while, and I don't want to overwhelm anyone. So I think we can wrap it up there. I guess it's a lot of change, but it is very, very exciting. I'm often a skeptic of such things, but these AIs are really doing incredible things I wouldn't have believed were possible even a few years ago. So it's pretty cool to see.
[00:58:23] Tobias Macey:
Yeah. I was skeptical at the outset as well, but I have been reasonably impressed in recent months with some of the capabilities and the ways that it can accelerate work to be done. So definitely
[00:58:37] Nick Schrock:
Claude is really good at codegen now, and I find that ChatGPT, with the o3 model, is really incredible for doing research on the Internet. And I also love how it now shows you what it's doing: hey, I'm fetching from here, I'm fetching from there. That transparency actually builds a lot of trust, so you can kind of check that it's doing things correctly. So, yeah, the use cases are pretty incredible.
[00:59:02] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your contact information to the show notes. And for the final question, I'd like to get your perspective on what you currently see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:18] Nick Schrock:
Ugh. You really had to ask me that. Well, it's hard coming on as a vendor and being asked that, because of course the thing I'm working on is the most important thing. I guess, to me, the missing thing is that all of the hyperscalers, and all the data hyperscalers, are trying to build walled gardens. We're maybe going back to a world where there are Databricks developers and Snowflake developers, and more importantly, there are Databricks companies and Snowflake companies and AWS companies and GCP companies. I think that is a bad state of the world for engineers, because you want people to be able to move flexibly between different companies and have their skills be portable, and we should still be striving for a world of open standards. I don't know if Dagster is the right tool. I mean, I think we can play a part, but there are other parts of the ecosystem that need to step up too. But I hope we live in a world that's less vertically integrated and more horizontally integrated.
And anyone who can help out with that, by building standards, building open source tools, making that story better: that is great. Because I think a world of five walled gardens is kind of a sad one.
[01:00:29] Tobias Macey:
Yeah. Absolutely. Well, thank you very much for taking the time today to join me as usual, and for all the great work that you're doing on Dagster and making it easier for folks to adapt to the changing ecosystem. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It means a lot. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story.
Yeah. Like I said, we're going with a fairly limited set of design partners. But even among that set, there's been a bunch of great stuff happening. One of our users is onboarding his Databricks kind of using stakeholders, data scientists who work in notebooks hosted notebooks in the Databricks environment. And he you know, the first thing he did was write his own custom component to and there's a REST API for running one of these things. Right? So you wrap a custom component around that. You can basically tell your data scientist, like, hey. If you wanna schedule it within the context of data platform, copy and paste the notebook URL, put in this YAML file, like, fill out these things. You're good to go. But then he realized the power of the customizable config system, and he started kind of putting all the DevOps stuff in there too. So configuring memory, how the integration with Datadog should work in the context of this thing, like, what metadata to put everywhere and stuff. So it ended up being this, like, kind of single spot where this intrepid data scientist integrating his notebook into the production data platform can control a bunch of different parameters of stuff. So I thought that was really cool. Another user went kinda I would call it hog wild on the number of custom components he built, and he was all sorts of crazy legacy systems and talent and I think even inform all this stuff. And so that was really heartening that someone was able to kinda churn out so many custom components as early in the life cycle of the system. And then another one that comes to mind is that the moment that one of our users saw the new project layout and it's kind of hierarchical nature, he also saw this capability we have. We're kind of, like, in the YAML file at any point in the hierarchy. You can kind of post process all the definitions that were above it to, like, apply a common tag or apply the same metadata. And that allowed a really nice separation of responsibility where the data platform engineer could programmatically apply governance information across the entire system very smoothly because it kind of changes the way that you can abstract things. Because you basically you tell your stakeholder, just put this tag on this asset. Okay? And then later down the line, the platform engineer can process that tag and then decide how to interpret it and, like, produce all sorts of derivative metadata and, like, who's the owner and what team is it on and, like, this piece of metadata indicates this policy, and I'm gonna write some other piece of code that queries that and makes decisions on it. So the fact that, like, the you know, you show the capability, and then instantly, the user is like, wow. I can use this to programmatically control all my governance in a smooth way. Like, that was really cool to see.
[00:51:22] Tobias Macey:
And then a natural outgrowth of these components is that it gives you a reproducible and packageable abstraction that you could feasibly build a community library around: here are all the different components that people have published for their use cases, so that somebody who is new to Dagster can come in and just pick and choose, à la carte, whatever it is that they want to LEGO-brick their system together and get up and running.
[00:51:50] Nick Schrock:
That is the idea, and we didn't even talk about this beforehand, but you're setting me up perfectly. One of the things we're building into this is an integrations marketplace, where we want to be able to index integrations from our own monorepo and the community repo, as well as internally built components. So at an at-scale customer, we want the centralized data platform team to be able to publish components into a searchable index that has all sorts of metadata, embedded docs, and copy-and-pasteable snippets for things like how to install the component. So, yeah, we really wanna kick off an ecosystem effect for these things.
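As a sketch of what one entry in such a searchable index might carry, here is a hypothetical data shape; the field names, the component name, and the install command are all illustrative assumptions, not Dagster's actual marketplace schema.

```python
# Hypothetical shape for one entry in a searchable component index.
# Field names and values are illustrative, not a real Dagster schema.
from dataclasses import dataclass, field

@dataclass
class ComponentIndexEntry:
    name: str                        # e.g. "databricks_notebook"
    source: str                      # "monorepo" | "community" | "internal"
    description: str                 # embedded docs shown in the marketplace
    install_snippet: str             # copy-and-pasteable install command
    tags: list[str] = field(default_factory=list)

entry = ComponentIndexEntry(
    name="databricks_notebook",
    source="internal",
    description="Schedule a hosted Databricks notebook as a Dagster asset.",
    install_snippet="uv add my-org-dagster-components",
    tags=["databricks", "notebooks"],
)
```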
[00:52:30] Tobias Macey:
And as you have been building these component capabilities, working with teams, helping them come to grips with how to accelerate their work with AI, how to build for AI with Dagster, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:52:50] Nick Schrock:
That's a good question. I think it is very important to balance innovation with change management, and that can be very challenging. You have to be very careful to introduce stuff incrementally, to allow an incremental process that provides value at every step along the way. When you're adding new capabilities and you're asking users to do any sort of code change, you have to provide value instantaneously at every step and then communicate about it properly. And that can be a challenge. So there's that. And then the desire to incorporate stakeholders, just how universal the need is to incorporate people into the data engineering process, continues to astound, actually. Data is so instrumental to so many people's jobs, and the person it's instrumental to is often the only person who understands it. Just allowing them to participate directly in the process removes so much painful context switching and collaborative overhead that there ends up being huge organic demand for that sort of thing. So I think that's also been a lesson learned here.
[00:54:13] Tobias Macey:
And so as teams are trying to tackle their data challenges and ramping up on AI use cases, what are the cases where Dagster is the wrong choice?
[00:54:24] Nick Schrock:
Well, we're fundamentally a batch processing system that can get to semi-real-time. But if you need microsecond latency, we're not the right tool for that. And for highly dynamic computations that require loops and can't be structured into a DAG, systems like Temporal are more appropriate. They make different trade-offs: we provide much more structure and constraints, built-in lineage, all this stuff, whereas they have a much more complex state machine that is more generic, more imperative, and more flexible. So there is that use case as well. Those are the two that come to mind.
[00:55:07] Tobias Macey:
And as you continue to build and iterate on Dagster and on components, and as you continue to keep pace with the demands of the AI age that we're in, what are some of the things you have planned for the near to medium term of Dagster that you're excited to dig into?
[00:55:24] Nick Schrock:
Well, like I said, obviously I'm working on components, and I'm very excited to continue the journey of expanding that ecosystem. And this in-app editing is gonna be amazing. It's in-app editing that I can get behind, because it actually still checks the stuff into source control and runs tests on it. That provides us nice layering, so it can make stakeholders happy, and it can make the engineers happy. Most importantly, it can get good outcomes. So we have an entire long roadmap on the components front. In terms of other things I'm super excited about, Dagster is in a unique spot because we have meta-information on the integrations.
We have metadata on the actual definitions defined in code. We have operational metadata. We have all sorts of very interesting metadata. And I think we can evolve Dagster not just into a useful operational tool, but into a generalized context store for all sorts of tools, including our own, across the platform. When we were discussing this components stuff, it was like, okay, yes, we're going to design an abstraction that's good for AI-native code generation, but why else do we have the right to win here? And we have the right to win, in our opinion, because we have a unique view of all the context in the system: across every tool, across the way that you define your pipelines in code, all sorts of stuff. So I'm very excited to not just have AI accelerate the authoring of data pipelines, but to have Dagster's contextual information power AI use cases of all sorts. And I think we're in a great place to do that. I jokingly refer to it as the mother of all MCP servers, because we can aggregate the MCPs of all our integrations and ingest tons of information. Like, in our dbt integration, we ingest the full model code, right? So we can provide that context directly in the same API where we can provide the Python definitions of completely different technologies, as well as information about when this last failed, as well as information about what things are upstream of this. So we have a great opportunity to be a really compelling context store for AI tools operating on the data platform and across the business.
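As a rough illustration of that "mother of all MCP servers" idea, here is a hedged sketch of what assembling such a context payload might look like. Every loader function below is a hypothetical stand-in with fabricated example values; none of this is an actual Dagster API.

```python
# A hypothetical sketch of the "context store" idea: assemble everything
# the orchestrator knows about an asset (definition code, operational
# history, lineage) into one payload an AI tool can consume.
# All loader functions are stand-ins, not real Dagster APIs.
from typing import Any

def load_definition_metadata(asset_key: str) -> dict[str, Any]:
    """Stand-in: e.g. the dbt model source ingested by a dbt integration."""
    return {"model_sql": "select * from raw.orders", "owner": "team:revenue"}

def load_operational_metadata(asset_key: str) -> dict[str, Any]:
    """Stand-in: e.g. recent run history and failures."""
    return {"last_failure": "2025-06-01T04:12:00Z", "avg_runtime_s": 92}

def load_lineage(asset_key: str) -> list[str]:
    """Stand-in: keys of upstream assets."""
    return ["raw/orders"]

def build_ai_context(asset_key: str) -> dict[str, Any]:
    """One payload combining code-level, operational, and lineage context."""
    return {
        "asset": asset_key,
        "definition": load_definition_metadata(asset_key),
        "operations": load_operational_metadata(asset_key),
        "upstream": load_lineage(asset_key),
    }

print(build_ai_context("analytics/orders"))
```

The point of the sketch is the aggregation: each source of metadata already exists somewhere in the platform, and the value comes from serving them through a single interface.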
[00:57:43] Tobias Macey:
Are there any other aspects of the age of AI, the impact on engineering practices, the work that you're doing at Dagster, or any other related topics that we didn't discuss yet that you'd like to cover before we close out the show?
[00:57:52] Nick Schrock:
Well, there are any number of topics I have opinions on, but I feel like we've been talking for a while, and I don't wanna overwhelm anyone. So I think we can wrap it up there. It's a lot of change, but it is very, very exciting. I'm often a skeptic of such things, but these AIs are really doing incredible things I wouldn't have believed were possible even a few years ago. So it's pretty cool to see.
[00:58:23] Tobias Macey:
Yeah. I was skeptical at the outset as well, but I have been reasonably impressed in recent months with some of the capabilities and the ways that it can accelerate work to be done. So definitely
[00:58:37] Nick Schrock:
Claude is really good at codegen now, and I find that ChatGPT, with the o3 model, is really incredible for doing research on the Internet. And I also love how it now shows you what it's doing: hey, I'm fetching from here, I'm fetching from there. That transparency actually builds a lot of trust, so you can kind of tell that it's doing stuff correctly. So, yeah, the use cases are pretty incredible.
[00:59:02] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your contact information to the show notes. And for the final question, I'd like to get your perspective on what you currently see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:18] Nick Schrock:
Ugh. You really had to ask me that. Well, it's hard coming on as a vendor and answering that, because of course the thing I'm working on is the most important thing. I guess, to me, the missing thing is that all of the hyperscalers and all the data hyperscalers are trying to build walled gardens. We're maybe going back to a world where there are Databricks developers and Snowflake developers, and more importantly, there are Databricks companies and Snowflake companies and AWS companies and GCP companies. I think that is a bad state of the world for engineers, because you want people to be able to move flexibly between different companies and have their skills be portable, and we should really still be striving for a world of open standards. So, you know, I don't know if Dagster is the right tool. I mean, I think we can play a part, but there are other parts of the ecosystem that need to step up too. But I hope we live in a world that's less vertically integrated and more horizontally integrated.
Anyone who can help out with that by building standards, building open source tools, making that story better, that is great. Because a world of five walled gardens is kind of a sad one.
[01:00:29] Tobias Macey:
Yeah. Absolutely. Well, thank you very much for taking the time today to join me as usual, and for all the great work that you're doing on Dagster and on making it easier for folks to adapt to the changing ecosystem. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It means a lot. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Nick Schrock and Dagster Labs
Impact of AI on Data Platforms
Dagster's Evolution and New Features
Components and Modularization in Dagster
AI's Influence on Data and Application Engineering
Project Structuring with the dg CLI
AI Code Generation and Medium Code
Lessons from Implementing Components
Future Directions for Dagster