Build Maintainable And Testable Data Applications With Dagster - Episode 104


Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.


Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to construct the infrastructure themselves. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit for more information.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit today to find out more.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications


  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Dagster is and the origin story for the project?
  • In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for?
  • Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?
    • What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?
    • What are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?
  • For someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?
  • For someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?
    • Once they have something working, what is involved in deploying it?
  • One of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or "solids". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?
  • Another difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?
  • It is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?
  • Can you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?
  • What are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?
  • Your recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?
  • You have built a new company around Dagster in the form of Elementl. How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?
  • What should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?
    • What is on your roadmap that you consider necessary before creating a 1.0 release?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?


The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you'll get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events and to take advantage of our partner discounts to save money when you register. Your host is Tobias Macey, and today I'm interviewing Nick Schrock about Dagster, an open source system for building modern data applications. So Nick, can you start by introducing yourself?
Nick Schrock
Yeah, thanks for having me, Tobias. My name is Nick Schrock. I'm the founder of a company called Elementl, and our current project is, as you mentioned, this open source framework for building data applications, which is kind of the term we use for describing systems like ETL pipelines and ML pipelines, and I'm sure we're going to get into that. Before Elementl and Dagster, the bulk of my career was spent at Facebook, where I worked from 2009 to 2017, and for most of my career there I worked on a team that I formed called product infrastructure, whose job it was to produce technology to empower our product developers and the users that they serve. That team ended up producing some open source artifacts of note, namely React, which I had nothing to do with, but I worked next to those folks for years, and then GraphQL, which I was one of the co-creators of.
Tobias Macey
And so from that, can you explain how you first got involved in the area of data management?
Yeah, absolutely. So I left Facebook in February of 2017, which is actually a little over two years ago. I took some time off, but I was thinking about what I was going to do next, and I started talking to people across various industries, because I was actually looking for almost a non-tech industry to work in that needed tech help — meaning a legacy industry, like healthcare or finance or those types of industries. And as I was talking to people across various companies and organizations, I would ask them what their primary technology challenges were. And data engineering, data integration, doing ML pipelines, analytics, etc., kept on coming up over and over again. I would then go to practitioners in the field and ask them, hey, show me what your workflow looks like, what your tools work like. And, you know, there are amazing compartments of technology in the sector, but when you look at the developer experience — or what I call the builder experience, because it's not just software developers; analysts and data scientists also participate in this — coming from someone with my background, in what I call the full hipster stack, meaning React, GraphQL, and all the associated technologies, the aesthetics and tooling just did not have the quality that I was accustomed to. And then you would go back and talk to these business leaders after talking to their engineers, and they would say something like, listen, our ability to transform healthcare, we think, is actually limited by our ability to do data processing. And I remember this meeting distinctly, and I was like, wait, wait, wait, you're telling me —
what you think is preventing you from transforming American healthcare is the ability to do the moral equivalent of regularly scheduled computations over CSV files? And they're like, yeah, that's probably it. And I was just like, this is crazy. And that kind of started me on the path of looking into this.
Tobias Macey
And given the fact that you didn't have a lot of background context in data management and data engineering before that, I'm curious how you managed to get up to speed and get so embroiled in the overall space of data engineering and data tooling, and how you identified where to approach the problem.
I mean, it's just pure immersion. I just started reading and consuming as much material as possible and talking to as many people as possible. And, yeah, I knew it was time to go back to work — I was actually on my honeymoon with my wife, and we were on a train, and I was reading Matei Zaharia's PhD thesis — he's the creator of Spark — and she was like, Nick, this is ridiculous, put that paper down; you're going back to work when we get home. And, you know, Tobias, your podcast and podcasts like it have been utterly invaluable, and through those podcasts I was also able to connect with like-minded people and really get their feedback and understand what they were doing. I particularly loved your episode with Chris Bergh about DataOps, for example; I thought that was super insightful. But effectively it was just — when you start learning something, remember that everyone in the history of the world who knows something at some point did not know that thing. So you just put one foot in front of the other, start reading every single thing out there, talk to every person that you know about it, and then just start building and experimenting with stuff.
Tobias Macey
So from all of that, you ended up creating the Dagster project. I'm wondering if you can just explain a bit about what it is and some of the early steps of getting along that path and understanding how to approach the problem.
Yeah, so a lot of this comes from — you know, everything is biased and seen through the lens of your previous experiences. So I was definitely trying to think about the design principles that led to things like GraphQL that I thought were applicable in this space. And I started to think a lot about why programming in data management, broadly, feels so different, and what the properties are that make seemingly standard software engineering practices end up being different in this domain than in a traditional application domain. As I was thinking about that, one of the properties of these systems that stuck out to me is the relationship between the computation and the underlying data. Meaning, in a traditional application, you have, let's say, a single database table, and that is manipulated in a transactional fashion: there are lots of different pieces of software and entry points into the system that are mutating that table — this user updates this setting from this endpoint, another user updates another setting from that endpoint — so that's shared state. One of the big things that is different about this domain of computation is that typically there's a one-to-one correlation between a data asset and the computation that produced it. Meaning that if there's a data lake somewhere, and there's a set of Parquet files being produced — or, to simplify, just a single Parquet file — typically there has only been one logical computation producing that thing throughout time. You have a function somewhere — let's say, in a very abstract sense, a piece of computation like a Spark job — and it's been producing daily partitions of Parquet files over and over and over again.
And there's a one-to-one correlation between that set of partitions in the data lake and the computation that produced it. And you can actually generalize that to almost anything: all of these systems, whether you call them ETL pipelines or ML supervised learning processes or whatever, are typically just DAGs of functions that consume and produce data assets. And what was really interesting about focusing on the computation itself is that it is actually, in some ways, a more essential definition of the data than the data itself. Let me give you an example. Imagine that you had a computation that said, hey, I am a computation, and I produce a sequence of tuples of strings and ints. And imagine that you could, in a really standardized way, instruct that computation to conditionally either generate a CSV with that schema or a JSON file with that schema. In reality, it is the computation that is the source of truth there, and not the produced CSV or JSON file. So it was kind of this: hey, why don't we start focusing on attaching metadata and a type system and a standardized API around these broad computations instead of the data itself? And that was the fundamental insight that led to the project.
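The idea Nick describes here — the computation, with its declared output schema, as the source of truth, and the serialization format as a swappable detail — can be sketched in plain Python. This is only an illustrative sketch of the concept, not Dagster's actual API; all names here are hypothetical:

```python
import csv
import io
import json
from typing import List, Tuple

# The logical computation: its declared schema (tuples of str and int)
# belongs to the computation's definition, not to any one file format.
def compute_rows() -> List[Tuple[str, int]]:
    return [("alice", 30), ("bob", 25)]

# Serialization is a pluggable detail layered on top of the computation:
# the same logical output can be materialized as CSV or JSON on demand.
def materialize(rows: List[Tuple[str, int]], fmt: str) -> str:
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["name", "count"])
        writer.writerows(rows)
        return buf.getvalue()
    elif fmt == "json":
        return json.dumps([{"name": n, "count": c} for n, c in rows])
    raise ValueError(f"unknown format: {fmt}")
```

Both `materialize(rows, "csv")` and `materialize(rows, "json")` carry the same information; the function `compute_rows`, not either file, is the essential definition of the data.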
Tobias Macey
And in the tagline you use the term modern data application for Dagster, as opposed to some of the other terms that somebody might use or be familiar with, such as an ETL framework or a framework for building data pipelines. And I'm just wondering if you can describe your thinking in terms of what you mean when you say data application, and some of the main types of use cases that Dagster is well suited for. Totally.
So let's frame this by talking about the term ETL. Again, I think this is part of the benefit of me coming in fresh to this about a year and a half ago and assessing: what is all this different terminology that's used, and why does it exist? So specifically, let's talk about the term ETL: extract, transform, load. Its historical etymology is — the traditional setup is, say, Oracle systems: you have a transactional database on one side and a data warehouse on the other side, and every night you do a one-time transformation that extracts data out of the transactional database, does some computation on it, and then loads it into a data warehouse. What people call ETL today looks nothing like that. It looks absolutely nothing like that, meaning that it is typically multi-stage: it has multiple stages and materializations, and it typically passes through multiple different computational engines — the ingest might be through Kafka or a tool like Fivetran, and then it might sit in a data warehouse for a while, or maybe Spark will operate on it, and then different systems. So the term ETL is no longer attached to its original definition; when people say ETL today, they effectively mean any computation done in the cloud. And the other thing — and the reason why we're interested in capturing a new term, data application, for these — is that I believe ETL pipelines, data pipelines, and supervised learning processes are all, in effect, the same kind of system. They are graphs of compute that consume and produce data assets. Within every ML pipeline is an ETL pipeline; there's just one additional step that produces a model. And the other thing — and this also comes from the origin story — is that I really view this domain as being in a similar spot to where front-end engineering was about ten years ago.
And back then — the reason this analogy comes up is that if you talk to anyone in data today, they'll say something like: I spend 10 to 20% of my time actually doing my job and 80 to 90% of my time data cleaning. When I kept on hearing this from people, it started to give me these flashbacks to talking to front-end engineers in, say, 2010 within Facebook, who would say: I spend 80 to 90% of my time fighting the browser and 10 to 20% of my time building my app. And React really changed the world on that front. One of the things it did is it no longer thought of the front end as a sequence of scripts that are stitched together, where you touch them once and never talk to them again. It's like, hey, these are really complicated pieces of software; we need a framework that respects the problem and the discipline and lives up to the inherent complexity in those apps. I think that data is in a similar spot. These are no longer just five scripts stitched together in a DAG that you have to run once a day — those exist, but in reality we are in a much more complicated world. What were hitherto known as ETL pipelines are much more intermixed with the business logic of your application, meaning often you're doing transformations that stream data back into the app, and there's kind of a reflexive relationship between the data pipeline and the core behavior of your front-end application. So these are just far more complicated things now. And I think you need to think of them as applications, meaning they're alive all the time.
They have uptimes; there are complicated relationships within them that you have to think about. It's not just a one-off script — you have to respect the problem: write testing for it, and really start to model these things in a more robust way that's amenable to human inspection, human authoring, and tooling. And so that's why we're referring to these things as data applications. Because data applications are multidisciplinary — they're authored by analysts and data engineers and data scientists together — and I think that the siloing in that world is partially caused by the siloing of terminology, actually, but they're all collaborating on the same thing, to me.
Tobias Macey
And I would also argue that application engineers are starting to bleed into the data engineering lifecycle as well, where previously the ETL engineer or data engineer would be responsible for pulling information from the system of record that the application uses. But as we get to more real-time needs and the requirement of incorporating data as it's being generated, the application engineer needs to be aware of what the overall systems are that are able to process that data downstream, particularly with the introduction of systems such as Kafka as the sort of centralized system of record for the entire application ecosystem, both for the end-user applications and for the data applications. And so I think it makes sense to have this unified programming framework that everybody can understand and that everybody can work together on, rather than having them be componentized and monolithic and fully vertically integrated.
I couldn't agree more. And — this is why I mentioned your interview with Chris Bergh about DataOps earlier — you're describing, in different words, the DataOps vision. The analogy is DevOps: there used to be silos between developers and operations, and now developers are responsible for operations to some degree, in that there's a programming model where they can program the ops. And I think we need to move to a similar world here, where you can have self-contained teams that are responsible for building the app, deploying it, and also integrating it with your data applications internally. Because the people who wrote the apps know the most about the domain of their data; we shouldn't be living in a world where an application developer can wake up willy-nilly and change their data model and then break everyone else without being responsible for that.
Tobias Macey
So can you take a bit of time now to talk a bit about how Dagster itself is architected, and some of the ways that it's evolved since you first began working on it?
Totally. So, you know, if you look at Dagster — I think someone once told me, oh, this looks like fancy Luigi. So at first blush it definitely looks like a fairly traditional ETL framework. I think what distinguishes it, and how it's architected, is that we are very focused on allowing the developer to express what the data application is doing rather than just how it is executed. If you look at something like Airflow, the primary abstractions there are operators, from which you create tasks, and then you build a dependency graph. If you open up that UI, all you see is a series of nodes and edges; those nodes have a single string that describes what each one is, and then there are edges between them — and that's all the information that you have. The goal of the system is to orchestrate and ensure that those computations complete, and that you can retry them and things of that nature. Dagster's primary focus — although we do some of that execution as well, and we'll get into that — is enabling the developer to express, at a higher level of abstraction, what those computations are doing. There's a type system that comes with Dagster. When you write a solid, we say that, hey, every single node in the graph is actually a function that consumes something and produces something, and you should be able to express that. You should also be able to overlay types on top of that, so that you can do some data checking as things enter and exit the nodes, as well as express to the tooling exactly what is going on with this thing. These things can also express how they get configured: we have strongly typed configuration, too.
And then as the computations proceed, they actually inform the enclosing runtime about what's been happening, meaning: hey, I produced this output; hey, I actually created a materialization that outlives the scope of the computation; I just passed this data quality test; etc. So our focus is much more on what you might call the application layer for data management.
Nick Schrock
And that is the primary focus of our programming model.
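The programming model Nick describes — nodes as typed functions, with checks that run as data enters and exits each node, and nodes that report events back to the enclosing runtime — can be sketched roughly as follows. This is a hypothetical illustration of the idea, not Dagster's real API; the names `Node` and `run_pipeline` are invented for this sketch:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

# A hypothetical "solid"-like node: a named function with declared
# input/output checks that run as data enters and exits the node.
@dataclass
class Node:
    name: str
    fn: Callable[[Any], Any]
    input_check: Callable[[Any], bool] = lambda x: True
    output_check: Callable[[Any], bool] = lambda x: True
    events: List[str] = field(default_factory=list)

    def run(self, value: Any) -> Any:
        if not self.input_check(value):
            raise TypeError(f"{self.name}: bad input {value!r}")
        result = self.fn(value)
        if not self.output_check(result):
            raise TypeError(f"{self.name}: bad output {result!r}")
        # Nodes inform the enclosing runtime about what happened.
        self.events.append(f"{self.name} produced {result!r}")
        return result

# A tiny linear "pipeline": run the nodes in dependency order,
# threading each node's output into the next node's input.
def run_pipeline(nodes: List[Node], value: Any) -> Any:
    for node in nodes:
        value = node.run(value)
    return value
```

For example, a node `Node("double", lambda x: x * 2, input_check=lambda x: isinstance(x, int))` would reject a string input before its function ever runs, which is the kind of contract enforcement between computation nodes that the interview discusses.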
Tobias Macey
And since Dagster itself is focused more on the programming model and isn't vertically integrated, as I mentioned before, as opposed to tools such as Airflow that people might be familiar with, I'm curious how that changes the overall workflow and interaction of the end users of the system, and what your reasoning is for decoupling the programming layer from the actual execution context. Yeah. So,
you know, the world of infrastructure is changing a lot. Let's talk about Airflow specifically. Airflow is a very vertically integrated system: it has a UI, it has an execution and cluster management aspect to it, and it also has its user-facing API, such as it is. And because it's not layered as much, they haven't been able to move as quickly. For example, Airflow still doesn't really have a coherent API layer such that you could build on — and really move quickly on — the front end of that system in a decoupled way. But I think what's more interesting is that the world of infrastructure is just changing a lot. And, to go back to the previous comment, Dagster's primary concern is what the data applications are doing, rather than exactly how they're doing it. The how — what the systems are physically doing — is going to be changing a ton over time. I think there are going to be lots of different physical orchestration engines as new cluster computing primitives come along. Just for example, out there there's Dask, which you can use for cluster management if you want to stay Python-native. And obviously people are really interested in computational workloads on Kubernetes, but I don't think Kubernetes will be the end-all of infrastructure for all time. So I just think that world is moving very, very quickly. And you also want to be able to use a new software abstraction on existing legacy infrastructure. This, in some ways, comes from my experience working with GraphQL. One of the things I was really pleasantly surprised about in open sourcing GraphQL was just how effectively it penetrated legacy enterprises.
And the reason why is that GraphQL is a pure software abstraction that you can overlay on any programming language, any runtime, any storage engine, any ORM. That meant if your front-end people wanted to go in and use GraphQL, but overlay it on top of some legacy IBM WebSphere something-or-other, you could actually have someone write a GraphQL server which interacted with that thing. And that was an extraordinarily powerful operating modality for an abstraction — to really have a lot of impact not just amongst the community building greenfield apps, but on an industry-wide scale. So we approach this in the same dimension, of having what we like to call a horizontal, opinionated platform. Meaning that, yes, these are just DAGs of functions — and by functions I mean coarse-grained computations, like a Spark job, a data warehouse job, or any sort of legacy process that you have in your system. You should be able to orchestrate those computations on arbitrary compute based on your needs and your requirements. But regardless of what's actually doing the compute, and what's physically doing the orchestration, there's still a ton of commonality between all those things, and that's where the "what" part of what Dagster describes comes in — it has types, metadata, etc., and we have common tools that can operate over all of that. And you can actually see this trend of moving away from, and unbundling, integrated stacks across a few domains of computing, all the way from content management systems to other systems. And I think this is part of that.
Tobias Macey
Yeah, I think having these different composable layers provides a lot more longevity for each of those layers independently, because, as you said, today you might want to use Airflow as your actual execution context; tomorrow it might be Spark because your scaling needs have evolved; or maybe there's some new framework coming out that you want to be able to leverage, but you don't want to have to rewrite all of your computation just because you're running on a different execution engine.
Nick Schrock
Yeah, and if nothing else, the other thing that I really noticed coming at this industry fresh is just how heterogeneous and fractured it is. Meaning that when you have teams building these data applications, even in the simplest case of a fairly coherent, typical pipeline, you're crossing three or four technology boundaries in dealing with these things. And in a legacy organization, where things are complicated, or maybe they've done acquisitions and such, the data infrastructure heterogeneity is absolutely mind-boggling. So having this single opinionated layer that describes what's going on, in a way that can integrate with both legacy computational engines and legacy infrastructure, we think is really powerful.
Tobias Macey
And just quickly, I'm curious what your evaluation process was to determine that Python was the right implementation target for Dagster, and what other language runtimes or frameworks you might have considered in the process.
Nick Schrock
So, when it comes to languages, I'm a pragmatist. For these types of systems, you want a wide variety of personas interacting with it successfully while still being able to build so-called real software, and in the data domain I don't think there's any choice but to use Python. Python has a lot of good things going for it. One, everyone in data is accustomed to it. Two, it's highly expressive, which for these kinds of metadata, metaprogramming-type frameworks is extremely useful. And I think the reason why it's been successful in this domain is that, as a programming language, it has just a vast dynamic range. Meaning that you can grab anyone who is proficient enough to do something complicated in Excel, put them in a Jupyter notebook, and they can do meaningful work. But you can also build Instagram on top of Python. That's Python's superpower. So what are the other choices in the data domain? There's Scala, but Scala does not have that dynamic range: you cannot plop an Excel user into a Scala program and expect them to be successful. As for what other languages I should have considered, it's one of those things where I don't think I even really considered another language, because to me the choice was so obvious.
Tobias Macey
And then going into Dagster, I'm curious what some of your main assumptions were, and how those have been challenged or updated as you have put Dagster in front of more end users.
Nick Schrock
Yeah, that's a great question, and actually it's kind of difficult to go back in time and reconstruct exactly what my thinking, and then the team's thinking, has been at every step along the way. Most recently — and I still think this was the correct architectural decision — we were initially very focused on the use case of: hey, I'm a team, I have an Airflow cluster, and I want a higher-level programming model on top of Airflow, such that the people on my team are not manually constructing Airflow DAGs but are programmatically generating those DAGs from some other API — in our case, Dagster. It was a pattern we saw over and over again: most Airflow shops of sufficient complexity have built their own layer on top of Airflow that, for whatever reason specific to their domain or their context, programmatically generates Airflow DAGs. So we were really focused on that incremental adoption case. But a lot of our early users came to us and said: hey, I think it's really cool that you have this Airflow integration, and it proves that the system is interesting and generic and that we won't be locked into anything. But for right now, I just really like your front-end tools, and I want to be able to build my greenfield app on top of this in a one-click, one-stop-shop sort of way. And that's actually what we've been working on for the last couple of months: coming up with a Dagster-native, vertically integrated instantiation of a Dagster system that has a scheduler and a lightweight execution engine, along with some DevOps tools. We have a library called dagster-aws, and it spins up a node in AWS for you, spins up an RDS database, and you can go from hello world to a scheduled job in about two minutes, with a beautiful hosted web UI to monitor your productionized apps. So we started out with this horizontal, integrate-with-everything, don't-be-super-opinionated approach; now we do have one opinionated instantiation, this out-of-the-box solution, but the architecture is still there to integrate with other systems and other execution contexts. So I think we've changed our initial target market. The other thing is that this started out as a much more vanilla ETL framework, and the insight that allowed it to eventually target different execution engines has definitely been an evolution — that thinking has changed along the way. But I would have to think about other things in order to answer that question more fully.
Tobias Macey
And then for somebody who wants to extend Dagster — either integrate it with other systems that they're running, add new capabilities to it, or implement their own scheduler logic — what are the different extension and integration points that Dagster exposes?
Nick Schrock
Sure, so we can go through those one by one. For example, say you want to use a new compute engine. Let's say you're using Spark, but you really want to experiment with a newer system that does distributed computation — not a Spark successor, but similar — called Ray, for example. If you want to be able to use Ray within Dagster, all you would do is write one of these things we call solids that generically wraps kicking off a Ray job. All you need to do is look at the way we integrate with Spark and data warehouses today and use those as patterns to build your own solids, and you're off to the races. Literally wrapping existing computational frameworks is relatively straightforward, and you can cargo-cult that from our open source repo. Another example you had was: I want to be able to use my own scheduler. Well, Dagster is fully built on a GraphQL API, so the system is very pluggable. It would actually be very straightforward to implement your own scheduler logic, because all you need to do is, based on some schedule, execute a GraphQL mutation against your hosted installation, and you'd be able to enqueue jobs to be run. In terms of executing this on a new orchestration engine, we also have a pluggable API for that, and all those examples are also checked into our open source repo. Right now we have integrations with Dask and with Airflow, where effectively we've written code that allows you to take a Dagster representation of a pipeline and compile that into either an Airflow DAG or a Dask graph. If you wanted to use another execution engine, you would just mimic that process.
So the system is designed for pluggability through and through.
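To make the "compile a pipeline into another engine's representation" idea concrete, here is a minimal, purely illustrative sketch. None of the names are the real Dagster, Airflow, or Dask APIs; the pipeline is just a dependency mapping, and "compiling" it means emitting the edge list that most DAG engines accept.

```python
# Hypothetical sketch: a pipeline held as a dependency mapping, "compiled"
# into an Airflow-style edge list. Names are illustrative only.
from graphlib import TopologicalSorter

# solid name -> set of upstream solid names
pipeline = {
    "ingest": set(),
    "transform": {"ingest"},
    "load": {"transform"},
}

def compile_to_edges(deps):
    """Emit (upstream, downstream) pairs, the shape most DAG engines accept."""
    return sorted((up, down) for down, ups in deps.items() for up in ups)

def execution_order(deps):
    """A valid linear order, as a single-process local executor would run it."""
    return list(TopologicalSorter(deps).static_order())

edges = compile_to_edges(pipeline)
order = execution_order(pipeline)
```

The same in-memory definition feeds both targets: an orchestrator gets the edges, a local executor gets the topological order, which is the essence of keeping one representation and many execution backends.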
Tobias Macey
And another component of a data environment that somebody might want to integrate Dagster with is their metadata engine, to keep track of data provenance and identify what transformations are happening. I'm curious what would be required for somebody to extract all of the task metadata to integrate into such a system.
Nick Schrock
Yeah, so that's a great question. The system is definitely designed with that in mind, meaning that whenever you execute a solid, you know what the input and output types are. But in addition to that, those solids can also communicate: hey, I have created what we call a materialization that will outlive the scope of the computation. You can subscribe to those events via GraphQL subscription, or you can just consume them with our Python API. What that allows you to do is build a tool which, consuming that stream of events, has an enormous amount of context about what's going on: it knows when the thing was executed, it might know what container it was executed in, it knows what configuration file was used to kick it off, and then it gets runtime information about the materialization. And it's a totally user-pluggable, structured metadata system. So yes, definitely on our roadmap is to build our own metastore on top of this, but it's meant to be very pluggable, where you could write a generic facility which consumes these events; every time a materialization is consumed, you would be able to persist into a metadata store enough state to have full lineage and provenance on that produced materialization. We don't have any out-of-the-box support for that right now, but it would actually be pretty straightforward to integrate with an existing metastore, and we are really excited about that direction. So if anyone wants to do that, please come talk to us, because we love working with people who want to build on top of this.
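The event-consumption pattern described above can be sketched in a few lines. The event shape and field names here are assumptions for illustration, not Dagster's actual event schema: a consumer folds a stream of materialization events into a store that can answer provenance questions later.

```python
# Illustrative only: fold a stream of materialization events into a
# lineage/metadata store. Field names are made up, not Dagster's schema.
events = [
    {"type": "materialization", "asset": "users_clean", "run_id": "r1",
     "solid": "clean_users", "inputs": ["users_raw"], "config": {"date": "2020-01-01"}},
    {"type": "materialization", "asset": "user_stats", "run_id": "r1",
     "solid": "aggregate", "inputs": ["users_clean"], "config": {"date": "2020-01-01"}},
]

meta_store = {}

def consume(event_stream, store):
    """Persist enough state per asset to reconstruct provenance later."""
    for ev in event_stream:
        if ev["type"] != "materialization":
            continue
        store[ev["asset"]] = {
            "produced_by": ev["solid"],
            "run_id": ev["run_id"],
            "upstream": list(ev["inputs"]),
            "config": ev["config"],
        }
    return store

consume(events, meta_store)

def full_lineage(asset, store):
    """Walk upstream pointers to recover the asset's full provenance chain."""
    chain = [asset]
    frontier = list(store.get(asset, {}).get("upstream", []))
    while frontier:
        nxt = frontier.pop()
        chain.append(nxt)
        frontier.extend(store.get(nxt, {}).get("upstream", []))
    return chain
```

A real integration would persist this into an existing metastore instead of a dict, but the shape of the work — subscribe, record, walk upstream edges — is the same.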
Tobias Macey
And for somebody who is interested in getting started with Dagster and writing their own dataflow or data application, can you talk through the overall workflow for defining all of the different computation points, integrating it, and then deploying it to production and making sure that the execution contexts are configured properly?
Nick Schrock
Sure, I'll go through that quickly. You start with pip install dagster — it is just a Python module. To say hello world, you would have a Python file, and you would write what we call a pipeline, which is one of these DAGs, and then a solid, which is effectively just a function that defines a computation. So you write a function; that function is a total black box — you can invoke pandas, a data warehouse job, a Spark job, whatever you want — and then you orchestrate: we have an elegant DSL for stringing those solids together into a DAG. Once you do that, you can launch and debug it, either in a unit testing environment, obviously, or using our development tool called Dagit. So just locally on your machine, without deploying anything, you can run Dagit, visualize the DAG, inspect it, configure an execution of it — we have this beautiful auto-completing config type system — then execute it locally and verify that things happen. The fact that we've architected the system to be executable in different contexts means it's also executable on your local machine for testability and whatnot. We also have abstractions that help users isolate their environment from their business logic, because this is critical for getting testing going. OK, so now you have that working. For deployment, with our new release you can deploy in a very straightforward fashion using the DevOps tools that come with this. Once you have that pipeline written, you would effectively type dagster-aws init, and it would provision an instance and install the correct requirements — you need to have a requirements.txt locally.
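The hello-world shape described here — a decorator marking solids, a DAG wiring them together, outputs feeding downstream inputs — can be mimicked in plain Python. This is a toy sketch of the flavor of the programming model, not Dagster's real API.

```python
# Toy sketch of the programming model: a @solid decorator and a pipeline
# that wires solids into a DAG and runs them. Not the real Dagster API.
def solid(fn):
    fn.is_solid = True          # tag the function so tooling could find it
    return fn

@solid
def extract():
    return [1, 2, 3]

@solid
def transform(rows):
    return [r * 10 for r in rows]

@solid
def load(rows):
    return sum(rows)

def run_pipeline(dag):
    """dag is a topologically ordered list of (solid, upstream solids);
    each solid's output feeds its downstream consumers as arguments."""
    results = {}
    for fn, upstream in dag:
        args = [results[u] for u in upstream]
        results[fn] = fn(*args)
    return results

dag = [(extract, []), (transform, [extract]), (load, [transform])]
outputs = run_pipeline(dag)
```

Because each solid is just a parameterized function, every node can also be invoked directly in a unit test, which is the testability property discussed throughout this conversation.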
And then you're up and running in the AWS environment, in your VPC. We also have a Python API for defining schedules, which is just a light wrapper on cron, so you can go from writing this to deployment very quickly. If you want to customize further, we have this new abstraction that we call an instance — you can think of it as an installation — and you can configure it. When you init that AWS environment, or init your local environment, you can say: hey, instead of doing a single-threaded, single-process execution, we want to execute this thing on top of Dask, for example. So you would configure your instance, which is just a YAML file in a well-known spot, to use something like Dask instead of our native toy executor. So it's
definitely pluggable on multiple dimensions.
Tobias Macey
One of the things that you commented on, and that stands out about Dagster, is the concept of strong contracts that it enforces between the different solids or computation nodes. I'm wondering why you feel that those contracts specifically are necessary, and what benefits they provide during the full lifecycle of building and maintaining a data application.
Nick Schrock
So what struck me about a lot of these systems is the amount of implicit contracting in them, and how frequently those contracts go unexpressed. Again, contrast with Airflow: if you look at their documentation, they say that if you feel compelled to share data between your tasks, you should consider merging them. Someone did write a system to pass data between tasks, called XCom, but it is generally not used that much, and I believe even its creators consider it something of a swing and a miss. But the thing is, Airflow tasks are passing data between one another implicitly, right? If you have task A which comes before task B, presumably the dependency exists only because A has changed the state of the world in such a way that it needs to happen before B. And if you want that to be testable, the tasks have to have parameters and you have to pass data between them. So to me this wasn't some massive realization — I think everyone understands that there are data dependencies between these things; it's just a question of whether you express them in the system or not. I think it is critically important to express them, for any number of reasons: in terms of human understandability, meaning you can actually inspect the thing in a tool and understand what the computation is doing, and in terms of guiding your users to write these things in a testable manner. Because if you can't pass data between tasks, there's no way you can test those tasks. I just think that these data applications are DAGs of functions that produce and consume data assets, and they should be testable: you should be able to execute arbitrary subsets of them, and in order to do that you need them to be parameterized, with some of the parameters coming from the outputs of previous tasks — which means they're effectively functions.
And then there are also really interesting operational properties that come out of expressing your data dependencies. It's a fundamental layer on which you could build, say, fully incremental computation, and have the system understand how it should memoize produced data, among other things. Max Beauchemin, the creator of Airflow, who has also been on your show, has written a couple of blog posts about so-called functional data engineering which have been influential in my thinking here. So I just think it's the right way to build these systems, on any number of dimensions, and you get a lot of value by expressing those data dependencies and those parameters in your computation.
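The memoization point above follows directly from making inputs explicit: if a task is a pure function of its parameters, a runtime can cache its outputs keyed by those parameters and skip re-execution. A minimal sketch, with made-up names and a dict standing in for a real artifact store:

```python
# Sketch of functional data engineering: explicit inputs let the runtime
# memoize outputs and skip recomputation. Illustrative names only.
import hashlib
import json

cache = {}    # stand-in for a persistent artifact store
calls = []    # records which tasks actually executed

def memoized(fn):
    def wrapper(*args):
        # key the output by task name + a hash of the (JSON-serializable) inputs
        digest = hashlib.sha256(json.dumps(args).encode()).hexdigest()
        key = (fn.__name__, digest)
        if key not in cache:
            calls.append(fn.__name__)
            cache[key] = fn(*args)
        return cache[key]
    return wrapper

@memoized
def clean(raw):
    return [r.strip() for r in raw]

@memoized
def count(rows):
    return len(rows)

n1 = count(clean([" a ", "b "]))
n2 = count(clean([" a ", "b "]))   # second run served entirely from cache
```

If tasks instead mutated shared state implicitly, no such caching (or subset re-execution) would be safe, which is exactly the argument for expressing the dependencies.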
Tobias Macey
And you mentioned testing a few times in there, and I've got a couple of questions along those lines. One is how Dagster itself facilitates the overall process of testing, some of the challenges that exist for testing data applications, and how you approach it. But I'm also curious how you approached defining the type system for Dagster, to encapsulate some of the complex elements that you need to pass between things — such as database connections, to identify that there was some change in a record set, or an S3 connection, to define the fact that there were some Parquet files dumped into a particular bucket.
Nick Schrock
Okay. Well, I feel like you just asked two questions that could each fill an entire podcast on their own, but I will do my best. So you asked how Dagster approaches testing, and this is a huge and important subject — not just how Dagster does it, but testing in the data domain in general. Everyone acknowledges that it's really difficult, and it's one of the things I really noted when I was first learning about this space. In terms of what makes this domain different from application programming: the same developer who, in a traditional application, would be writing lots of unit tests — take that same human being, move them to writing one of these systems, and all of a sudden they're not writing tests anymore. Because it's a fundamentally different domain; it's harder to do testing in. When I think about testing here, I think about three different layers, so let's go through them: one is unit testing, the next is integration testing, and then what we'll call pipeline or production testing. Each of them in this environment has its own issues. So, unit testing this stuff is hard, and a big reason why is that these systems typically have dependencies on external pieces of infrastructure which are effectively impossible, or very difficult, to mock out. This is one of the reasons we built the system so that Dagster flows a context object throughout your entire computation. The goal is that anywhere you would otherwise grab some global resource — a database connection with hard-coded values, a Spark context, or whatever — you instead attach those same exact objects to our context object. And that allows the runtime to control the creation of that context.
And therefore, with the same API, we can control the environment that the user is operating within. What I mean is that instead of calling some global get-connection function, you would say context.resources.connection. What that allows you to do, based on how you configure your computation in any specific instantiation, is swap in a different version of that connection, so that you can test this stuff in a unit testing context without changing the business logic. Now, the thing about the data domain is that you can't capture as much in unit testing as you can in application development, because of the external dependencies, but you can still do a lot in the unit testing environment: you can make sure that a refactoring worked, that if you renamed a function your configurations are still being parsed correctly — there are all sorts of changes that can be covered there, and I think it's a critical part of the process in a CI/CD pipeline. Okay, next: integration testing. This is more like, hey, we can't mock out our Spark cluster — actually, mocking out a Spark cluster would be an entire company, an entirely complicated piece of software itself. But what we do want is to easily parameterize the computation, so that in the integration test environment we spin up a very tiny, test-only Spark cluster, run on a subsample of the data, and still have something happen in a verifiable way. That integration testing layer leans on our built-in configuration system.
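The context/resources idea described here can be sketched in a few lines of plain Python. The class and attribute names below are assumptions for illustration, not Dagster's actual API: the point is that business logic asks the context for a resource instead of constructing a connection itself, so a test can inject a fake.

```python
# Sketch of the context/resources pattern: the runtime controls which
# "db" implementation the business logic sees. Illustrative names only.
class Context:
    def __init__(self, resources):
        self.resources = resources

def store_results(context, rows):
    """Business logic: never knows whether the database is real or fake."""
    conn = context.resources["db"]
    for row in rows:
        conn.insert(row)
    return conn.count()

class FakeDb:
    """Stands in for a real connection in the unit testing environment."""
    def __init__(self):
        self.rows = []
    def insert(self, row):
        self.rows.append(row)
    def count(self):
        return len(self.rows)

# unit test environment: swap in the fake without touching business logic
test_ctx = Context(resources={"db": FakeDb()})
inserted = store_results(test_ctx, [{"id": 1}, {"id": 2}])
```

In production, the same `store_results` function would receive a context whose `"db"` resource is a real connection; only the configuration of the run changes.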
In order to make these pipelines testable, they typically become extraordinarily complex functions with tons of parameters that configure both how they interact with their environment and where they get their data from. We fully embraced that, and built a configuration management layer that makes it easier to manage complex configuration. One of the goals is to enable better integration testing, so that in your CI/CD pipeline, or locally, you can have different instantiated versions of config that do full or partial integration tests of your pipeline. That way you can slice and dice your pipeline into whatever subset you want and execute it within different environments. The last component of testing in these data pipelines, which we also have full support for within Dagster, is the notion of pipeline or production tests. My thinking in this area is deeply influenced by Abe Gong, the creator of Great Expectations, who uses the term pipeline tests to describe this. The idea is that one of the differences between this and traditional programming is that in data applications, you do not have control over your input. You're typically ingesting data from a data source you don't control, and that means that if the data changes, some assumption about it changes. Think about it: you look at some CSV file and write some code to parse it; there are a bunch of implicit assumptions in the code you wrote in order to load it correctly. If in the next day's data dump some of those assumptions have changed — and they're not part of any formal contract — your code will break.
So what expectations, or data quality tests, are is the notion that instead of having these implicit contracts between your incoming data and your computation, which only get expressed when things break, you front-load that and say: in order for this computation to work, the incoming data has to conform to this data quality test. Meaning, for example: I expect the third column to be named foo, for at most some small fraction of its values to be null, and the rest to be integers. Because you cannot control the ingest, the only way you can run that test is at production time. This is much more like a manufacturing process than a traditional software process: the data is the raw material, you're getting it shipped from some place, but before it goes into the machine you need to run tests on that raw material to ensure it conforms to the requirements of the machine — requirements it might just be breaking right now. So through our abstractions we try to guide the developer, and with tooling built on those abstractions help the developer, to execute all three of those layers of testing, which are all necessary for a well-functioning system. I believe your other question was about our type system and the things that you pass between the solids? Correct? Alright, so the type system in Dagster is a good way to transition from those production tests. Because going into this, given that property I was talking about — we typically don't control ingest — and given the vast heterogeneity of systems used to process this stuff, what can the type system in one of these frameworks, which claims to span programming languages and span different computational systems, actually do?
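An expectation of the kind described — column name, null fraction, value types — can be written as a small predicate over a batch of rows. This is a minimal sketch in the spirit of the idea, not the Great Expectations or Dagster API; the thresholds and field names are made up.

```python
# A minimal data quality test: check one column of an incoming batch and
# return pass/fail plus metadata, instead of letting downstream code break.
def expect_batch(rows, column, max_null_frac=0.01):
    """Return (passed, metadata) for a data quality check on one column."""
    values = [row.get(column) for row in rows]
    nulls = sum(v is None for v in values)
    non_null_ints = all(isinstance(v, int) for v in values if v is not None)
    null_frac = nulls / len(values) if values else 1.0
    passed = null_frac <= max_null_frac and non_null_ints
    return passed, {"null_frac": null_frac, "all_ints": non_null_ints, "n": len(values)}

good = [{"foo": i} for i in range(200)]
bad = [{"foo": None}] * 5 + [{"foo": "oops"}] + [{"foo": i} for i in range(4)]

ok, meta = expect_batch(good, "foo")      # passes
fail, meta2 = expect_batch(bad, "foo")    # fails: too many nulls, non-integer value
```

Running such a check at ingest time is the "inspect the raw material before it goes into the machine" step from the manufacturing analogy: the failure is surfaced with metadata at the boundary, rather than as a mysterious break deep in the pipeline.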
We came to the conclusion that, for now, the most simple and most flexible thing for a type to say in Dagster is this: when you say, hey, I have a solid, and it has an input of type Foo, all that type Foo says at its core is that you provided a function such that when a value is about to be passed into that solid, it needs to be able to pass this test. Literally, the core capability of any type in the Dagster system is a function that takes an in-memory value, does an arbitrary computation, and then returns true or false, plus some metadata about what happened. So it's a totally dynamic, flexible, gradual typing system that allows users to customize their own types and do whatever they need to do in order to pass that type check. The type system — the inputs and outputs — is all about the data flowing through the system. The other things you mentioned, though — database connections, S3 connections, things of that nature — those we model on a different axis or dimension of the system that we call resources. I was mentioning that context object that we flow through the entire system: a database connection or an S3 connection is something you would attach to that context. And where the vision really goes is that we want an entire ecosystem of those resources, so that people are thinking in terms of higher levels of abstraction. Like, let's take S3: why are people using S3? Well, maybe all you're doing is saying, hey, in a previous part of this computation I'm producing a file, and I just need to stash it somewhere and have it saved, so that later down the line a solid can take it and do some further processing on it. If you think about that abstraction — we call it a file cache; we have a file cache abstraction that comes with Dagster.
There's a local file system implementation of it, and there's also an S3 implementation, so that you can do that operation of stashing a file somewhere in order to perform your business logic. Locally you can just say, hey, I'm operating in local mode, use the local version of that file stash resource; but in production, operating in a clustered Kubernetes environment, give me the S3 or GCS version of that same exact abstraction. So things like S3 connections, database connections, and the things stacked on top of them we model as resources, because they're not business logic concerns, they're operational concerns. Our goal is to have the type system and the data quality tests be about the data — the meaning of the data flowing through the system — and have the context and resources aspect be about environmental and operational concerns.
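The "type as a predicate function" idea from above can be sketched directly. The class and method names below are illustrative assumptions, not Dagster's real API: a type is just a name plus a check function that takes an in-memory value and returns success or failure with metadata.

```python
# Sketch: a gradual type is a named predicate that runs when a value
# flows into a solid's input. Illustrative names, not the Dagster API.
class DagType:
    def __init__(self, name, check):
        self.name = name
        self.check = check          # value -> (bool, metadata dict)

    def type_check(self, value):
        ok, meta = self.check(value)
        return {"type": self.name, "success": ok, "metadata": meta}

# a user-defined type: "a non-empty list of rows"
NonEmptyFrame = DagType(
    "NonEmptyFrame",
    lambda rows: (len(rows) > 0, {"row_count": len(rows)}),
)

result_ok = NonEmptyFrame.type_check([{"a": 1}])
result_bad = NonEmptyFrame.type_check([])
```

Because the check is an arbitrary function, it can be as loose (anything passes) or as strict (schema, statistics, business rules) as the user wants, which is what makes the system gradual rather than statically prescriptive.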
Tobias Macey
And on top of all your work on Dagster itself, you have created a company to be sort of the backstop for it, in the form of Elementl. I'm wondering how you're approaching the overall sustainability and governance of Dagster, what your path to sustainability and success for the business happens to be, and how the two relate to each other.
Nick Schrock
Yeah, that's a great question, and I think about this stuff a lot. Open source governance is top of mind for me, actually, because about a year and a half ago Lee and Dan, the GraphQL co-creators, and I spun GraphQL out of Facebook and started the GraphQL Foundation, which is now run in concert with the Linux Foundation. That's been a really interesting experience. There has also been, for lack of a better term, a lot of hoopla around open source sustainability across many dimensions: should there be new licensing regimes, what's the relationship with the cloud vendors, what is proper governance. My belief is that you should have a pretty clear wall of separation between an open source ecosystem and any commercial entity stacked on top of it or associated with it. So I deliberately chose the name Elementl to be different from Dagster, and my goal is that the relationship between Dagster and Elementl will be similar to the relationship between GitHub and Git, structurally. Meaning that Dagster will be an open source project that will forever and always be free — at no time will we do any sort of thing where we just flip feature flags and have enterprise features in it. It will be a self-contained, governable open source project with very well-defined properties and very well-defined boundaries, such that in the future we can have a neutral governance model that works well. Simultaneously, though, we're also trying to build a sustainable, healthy business, and that's where Elementl comes in. The reason I like the GitHub/Git analogy is that GitHub is a product: they chose to make a bunch of it closed source, it's hosted, there's a login, you have users that do stuff, people are happy paying for it. It leverages the success of Git, but it has its own product dynamics; it's not just pure hosted Git.
And then there's this cool dynamic where GitHub made it easier to host Git, which actually increased the adoption of Git and made Git more of the obvious winner, which then increased the popularity of GitHub — this kind of reflexive relationship. That's what we want to do eventually with the product that will become Elementl Cloud, or whatever we end up calling it, and Dagster. Elementl will eventually be a product that leverages the success of Dagster, meaning that if your team has adopted Dagster as a productivity tool, it will be natural, compelling, and in everyone's best interest to adopt Elementl as your data management tool, which leverages the adoption of the abstraction. Everyone's incentives are aligned if you do that well, and you can clearly communicate to your users that they're not going to be hoodwinked: if you're just using this as a pure productivity tool, that's totally fine with us, and Godspeed. It's our job to build data management tooling that leverages that, such that enterprises that employ developers who use Dagster feel really good about having a commercial relationship.
Tobias Macey
And in terms of the future roadmap of Dagster itself, I'm wondering what listeners should be keeping an eye out for, and what elements of that roadmap you consider necessary before you are comfortable cutting a 1.0 release.
Nick Schrock
Ah, the ever-present question of a 1.0 release. On the future roadmap: I certainly think you will see us, effectively based on user feedback, prioritize integrations with specific parts of the ecosystem. After this release, I imagine the tools will look more compelling to people, in which case some of them will say: hey, I understand that you have this Airflow integration, and I'm really interested in using this other tool I see my friend using, but I still can't move my company off of Airflow in one shot — what can I use as a value-add over Airflow? So we anticipate maturing our integrations with different technologies, but that will be based on user demand. I think the other thing is that you'll see us building more and more tooling off these higher-order layers of the computation — being able to say, hey, I did this data quality test, I produced this materialization — and you can name off any number of things you can do based on that: a metastore, anomaly detection, data quality dashboards, all sorts of other stuff. But for the next one to two months it's going to be a more meat-and-potatoes time where, based on feedback — ergonomic issues, operational issues that come up — we will be evolving the programming model and our documentation, doing a getting-back-to-basics type of thing. In terms of a 1.0 release: to me this is mostly about communicating expectations to the users, saying, hey, this is an API we're going to stand behind for years, and really committing to backwards compatibility in a really, really serious way, because we're super confident that this is the base API layer for the future of the system.
We still have a few iterations to get through, and while we're not going to be breaking people willy-nilly, I suspect that based on user feedback and how the system evolves organically through this process, we will be changing some APIs and maybe even taking the system in different directions. So to me, the whole 1.0 question is mostly about external communication and setting expectations for future users, and it's more of a qualitative judgment than anything else.
Tobias Macey
Are there any other aspects of Dagster, your work at Elementl, or your thoughts in the space of data applications that we didn't discuss yet that you'd like to cover before we close out the show?
One thing I'll say, and this goes back to the "what's the deal with this new terminology" aspect, is that I think most of these systems over-specialize. There are lots of people who build ML experimentation frameworks, for example, that are totally and wholly separate from their data engineering practice. And all these things end up having to coexist within the same data application anyway. So I think a lot of these tools are overly specialized. One thing I'm really excited about, in terms of tooling we'll be able to build, is that it will be very straightforward to build an ML experimentation framework over Dagster, because you can use an API to enqueue jobs with different configuration parameters, which is what you need to do in order to, say, run a hyperparameter search or things of that nature. You should be able to use what is effectively a lightly specialized tool over the same ecosystem to do ML experimentation, rather than using an entirely different domain of computation. So we very deeply believe in this multidisciplinary aspect of it. One other integration that we didn't really talk about is that Dagster has a first-class integration with Papermill, which I believe you've done an episode on. What that system does is allow you to turn a Jupyter notebook into a coarse-grained function, effectively, and we in turn make it easy to wrap that within a solid. So I guess what I'd like to emphasize is the multidisciplinary aspect of this: it's a way for people to describe and package computations that are actually encoded in different systems, but express them in a similar way and wrap them in a common metadata system.
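The experimentation idea Nick describes, enqueueing the same pipeline repeatedly with different configuration, can be sketched in plain Python. This is an illustration only: `build_run_configs` and `submit_run` are hypothetical names standing in for an orchestrator's run-launching API, not Dagster's actual interface.

```python
from itertools import product

def build_run_configs(param_grid):
    """Expand a dict of parameter lists into one run config per combination."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[key] for key in keys))]

# Stand-in for an API call that enqueues one pipeline run with a given config.
submitted_runs = []

def submit_run(run_config):
    submitted_runs.append(run_config)

# A small hyperparameter search: every combination becomes its own run,
# executed by the same machinery as any other pipeline run.
grid = {"learning_rate": [0.01, 0.1], "max_depth": [3, 5]}
for run_config in build_run_configs(grid):
    submit_run(run_config)
```

With a real orchestrator, `submit_run` would call the orchestrator's run-submission API, so the "experimentation framework" reduces to config generation layered over the execution infrastructure already used for data engineering work.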
In the same vein, we actually have a prototype-quality integration with dbt as well, where you can have a dbt project, authored by an analyst or a data engineer, and wrap that as a solid. Then you can execute it within the context of one of our pipelines, and that solid will communicate, "Hey, this dbt invocation produced these three tables and these two views," etc. So yeah, I think we need this sort of unification layer, and that's what we're trying to do.
Tobias Macey
Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Yeah, I guess it's pretty self-serving, but it would be an issue if I was working on something and thought the answer was totally different from that. From what I see, the gap in the ecosystem is somewhat Dagster-shaped, I'll say, meaning I don't think the gap is the next great cluster manager, or just the right drag-and-drop ETL framework that abstracts away the programmer or something. This is a software engineering discipline. So I'll answer the question this way: the biggest gap in tooling is tools that, instead of trying to abstract away the programmer, try to upskill people who don't consider themselves programmers, so they can participate in the software engineering process and really treat these systems as applications, and not as one-off scripts or something you just drag and drop once and be done with. This is one of the reasons why I'm such a huge fan of dbt: one of the things they've been able to do is take people who don't conceptualize themselves as software engineers, analysts, and through a really nice product allow those analysts to participate in a more industrial-strength software engineering process. I think that direction is super exciting, and we're trying to do that and trying to enable those types of tools with Dagster.
Tobias Macey
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Dagster. It's a tool that I've been keeping a close eye on for a while now, and I look forward to using it more heavily in my own work. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day.
Nick Schrock
Thanks, Tobias. Thanks for having me.
Tobias Macey
Thanks for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.