In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal’s code‑first programming model—workflows, activities, task queues, and replay—and how it eliminates hand‑rolled retry, checkpoint, and error‑handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what durable execution is and how it impacts system architecture?
- With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work?
- One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.?
- What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects?
- Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data?
- AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications?
- What are some of the conceptual hurdles that teams experience when they are starting to adopt Temporal or other durable execution frameworks?
- What are the most interesting, innovative, or unexpected ways that you have seen Temporal/durable execution used for data/AI services?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Temporal?
- When is Temporal/durable execution the wrong choice?
- What do you have planned for the future of Temporal for data and AI systems?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Temporal
- Durable Execution
- Flink
- Machine Learning Epoch
- Spark Streaming
- Airflow
- Directed Acyclic Graph (DAG)
- Temporal Nexus
- TensorZero
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. WHOOP and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for DBT Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.
[00:01:58] Tobias Macey:
Your host is Tobias Macey, and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures. So, Preeti, can you start by introducing yourself?
Preeti Somal:
Hi, everyone. Glad to be here chatting with you today. My name is Preeti Somal, and I run engineering at Temporal Technologies. We are the pioneer behind durable execution. Prior to this, my background has been a lot of enterprise software, most recently at HashiCorp.
[00:02:30] Tobias Macey:
And do you remember how you first got started working in the area of data?
[00:02:35] Preeti Somal:
Yeah. I think my first exposure to data was actually at Yahoo. I was at Yahoo for four years, and that's when I learned all about the wonderful world of Hadoop and data. I was more on the systems management side, and one of our dreams was to get all of the monitoring data into the Hadoop clusters and kinda see what magic comes out of that.
[00:03:00] Tobias Macey:
And now you're working at Temporal, which is definitely at the forefront of this space of durable execution and fail-proof application design. And I'm wondering if you can just give a bit of an overview of what that term means when somebody says durable execution, and some of the ways that it changes the way people think about their overall system architecture.
[00:03:23] Preeti Somal:
Absolutely. So the way we like to explain this is: what if your application crashes and the crash is inconsequential? And that's exactly what durable execution gives you. The goal here is to offload the developer, the engineer, from all of the heavy lifting that goes into building reliable, resilient applications, and take all of that work and deliver it through a platform. And that platform is Temporal, and that's the core of the durable execution value that we are delivering.
[00:04:02] Tobias Macey:
And so in terms of that resiliency to errors and failure, obviously not every error is one that you can automatically recover from, but many of them are just due to transient issues, whether those are brief network outages, an application node going down to get restarted because of a rollout of a new version, or maybe a configuration error that needs to be addressed before you can retry. And I'm just wondering if you can talk to some of the ways that, as you are building on top of something like Temporal, it forces you to think about what those different failure modes are and what the appropriate behaviors are.
[00:04:43] Preeti Somal:
Yes. Absolutely. So the main thing to talk through here is that Temporal is a programming model, and our approach has been to build really idiomatic SDK support. So we have support for Temporal in multiple languages. And as developers sit down and understand Temporal, the two key elements are to think about how your application is structured vis-à-vis a concept called a workflow, and, as part of that model, to put the error-prone pieces of the application into something we call an activity. And so that starts creating a separation of concerns between the logic of a multistep process and encapsulating a step that could be error-prone into an activity. And from there, you can just set policies around how many times you want to retry, exponential backoff, you know, all of the decorators and the power around what you want to do with respect to handling that error. But you don't actually need to build that logic. So one thing we find is that in any application code, roughly 60% of the code is the scaffolding around the error-handling pieces of it, and that just goes away with Temporal.
And so you get much better readability. You get the ability to focus on just writing your business logic, and Temporal will handle everything else for you.
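As a rough illustration of the scaffolding that a retry policy replaces, here is a plain-Python sketch. This is not the Temporal SDK; the decorator, its parameters, and the flaky function are all invented for illustration of the hand-rolled code that a declarative retry policy lets you delete.

```python
import time

def with_retry(max_attempts=3, initial_backoff=0.01, backoff_coefficient=2.0):
    """Hand-rolled retry scaffolding of the kind a durable execution
    platform's retry policies let you delete from your business logic."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            delay = initial_backoff
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # retries exhausted; surface the error
                    time.sleep(delay)
                    delay *= backoff_coefficient  # exponential backoff
        return wrapper
    return decorator

calls = {"n": 0}

@with_retry(max_attempts=3)
def flaky_fetch():
    """Stand-in for an error-prone 'activity': fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "payload"

print(flaky_fetch())  # → payload (after two retried failures)
```

With a durable execution platform, the body of `flaky_fetch` is roughly all you would write; the retry loop, backoff, and attempt counting move into a policy.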
[00:06:24] Tobias Macey:
And with that focus on reliability and high uptime, and the separation of concerns between the reliability and state management versus the business logic, I'm curious how you're seeing that impact the way that data teams are thinking about the use of Temporal within their work, whether that is building different ETL pipelines, or managing state storage for some sort of data-oriented application, or various other use cases.
[00:06:56] Preeti Somal:
Absolutely. And so, really, when you talk about the data aspect of this, you know, a pipeline and managing state, what we are seeing is that all of these tasks are multistep processes, and there is a lot of coordination and state management that goes into them. As well as, you know, if an error happens, do you restart from the start? And that could be really expensive, especially if you're training models and doing task management on really expensive GPUs, etcetera. The overall job of the engineer becomes much simpler, because they're able to logically think about that pipeline in terms of the steps it has and then build those steps out using Temporal, without needing to worry about state management or queues or checkpointing. All of the complexities that aren't related to the task at hand just sort of go away.
And so what we are seeing, especially given all of the data hunger of AI applications, is that the number of customers running their data pipelines on Temporal has really grown exponentially. And then the other piece of it is that they're able to go faster. So one of the interesting dimensions here to talk a bit about is that, often in software engineering, developer productivity and reliability are considered to be at odds with each other. And we really think that's a false dichotomy. With Temporal, we believe, and we have customers attesting to this, that you can increase developer productivity while bringing that reliability and scale as well.
[00:08:59] Tobias Macey:
And then digging a bit more into some of those foundational primitives of Temporal and the overall space of durable execution: you mentioned checkpointing, which is something that has very specific meanings in different contexts. If you're doing machine learning, you'll typically checkpoint at the completion of each epoch of a training run. In systems such as Flink, there are checkpoints once you're done processing a particular window. When you're dealing with something such as Spark Streaming, it has the idea of microbatches, where maybe you're going to checkpoint at the completion of each batch. And I'm wondering, as you work with data teams who are thinking about where and how to incorporate something like Temporal, how does that change the ways that they think about the foundational primitives in the tools that they're relying on, and maybe some of the ways that they can reduce that reliance and, in some cases, even move to a simpler tool because Temporal handles some of that heavy lifting?
[00:10:01] Preeti Somal:
Yeah. And that's a great question, because I think what it teases out is that a lot of the tools you mentioned were built specifically for the data space. And as you know, Temporal is a general-purpose durable execution platform. And while we are getting a ton of usage in data, out of the box our abstractions are not extremely opinionated about how data engineers should be structuring their workflows. Right? So a checkpoint is a term that actually doesn't even show up in our abstractions. But depending on how the team at hand is thinking about their needs, they can implement it using the signals, the activities, the tasks, and the rest of the abstractions that the Temporal programming model delivers.
[00:11:01] Tobias Macey:
And one of the higher-level primitives in Temporal is the concept of a workflow, which is a sequence of tasks to complete. I know it also has a concept of activities. Particularly when you're dealing with the data ecosystem, workflows again have a very specific meaning, which is generally incorporated into the idea of a data orchestration engine that handles the sequence of steps in a particular directed acyclic graph, or DAG. And I'm wondering, as people are starting to adopt Temporal, maybe for application use cases, how that shifts the ways that data teams are thinking about their usage of orchestration engines, the role of the orchestration engine in the management of workflows and the state related to those workflows, and also, as application teams and data teams are building on that same substrate, how it brings those use cases closer together.
[00:11:58] Preeti Somal:
Yeah. And, you know, really great question. Again, I think that really highlights the power of the platform. So, clearly, Temporal is code-first, and a lot of the tooling that exists in the data-specific space is oriented around DAGs. Right? And so we believe, and what we're seeing is, that the code-first approach lets you reason about the logic in a much more compelling way and provides the flexibility and scale that you need as your pipelines get more and more complicated. In fact, we had a talk at our conference where a particular customer, who in their case was using Airflow, moved from Airflow to Temporal. The first phase of that was that they actually just built a DAG-to-Temporal-workflow mapping to get their teams up and running with Temporal and familiar with it. And then as they learned more, they started taking out the DAG pieces and being code-first. So I think that is a super compelling approach here. And then I think the second piece, and this is where we see a lot of the use cases that illustrate both the AI application and data domains coming together, as well as the more real-time components, is being able to send signals, for instance, from your data pipeline processing to your application.
And I don't know if you've ever looked at Nexus, but one of the main advances we are seeing with durable execution is that, as teams are building their workflows, the data team might want a way to invoke something in the application tier, or the other way around. And Nexus is an extension of Temporal that allows you to make these calls across boundaries in a secure way. So we really believe, and we're seeing patterns showing, that building both the data pipelines and the application code on the same platform allows a lot more richness in the end application that the consumer is seeing.
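To make the DAG-first versus code-first contrast concrete, here is a plain-Python sketch. No orchestrator SDK is involved; the task names and both runners are invented purely to show the same three-step pipeline expressed both ways.

```python
# DAG-style: the structure lives in a data declaration, separate from the logic.
dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}

def run_dag(dag, tasks):
    done, order = set(), []
    while len(done) < len(dag):
        for name, deps in dag.items():
            if name not in done and all(d in done for d in deps):
                order.append(tasks[name]())  # run once all dependencies finished
                done.add(name)
    return order

# Code-first: the control flow itself is the workflow; branches, loops, and
# error handling are ordinary code, which stays readable as pipelines grow.
def code_first_pipeline(tasks):
    raw = tasks["extract"]()
    clean = tasks["transform"]()
    loaded = tasks["load"]()
    return [raw, clean, loaded]

tasks = {"extract": lambda: "E", "transform": lambda: "T", "load": lambda: "L"}
print(run_dag(dag, tasks) == code_first_pipeline(tasks))  # → True
```

The migration pattern described above amounts to first wrapping an existing `dag` declaration in a workflow, then gradually dissolving it into direct code like `code_first_pipeline`.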
[00:14:33] Tobias Macey:
That's also interesting as we start to dive into some of the areas of AI engineering and AI systems design because I think that's another factor that is pushing these teams closer together where, for a long time, application teams would focus on their user facing applications. Data teams would handle their exhaust and try to turn it into insights to the organization that would maybe then get piped back into the application. And machine learning and AI teams would focus on using those data assets to turn them into machine learning models either for business optimization or for user facing features.
And with the introduction of generative AI systems, it forces all of those teams to work much more closely together, because the cycle time is much faster and experimentation can happen much more quickly. And I'm just wondering how that substrate of Temporal being able to work across those boundaries also factors into the organizational realities as we're bringing AI more into the inner loop of the business.
[00:15:43] Preeti Somal:
Yeah. Absolutely. I think, first and foremost, it is allowing that common durable execution platform to be used across multiple domains, which, as you were pointing out, were historically pretty segregated. And, you know, even the feedback loops there were weeks and months, as opposed to the need now to be as close to real time as possible. And especially with the AI use cases, what's definitely happening is a pattern around identifying the common data prep elements that are needed, those coming into place, and then the app teams iterating much more quickly and driving feedback into that data prep layer. And being able to do that in a way where you can actually break down the silos and have, you know, an RPC level, for lack of a better word, to reach across what were historically hugely firewalled and ring-fenced domains.
What we're seeing is that this is really expediting the time to deliver these capabilities.
[00:16:58] Tobias Macey:
And when data teams are faced with the technical decisions of how to implement a particular use case, particularly if they're dealing with a step-based workflow, what are some of the heuristics that you're seeing them use when they are trying to decide: okay, I have my orchestration engine, whether that's Airflow, Dagster, Prefect, what have you, and I have Temporal because the application team is using it, or maybe they're starting to use it for some specific data use cases. I'm just wondering how they figure out that decision point of which features they need and which tools they're going to use for them, and then, in particular, what are some of the hybrid opportunities for integrating Temporal with that orchestration engine?
[00:17:46] Preeti Somal:
Yeah. So what we are seeing is a couple of patterns. One pattern, as Temporal is getting more and more adopted, with the amount of community and developer love that we are seeing for Temporal, is that we have these champions for Temporal within organizations. So the first pattern we're seeing is just a sheer grassroots adoption pattern, where you've got an engineer who had a great experience using Temporal, is completely a fan of durable execution, and is going and helping other teams understand how to think about Temporal. Right? The second pattern we're seeing is within some organizations where there are senior architects, principal-level engineers, who are thinking ahead in terms of the dependencies and the hybrid nature of these applications.
They are the ones who are bringing Temporal and Nexus into the organization. And then I would say the final pattern that we are seeing quite a bit of is data teams that, you know, started with Airflow, for instance, and it's just not scaling for them. The schedules aren't running as expected. They're missing the reliability and scale, and they're looking for a solution that is proven at scale. So for a lot of our customers that are in the camp of migrating from one of the existing products to Temporal, the common theme is the scale and reliability that they need.
[00:19:33] Tobias Macey:
And moving into the AI use cases, as data teams are starting to come to grips with the actual requirements of the AI application, or they're trying to feed the appropriate data to the teams who are building maybe an agentic use case, what are some of the ways that Temporal simplifies the workflow of doing that iteration or decomposing the state requirements for the AI application, and just some of that interface between the data preparation and data curation stage and the actual activation stage in the context of an AI system?
[00:20:17] Preeti Somal:
Yeah. So one pattern here that we're seeing is the incremental data prep and signaling. We also have some use cases where the data prep needs a human in the loop. We have a customer that we were talking with recently that actually wants to have what they call accountability markers in the final stage of the data prep before it gets surfaced to the application. And that marker could be either a human or a validation system of some kind. So what we're seeing is that there's a multistage, complex flow here that brings in these requirements around accuracy and trust as well, which are really easy to implement with Temporal, again, because it is a general-purpose durable execution platform with a very powerful programming model around it. One other use case we haven't talked about yet is that we're also starting to get used for just the pure task management and scheduling on the data prep side of the house, and signaling to the application that a new batch of data has just come in is another really interesting example. So one of the case studies we've published is with a customer that uses us for medical records transcription and kind of real-time visit summary preparation as well.
And you can imagine, you know, there are a number of pieces in there that also relate to compliance. And so the core thing that I guess I'm trying to articulate is that the complexity and the requirements, as you look at how data feeds these applications, are growing pretty large, because these domains are coming closer together. And that's where needing a platform that can help you build that simply is really compelling.
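The human-in-the-loop "accountability marker" idea can be sketched in plain Python, with a `threading.Event` standing in for the durable signal a workflow would actually block on. All names here are hypothetical, and a real workflow would survive a process crash while waiting, which this sketch cannot.

```python
import threading

approval = threading.Event()   # stands in for a durable workflow signal
result = {}

def data_prep_workflow():
    prepped = {"rows": 128, "status": "prepped"}  # final data-prep stage done
    approval.wait(timeout=5)                      # block until the marker arrives
    prepped["status"] = "approved" if approval.is_set() else "timed_out"
    result.update(prepped)                        # only now surface to the app

worker = threading.Thread(target=data_prep_workflow)
worker.start()
approval.set()    # the human reviewer (or validation system) signs off
worker.join()
print(result["status"])  # → approved
```

The point of making the wait durable is that the approval can arrive hours or days later, after deploys and restarts, without losing the prepped state.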
[00:22:42] Tobias Macey:
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? Datafold's migration agent is the only AI-powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months-long migration nightmares into week-long success stories.
Another aspect of the system design and architecture that I'm curious about when you're building with Temporal: we've been talking about Temporal as a means of state management and durable execution, whereas data engineering as a discipline is entirely concerned with very stateful assets. And a lot of times, the scale of that state is the core of the problem, where you need to deal with terabytes, petabytes, exabytes of data writ large, and Temporal, being a primarily database-backed system, I imagine, is more focused on state at a much smaller scale, on the order of bytes, kilobytes, megabytes. And I'm curious how that factors into the ways that people are using Temporal's state management in juxtaposition with the much larger scale of state management that some of these broad data systems require to be able to operate.
[00:24:32] Preeti Somal:
Yeah. That's a great question. The main thing to note here is that the way the Temporal model works, the state that Temporal is managing is the state of where your workflow and activities are. The beauty, the elegance of our model is that you as the engineer are running what we call workers. Your code runs in your environment, and you don't have to ship all the data over to us. The data resides, you know, where it needs to reside. Your workers, the code that gets built using the Temporal SDK, can run in your environment. In fact, we want them to run in your environment so that they can sit as close to the data as you need them to. What we are managing is the orchestration of the tasks and the activities that the workers are running. And this is a really key point for a number of reasons.
One, it really enables a super elegant security model, where on the Temporal side, we don't see your data. We are just seeing input and output parameters to your workflows, or pointers to S3 buckets, or, you know, whatever you need in terms of the workflow execution context. And that, too, is encrypted by you, and you are the only one who has the keys to it. So what you're passing to us is very small and, as far as we're concerned, is garbage. And the second main reason this is really critical is that this is exactly why Temporal can scale and doesn't hit the limits that we see other systems hitting: because what we manage is purely the task and workflow execution state, not your business-specific or data-application-specific things. Right?
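A toy sketch of the "pass pointers, not payloads" idea described here. This is plain Python; the in-memory store, the URI, and both functions are made up to show how the orchestrator can hold only a tiny reference while the worker, running next to the data, does the heavy access.

```python
# Hypothetical in-place data store: the dataset never leaves your environment.
DATA_LAKE = {"s3://bucket/batch-42.parquet": [1, 2, 3, 4]}

def start_workflow(payload_ref: str) -> dict:
    """What the orchestrator sees: a tiny opaque reference, never the dataset."""
    return {"workflow_input": payload_ref,
            "size_bytes": len(payload_ref.encode())}

def worker_activity(workflow_input: str) -> int:
    """The worker runs next to the data and dereferences the pointer locally."""
    rows = DATA_LAKE[workflow_input]  # the heavy data access stays worker-side
    return sum(rows)

task = start_workflow("s3://bucket/batch-42.parquet")
print(task["size_bytes"] < 100)                  # orchestrator state stays tiny
print(worker_activity(task["workflow_input"]))   # → 10
```

In the real system the reference would additionally be encrypted client-side, so the server stores bytes it cannot read at all.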
[00:26:35] Tobias Macey:
And because you are using that worker to manage execution, it allows you to use your other tools that are accustomed to doing that heavy lifting. And I'm wondering, then, what are some of the key pieces of state or information that are useful to maintain in Temporal for being able to recover from failure, or just some of the ways that people are using that statefulness, maybe going back to our earlier conversation about things like checkpointing in Flink, to be able to handle that without necessarily reaching for some of those heavier-weight, more complicated engines, because you're able to use Temporal to maintain the smaller state for those executions?
[00:27:23] Preeti Somal:
Yeah. Absolutely. So the core of how Temporal works is that the worker runs in your infrastructure, and on the Temporal server side, we have the notion of a task queue. The worker is essentially long-polling the Temporal server: give me my next task, give me my next task. Right? And the "give me my next task" takes with it whatever bare-minimum context parameters you need to run that task. So what Temporal on the server side is doing is maintaining the state around where the worker is, what tasks it has executed, what's the next one to be dispatched, and so on. And so if a crash happens, we can pick up from exactly the last task that was dispatched, and we call this capability replay. It is literally not replaying the entire sequence of events. It is essentially just picking up from where it fell over and then running through the next set of things. And because it has enough of the context around what was executed, what's next, and what the inputs and outputs were, we can just tell your worker: here's where you were, get going. I know it's a hugely simplified description, but I'm hoping that helps with that question.
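A highly simplified sketch of that pick-up-where-you-left-off behavior. This is plain Python, not the Temporal server; the "event history" is just a list, and the step functions are invented stand-ins for dispatched tasks.

```python
def run_workflow(steps, history, crash_at=None):
    """Execute steps in order, skipping any already recorded in the history."""
    for i, step in enumerate(steps):
        if i < len(history):
            continue                  # already completed before the crash: skip
        if crash_at is not None and i == crash_at:
            raise RuntimeError("worker crashed")
        history.append(step())        # record completion in the durable history

executed = []
steps = [
    lambda: executed.append("extract") or "extract",
    lambda: executed.append("transform") or "transform",
    lambda: executed.append("load") or "load",
]

history = []                                  # server-side durable state
try:
    run_workflow(steps, history, crash_at=2)  # crash before the "load" step
except RuntimeError:
    pass
run_workflow(steps, history)                  # a new worker picks up at "load"
print(executed)  # → ['extract', 'transform', 'load']  (no step ran twice)
```

The key property being illustrated: after a crash, only the uncompleted step runs, because completion is recorded outside the worker's memory.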
[00:28:50] Tobias Macey:
Yeah. That's definitely useful. And then moving more into that AI system design and digging a bit deeper: one of the newer requirements that a lot of data teams are maybe less familiar with is the introduction of things such as vector databases, the corpus management for the knowledge bases that the LLMs rely on, and the requirement to maintain the freshness of that information to ensure that you're not feeding old data or bad information to the consumers of that AI system. What are some of the ways that having that durable execution pattern simplifies or enables those teams to be able to build and experiment and maintain that corpus?
[00:29:36] Preeti Somal:
Yeah. I think, again, you know, at the end of the day, I think the main thing durable execution brings here is the rigor around thinking about the state and separating out the sort of the steps of the workflow and activities. Right? And so you would, you know, you'd still run your VectorDB in your own sort of infrastructure, your own accounts, and, you know, you you can have, like, an activity wrapper that can invoke that and and get the context from there and do the checks on freshness, etcetera. One thing we haven't done yet, and this is this is a definitely a topic of conversation internally is this question of, you know, temporal is a general purpose durable execution platform. Should we be thinking about building data specific abstractions that really take some of these patterns we're seeing and helps developers do that more easily?
And, honestly, for us, this is a big discussion, because we have stayed a little agnostic around being very opinionated about things, and we feel like that really helps with a lot of the developer empowerment and creativity and control over the way they wanna implement their use case. But there's definitely a question on the table for us around these abstractions: should we be thinking about building some abstractions that make that data-prep pattern a little more in-product and out of the box, versus maybe a best practice or a sample or a demo?
[00:31:20] Tobias Macey:
And when I was preparing for this conversation, I was reading through some of the blog posts on the Temporal blog about the ways that using Temporal as the state management for agentic applications reduces some of the complexity of actually building those systems, because it can be the system of record for the various conversational flows, for the executions and tool uses of the models. And then, in the event of a failure of an agent to successfully execute a tool call, you have that complete state, whereas a number of the agentic frameworks want to be the owner of that information, or maybe it is, by default, all going to be in process or in memory. And I'm curious how, with the introduction of durable execution, those frameworks can be used more effectively, or just some of the ways that the introduction of this pattern is shifting the ways that people are thinking about the design of those types of systems.
[00:32:40] Preeti Somal:
Yeah. Absolutely. You know, to someone who is spending their day and night thinking about durable execution, the fact that it's all in memory and a crash might mean the user has to start all over again, that sends shivers down my spine for sure. So what we're seeing is that the frameworks don't have durability in place, and this is where we're doing integration. We have a first-party integration with the OpenAI Agents SDK, for instance, that brings durability into the picture. We are also, honestly, seeing customers building agents without needing to use frameworks.
Right? And so customers are building these agents just on top of the durable execution abstractions and foundation. In particular, the interesting thing is we are still early, in the sense that we believe these agents are going to need to be even longer lived. For instance, there's no reason why an agent I use has to be an interactive agent. What I wanna be able to do is give the agent some work, go away, and come back after whatever amount of time and get my results. And so the agent patterns are very much the asynchronous, long-running, durable execution patterns that Temporal has been solving for a very long time, and we're starting to see that value creation coming through now. So, for the set of developers that do wanna use frameworks, we are integrating with frameworks and can bring durable execution to them. But we're also seeing the use of Temporal in just building agents.
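The durable agent loop described here can be sketched in plain Python: each model or tool call is a discrete step whose result is journaled, so a crash resumes mid-conversation instead of starting over. This is an invented illustration of the pattern, not the Temporal SDK; the fake "LLM" and tools exist only for the example:

```python
# Illustrative durable agent loop: every step's (action, result) pair is
# persisted in a journal before moving on, so a restart replays completed
# steps from the journal rather than re-executing them.

def agent_loop(llm, tools, goal, journal):
    """Run the agent until the LLM decides 'done', journaling each step."""
    step = 0
    while True:
        key = f"step-{step}"
        if key in journal:                   # completed before a crash:
            action, result = journal[key]    # replay from the journal
        else:
            action = llm(goal, step)         # decide the next tool call
            result = tools[action](goal) if action != "done" else None
            journal[key] = (action, result)  # persist before advancing
        if action == "done":
            return [journal[f"step-{i}"] for i in range(step + 1)]
        step += 1

# --- usage ---
calls = []

def fake_llm(goal, step):
    # Scripted "model" that plans: search, then summarize, then stop.
    return ["search", "summarize", "done"][step]

tools = {
    "search": lambda g: calls.append("search") or "3 results",
    "summarize": lambda g: calls.append("summarize") or "summary",
}

# First run "crashes" right after the search step was journaled:
journal = {"step-0": ("search", tools["search"]("q"))}

# Resume with the same journal: search is not re-executed.
trace = agent_loop(fake_llm, tools, "q", journal)
print(trace)  # [('search', '3 results'), ('summarize', 'summary'), ('done', None)]
print(calls)  # ['search', 'summarize'] -- the search tool ran only once
```

The journal doubles as the audit trail Preeti mentions next: the full sequence of tool calls and results survives the process that produced them.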
And the piece that you were referring to, which I haven't talked much about, but which is a really compelling part of Temporal, is that you can go into Temporal and look at the execution of your workflow, and you can see exactly what services were called, what LLM calls were made. That visual observability piece is a big part of the value. And you can also export this history, and we have customers that are using that for their audit and compliance needs as well.
[00:35:04] Tobias Macey:
One of the other patterns that I've seen for a similar use case is to proxy through an LLM gateway to be the system of record for all of the interactions between your application and the LLM API. TensorZero is one in particular that I'm familiar with that actually uses a ClickHouse database as that state store and will then use that information as a means to execute reinforcement learning, to do fine tuning of the model that you're using, and to improve efficiency and effectiveness. And I'm curious how Temporal, or teams that are using Temporal, are maybe addressing similar use cases, or what you see as some of the trade-offs between those two approaches of either using Temporal as that state store versus proxying through the LLM gateway and using something like a dedicated database as that state store?
[00:35:59] Preeti Somal:
Yeah. I think it depends on what you're trying to achieve here. That use case around the TensorZero piece, you could build that pretty easily on Temporal, and the value of building it on Temporal would mainly be getting the resiliency and the error handling, all of these pieces that durable execution brings forward. How you make the decision on whether you use what we would call a more opinionated offering versus a general-purpose platform is really dependent on what the end goal is that you're trying to achieve.
But we do see this pattern where customers are using Temporal to call the LLM. And, again, the benefit really is that the actual call gets done from your worker that runs in your environment. So if you've got the complexity of use cases around privately hosted models or data privacy pieces, etcetera, the Temporal model lets you have full control over the output of what the LLM is bringing back, where you store it, and how you compare it and check it and validate it, and so on.
[00:37:18] Tobias Macey:
I think one of the key aspects of Temporal as the state store is that it is also a programmatic substrate, versus something that is maybe more opinionated or constrained in terms of what it is expecting to do. And so it gives you broader flexibility in how to take advantage of that state without necessarily having to do additional integration work to reuse it, so you can actually use the data in situ rather than having to do an extraction, a transformation, a reimport, or some more roundabout means of using it. And I'm curious how that changes the ways that teams are selecting the supplemental tools that they're relying on once they do start using Temporal.
[00:38:06] Preeti Somal:
Yeah. It's a great point, because I think the core message here is that Temporal is a code-first developer tool. Right? And so, as I was saying about the programming languages, our goal is to meet the developer where they are, in the language of their choice. And so the value here is that you can fit your Temporal usage into your existing engineering practices, your CI/CD pipelines, how you do testing. We wanna be able to fit into your processes without inducing more overhead for you. And I think that is a big part of the decision-making criteria that goes in here, because we aren't going to come in and mandate the use of a specific programming language, and we're gonna give you the power of the SDK.
Now the flip side of that conversation always is, well, is an opinionated system faster to use, or is that better for me? And in a lot of cases it might be, and that's totally fine. For us, again, what we are building is a platform where we believe that developers can use the power of the platform as they see appropriate and fit it in with how they are building their software today.
[00:39:29] Tobias Macey:
And for teams who are starting to adopt Temporal, or who are figuring out how best to design around it, what are some of the key primitives or key design patterns that you see teams either struggling to understand or maybe not using in the most idiomatic manner? And what are some of the useful references or pieces of advice that you have for teams who are starting to tackle that design phase of: okay, Temporal seems great, durable execution seems like it would be really useful, but what do I actually do to get started?
[00:40:08] Preeti Somal:
Yeah. So this is a really interesting topic, because the first reaction that people have is, like, it's almost magic, that kind of a reaction. Right? And what we see is that once someone understands Temporal, they cannot look back. They are essentially fully on board with the programming model. But it requires them to have an open mind, because as an engineer, you've been trained for years to be thinking about all of the error scenarios, and you've got all of this code that is gonna deal with all the reliability pieces. And then someone comes and tells you, oh, you don't need any of that anymore. Of course, there is some element of skepticism and some unlearning that needs to happen here. Our recommendation always is to just get your hands dirty: try it, run some samples, read the blogs. Inherently, you've gotta try it. And once you understand the power of the platform, then your design decisions become hugely simplified.
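The "you don't need that scaffolding anymore" point can be made concrete with a small sketch: instead of a hand-rolled try/except/backoff loop around every call, the step declares a retry policy and the runtime re-invokes it. This is a plain-Python illustration with an invented decorator, not the Temporal SDK (whose activities accept a declarative retry policy in a broadly similar spirit):

```python
# Invented decorator standing in for a declarative retry policy: the business
# code stays a plain function, and retries live outside it.
import functools

def with_retry_policy(max_attempts):
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise           # exhausted: surface the failure
        return wrapper
    return deco

attempts = []

@with_retry_policy(max_attempts=3)
def flaky_charge(order_id):
    """Business logic only -- no retry code inside the function body."""
    attempts.append(order_id)
    if len(attempts) < 3:
        raise ConnectionError("payment gateway timeout")
    return f"charged-{order_id}"

result = flaky_charge("o-1")
print(result)         # charged-o-1  (succeeded after two transient failures)
print(len(attempts))  # 3
```

In a durable execution system the retries additionally survive process crashes, which is what the hand-rolled version cannot do.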
One other thing that we hear a bunch about, especially as we're talking to more of the VP of engineering, CIO-type audience: one of their first questions is, okay, what does this replace? And what they're trying to do is pattern match. Does this mean I can get rid of my queue here, or I can do this or that? And the answer is really "Temporal is yes, and more." You have to really start with the initial set of use cases and learn. And from there, what we find is that engineers are applying Temporal across all these domains that we didn't quite imagine they would think of use cases for.
[00:42:00] Tobias Macey:
And as teams are coming up to speed with temporal or maybe they're evolving in terms of their sophistication of its use and the level of integration into their systems, what are some of the most interesting or innovative or unexpected ways that you've seen temporal and the durable execution pattern used in these data and AI workloads?
[00:42:20] Preeti Somal:
So one of the ones that I find fascinating and interesting is a use case that came up where someone was intentionally killing their workers because they wanted to optimize the usage and the cost of the workers. They would just go in and kill the worker, knowing that Temporal could pick up wherever that worker left off. I thought that was fascinating. The other interesting thing that I'm seeing, and, you know, I have a tremendous amount of respect for folks in the data field, is just the sheer volume of the data processing that's happening and how having a production-ready, at-scale orchestration and durable execution platform is condensing the time it takes. We had a customer tell us that the pipelines they were running took, like, eighteen hours. They've been able to condense that down to five minutes, because Temporal has forced them to think about what all the various steps in the process are and how they can sequence those steps so that they can actually go faster.
And if something fails, they don't need to start all over again.
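The "decompose the steps, then sequence them" idea behind that eighteen-hours-to-five-minutes story can be sketched with plain asyncio: once a pipeline is split into independent steps, they can fan out concurrently instead of running back to back, and a failed step can retry alone. The step names and durations here are invented, and this is not the Temporal SDK:

```python
# Illustrative fan-out pipeline: extract first, three independent transforms
# concurrently, then load once all transforms finish.
import asyncio

async def run_step(name, duration, log):
    await asyncio.sleep(duration)   # stand-in for real work
    log.append(name)
    return f"{name}-ok"

async def pipeline():
    log = []
    await run_step("extract", 0.01, log)          # must finish first
    results = await asyncio.gather(               # independent steps fan out
        run_step("transform-a", 0.01, log),
        run_step("transform-b", 0.01, log),
        run_step("transform-c", 0.01, log),
    )
    await run_step("load", 0.01, log)             # waits for all transforms
    return log, results

log, results = asyncio.run(pipeline())
print(log[0], log[-1])   # extract load
print(sorted(results))   # ['transform-a-ok', 'transform-b-ok', 'transform-c-ok']
```

The wall-clock win comes from the middle stage taking one step's duration instead of three; in a durable execution setting, each step's result would also be persisted so a failure retries only that step.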
[00:43:37] Tobias Macey:
And as you have been working in this space and working with the company and the community around Temporal and understanding more deeply the capabilities that it provides and the use cases that it can be applied to, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:43:59] Preeti Somal:
Wow. So I think the main lesson for me is that, if you really think about it, what durable execution and Temporal do is put the onus of reliability on the Temporal server. And, of course, our business model is one where we run Temporal Cloud, and our responsibility there is really high, because the core of the promise we're making is: we will handle reliability for you. And the way we do that is by putting that problem in the Temporal server. So, of course, the Temporal server has to be incredibly reliable.
And I think that is really the main lesson. It's not surprising when you think about it, but living that with a cloud delivery model, especially with some of the outages that we've been seeing, and making sure that we live up to our promise there: that's something that we think about constantly.
[00:45:04] Tobias Macey:
And for people who are designing systems, thinking about how best to manage the reliability of their data workflows or their AI applications, what are the situations where you would advise against the use of temporal or durable execution as a pattern?
[00:45:22] Preeti Somal:
Well, I know you used the word "reliability" in framing the question, so it makes it harder for me to answer. I would say situations where a crash happens and you really don't care about it, you may not need Temporal at that point. But anytime you need to scale and be reliable, we do believe that you should be looking at Temporal and the value it brings there.
[00:45:54] Tobias Macey:
And I guess put another way, what are the situations where the incorporation of temporal adds excessive complexity or unnecessary coordination to an application design?
[00:46:08] Preeti Somal:
Again, I think if you're doing early prototyping and you have not established your business value yet, or you've just got some toy agents and you're trying to figure out what's really gonna be the core IP, maybe you don't need Temporal there. This notion of complexity is something that we are working towards doing more DevRel education on, because we really do believe that the complexity argument is a false one here, because the whole premise of Temporal is to make your life easier.
And so the question really becomes, where is that complexity coming from? Is it learning Temporal? Is it running Temporal? And we really believe that, with all the progress we've made, it should be a non-argument. But, of course, I work here, and I'm willing to be convinced otherwise and to continue working towards improving.
[00:47:19] Tobias Macey:
And you mentioned already that you are trying to combat the tendency to form strong opinions or build excessive integrations into temporal for specific use cases. But I'm wondering what are some of the things you have planned for the future of temporal and the durable execution pattern, in particular, with an eye towards how it can be used in these data and AI systems?
[00:47:43] Preeti Somal:
Yeah. So I think the main focus for us is to continue to improve the onboarding experience onto Temporal, and, wherever the durable execution constructs have any rough edges, to make sure that we keep sanding them down. All of this really ties to our core focus on the developer, and so what you will see from us is continuing engagement on developer pain points and scenarios. I'll give you a concrete example: versioning of workflows. How do you deploy new versions? How do you incrementally roll out traffic across those versions? Maybe there was a long-running workflow on a previous version, and you wanna make sure that you can execute multiple versions and wait till that workflow has completed. We're doing a lot of product work around simplifying that whole space for the developer community.
So that's the kind of thing that you'll see from us: just listening to our community and the developers, and working hard to improve the overall experience.
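The workflow-versioning problem Preeti describes has a well-known shape: executions that started on old code must keep taking the old branch when replayed, while new executions take the new one. The following is a plain-Python illustration of that idea using an invented patch-marker check, not the Temporal SDK (whose workflow APIs expose a patch mechanism in a broadly similar spirit):

```python
# Invented sketch of deterministic branch versioning: each execution records
# which patches it has seen, and replays follow that record.

def process_order(seen_patches, is_replay):
    """Workflow body with a patch marker around a changed branch."""
    def patched(patch_id):
        if not is_replay:                # fresh execution: take the new path
            seen_patches.add(patch_id)   # and record that this execution saw it
            return True
        return patch_id in seen_patches  # replay: follow what it saw before

    if patched("use-new-tax-calc"):
        return "total-with-new-tax"
    return "total-with-old-tax"

# --- usage ---
seen = set()
result_new = process_order(seen, is_replay=False)    # new execution
result_replay = process_order(seen, is_replay=True)  # its later replay
result_old = process_order(set(), is_replay=True)    # pre-patch execution

print(result_new, result_replay, result_old)
# total-with-new-tax total-with-new-tax total-with-old-tax
```

The property being illustrated is determinism under code change: the same execution always takes the same branch, no matter which version of the code replays it.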
[00:49:01] Tobias Macey:
Are there any other aspects of durable execution as a pattern, temporal as an implementation of that, and the application of those capabilities to data and AI systems that we didn't discuss yet that you would like to cover before we close out the show?
[00:49:17] Preeti Somal:
Really quickly: we did touch on Nexus a little bit during the conversation, but if there are folks listening to this that haven't checked out Nexus, I definitely recommend taking a look there. Nexus is also in open source, and this is how we believe organizations that have teams working on different parts of the stack can actually build applications that call across boundaries. So that would be my only call-out here.
[00:49:51] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the temporal team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:09] Preeti Somal:
Gosh. Not living in the space, I'm not sure I have a great answer for you. One of the things that does come to mind is around effective use of resources, whether those are costly GPUs, etcetera. More tooling around helping manage those seems like a pattern. We're seeing people use Temporal to solve that, and maybe someday there will be more opinionated tooling around it.
[00:50:43] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences on the ways that temporal and durable execution can be used to simplify the design and implementation of these data intensive systems and ways to reduce the complexity of managing both the business logic and the resilience and reliability. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[00:51:12] Preeti Somal:
Thank you so much. It was a pleasure to chat with you today.
[00:51:16] Tobias Macey:
Thank you for listening. Don't forget to check out our other shows. The Data Engineering podcast covers the latest on modern data management, and podcast.init covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at AI engineering podcast dot com with your story.
Guest intro: Preeti Somal and Temporal Technologies
What is durable execution and why it matters
Temporal's programming model: workflows vs. activities
Impact on data teams: pipelines, state, and retries
From data checkpoints to Temporal primitives
Workflows vs. DAGs and migrating from Airflow
Bridging app and data teams with Nexus and signals
AI engineering brings faster cycles and tighter loops
Choosing tools: orchestration engines and hybrid patterns
AI data prep: human-in-the-loop and accountability markers
Temporal state vs. big data state: scaling and security model
Replay and recovery: how Temporal resumes work
Vector DBs, freshness, and possible data abstractions
Durable agentic systems and framework integrations
LLM gateways vs. Temporal as programmatic state
Developer-first: languages, CI/CD, and flexibility
Getting started and thinking differently about reliability
Innovative uses: killing workers and 18 hours to 5 minutes
Operating Temporal Cloud: reliability lessons
When not to use Temporal and complexity concerns
Roadmap: smoothing edges, versioning, and rollouts
Nexus recap and cross-boundary calls
Biggest gaps in data tooling and resource efficiency
Closing thoughts and appreciations