In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal’s code‑first programming model—workflows, activities, task queues, and replay—and how it eliminates hand‑rolled retry, checkpoint, and error‑handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what durable execution is and how it impacts system architecture?
- With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work?
- One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.?
- What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects?
- Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data?
- AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications?
- What are some of the conceptual hurdles that teams experience when they are starting to adopt Temporal or other durable execution frameworks?
- What are the most interesting, innovative, or unexpected ways that you have seen Temporal/durable execution used for data/AI services?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Temporal?
- When is Temporal/durable execution the wrong choice?
- What do you have planned for the future of Temporal for data and AI systems?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Temporal
- Durable Execution
- Flink
- Machine Learning Epoch
- Spark Streaming
- Airflow
- Directed Acyclic Graph (DAG)
- Temporal Nexus
- TensorZero
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. WHOOP and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for DBT Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.
[00:01:58] Tobias Macey:
Your host is Tobias Macey, and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures. So, Preeti, can you start by introducing yourself?
Preeti Somal:
Hi, everyone. Glad to be here chatting with you today. My name is Preeti Somal, and I run engineering at Temporal Technologies. We are the pioneer behind durable execution. Prior to this, my background has been a lot of enterprise software, most recently at HashiCorp.
[00:02:30] Tobias Macey:
And do you remember how you first got started working in the area of data?
[00:02:35] Preeti Somal:
Yeah. I think my first exposure to data was actually at Yahoo. I was at Yahoo for four years, and that's when I learned all about the wonderful world of Hadoop and data. I was more on the systems management side, and one of our dreams was to get all of the monitoring data into the Hadoop clusters and kinda see what magic comes out of that.
[00:03:00] Tobias Macey:
And now you're working at Temporal, which is definitely at the forefront of this space of durable execution and fail-proof application design. And I'm wondering if you can just give a bit of an overview of what that term means when somebody says durable execution, and some of the ways that it changes the way people think about their overall system architecture.
[00:03:23] Preeti Somal:
Absolutely. So the way we like to explain this is: what if your application crashes and the crash is inconsequential? And that's exactly what durable execution gives you. The goal here is to offload the developer, the engineer, from all of the heavy lifting that goes into building reliable, resilient applications, and take all of that work and deliver it through a platform. And that platform is Temporal, and that's the core of the durable execution value that we are delivering.
[00:04:02] Tobias Macey:
And so in terms of that resiliency to errors and failure, obviously not every error is one that you can automatically recover from, but many of them are just due to transient issues, whether those are brief network outages, an application node going down to get restarted because of a rollout of a new version, or maybe a configuration error that needs to be addressed before you can retry. And I'm just wondering if you can talk to some of the ways that, as you are building on top of something like Temporal, it forces you to think about what those different failure modes are and what the appropriate behaviors are.
[00:04:43] Preeti Somal:
Yes. Absolutely. So the main thing to talk through here is that Temporal is a programming model, and our approach has been to build really idiomatic SDK support. So we have support for Temporal in multiple languages. And as developers sit down and understand Temporal, the two key elements are to think about how your application is structured vis-à-vis a concept called a workflow, and, as part of that model, to put the error-prone pieces of the application into something we call an activity. And so that starts creating a separation of concerns between the logic of a multistep process and encapsulating a step that could be error-prone into an activity. And from there, you can just set policies around how many times you want to retry, exponential backoff, you know, all of the decorators and the power around what you want to do with respect to handling that error. But you don't actually need to build that logic. So one thing we find is that in any application code, roughly 60% of the code is the scaffolding around the error-handling pieces of it, and that just goes away with Temporal.
And so you get much better readability. You get the ability to focus on just writing your business logic, and Temporal will handle everything else for you.
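As a rough illustration of the scaffolding that a retry policy replaces, here is a plain-Python sketch. This is not the Temporal SDK; the decorator, its parameters, and the flaky function are all invented for illustration of the hand-rolled code that a declarative retry policy lets you delete.

```python
import time

def with_retry(max_attempts=3, initial_backoff=0.01, backoff_coefficient=2.0):
    """Hand-rolled retry scaffolding of the kind a durable execution
    platform's retry policies let you delete from your business logic."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            delay = initial_backoff
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # retries exhausted; surface the error
                    time.sleep(delay)
                    delay *= backoff_coefficient  # exponential backoff
        return wrapper
    return decorator

calls = {"n": 0}

@with_retry(max_attempts=3)
def flaky_fetch():
    """Stand-in for an error-prone 'activity': fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "payload"

print(flaky_fetch())  # → payload (after two retried failures)
```

With a durable execution platform, the body of `flaky_fetch` is roughly all you would write; the retry loop, backoff, and attempt counting move into a policy.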
[00:06:24] Tobias Macey:
And with that focus on reliability and high uptime, and the separation of concerns between the reliability and state management versus the business logic, I'm curious how you're seeing that impact the way that data teams are thinking about the use of Temporal within their work, whether that is building different ETL pipelines, or managing state storage for some sort of data-oriented application, or various other use cases.
[00:06:56] Preeti Somal:
Absolutely. And so, really, when you talk about the data aspect of this, you know, a pipeline and managing state, what we are seeing is that all of these tasks are multistep processes, and there is a lot of coordination and state management that goes into them. As well as, you know, if an error happens, do you restart from the start? And that could be really expensive, especially if you're training models and doing task management on really expensive GPUs, etcetera. The overall job of the engineer becomes much simpler, because they're able to logically think about that pipeline in terms of the steps it has and then build those steps out using Temporal, without needing to worry about state management or queues or checkpointing. All of the complexities that aren't related to the task at hand just sort of go away.
And so what we are seeing, especially given all of the data hunger of AI applications, is that the number of customers running their data pipelines on Temporal has really grown exponentially. And then the other piece of it is that they're able to go faster. So one of the interesting dimensions here to talk a bit about is that, often in software engineering, developer productivity and reliability are considered to be at odds with each other. And we really think that's a false dichotomy. With Temporal, we believe, and we have customers attesting to this, that you can increase developer productivity while bringing that reliability and scale as well.
[00:08:59] Tobias Macey:
And then digging a bit more into some of those foundational primitives of Temporal and the overall space of durable execution: you mentioned checkpointing, which is something that has very specific meanings in different contexts. If you're doing machine learning, you'll typically checkpoint at the completion of each epoch of a training run. In systems such as Flink, there are checkpoints once you're done processing a particular window. When you're dealing with something such as Spark Streaming, it has the idea of microbatches, where maybe you're going to checkpoint at the completion of each batch. And I'm wondering, as you work with data teams who are thinking about where and how to incorporate something like Temporal, how does that change the ways that they think about the foundational primitives in the tools that they're relying on, and maybe some of the ways that they can reduce that reliance and, in some cases, even move to a simpler tool because Temporal handles some of that heavy lifting?
[00:10:01] Preeti Somal:
Yeah. And that's a great question, because I think what it teases out is that a lot of the tools you mentioned were built specifically for the data space. And as you know, Temporal is a general-purpose durable execution platform. And while we are getting a ton of usage in data, out of the box our abstractions are not extremely opinionated about how data engineers should be structuring their workflows. Right? So a checkpoint is a term that actually doesn't even show up in our abstractions. But depending on how the team at hand is thinking about their needs, they can implement it using the signals, the activities, the tasks, and the rest of the abstractions that the Temporal programming model delivers.
[00:11:01] Tobias Macey:
And one of the higher-level primitives in Temporal is the concept of a workflow, which is a sequence of tasks to complete. I know it also has a concept of activities. Particularly when you're dealing with the data ecosystem, workflows again have a very specific meaning, which is generally incorporated into the idea of a data orchestration engine that handles the sequence of steps in a particular directed acyclic graph, or DAG. And I'm wondering, as people are starting to adopt Temporal, maybe for application use cases, how that shifts the ways that data teams are thinking about their usage of orchestration engines, the role of the orchestration engine in the management of workflows and the state related to those workflows, and also, as application teams and data teams are building on that same substrate, how it brings those use cases closer together.
[00:11:58] Preeti Somal:
Yeah. And, you know, really great question. Again, I think that really highlights the power of the platform. So, clearly, Temporal is code-first, and a lot of the tooling that exists in the data-specific space is oriented around DAGs. Right? And so we believe, and what we're seeing is, that the code-first approach lets you reason about the logic in a much more compelling way and provides the flexibility and scale that you need as your pipelines get more and more complicated. In fact, we had a talk at our conference where a particular customer, who in their case was using Airflow, moved from Airflow to Temporal. The first phase of that was that they actually just built a DAG-to-Temporal-workflow mapping to get their teams up and running with Temporal and familiar with it. And then as they learned more, they started taking out the DAG pieces and being code-first. So I think that is a super compelling approach here. And then I think the second piece, and this is where we see a lot of the use cases that illustrate both the AI application and data domains coming together, as well as the more real-time components, is being able to send signals, for instance, from your data pipeline processing to your application.
And I don't know if you've ever looked at Nexus, but one of the main advances we are seeing with durable execution is that, as teams are building their workflows, the data team might want a way to invoke something in the application tier, or the other way around. And Nexus is an extension of Temporal that allows you to make these calls across boundaries in a secure way. So we really believe, and we're seeing patterns showing, that building both the data pipelines and the application code on the same platform allows a lot more richness in the end application that the consumer is seeing.
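To make the DAG-first versus code-first contrast concrete, here is a plain-Python sketch. No orchestrator SDK is involved; the task names and both runners are invented purely to show the same three-step pipeline expressed both ways.

```python
# DAG-style: the structure lives in a data declaration, separate from the logic.
dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}

def run_dag(dag, tasks):
    done, order = set(), []
    while len(done) < len(dag):
        for name, deps in dag.items():
            if name not in done and all(d in done for d in deps):
                order.append(tasks[name]())  # run once all dependencies finished
                done.add(name)
    return order

# Code-first: the control flow itself is the workflow; branches, loops, and
# error handling are ordinary code, which stays readable as pipelines grow.
def code_first_pipeline(tasks):
    raw = tasks["extract"]()
    clean = tasks["transform"]()
    loaded = tasks["load"]()
    return [raw, clean, loaded]

tasks = {"extract": lambda: "E", "transform": lambda: "T", "load": lambda: "L"}
print(run_dag(dag, tasks) == code_first_pipeline(tasks))  # → True
```

The migration pattern described above amounts to first wrapping an existing `dag` declaration in a workflow, then gradually dissolving it into direct code like `code_first_pipeline`.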
[00:14:33] Tobias Macey:
That's also interesting as we start to dive into some of the areas of AI engineering and AI systems design because I think that's another factor that is pushing these teams closer together where, for a long time, application teams would focus on their user facing applications. Data teams would handle their exhaust and try to turn it into insights to the organization that would maybe then get piped back into the application. And machine learning and AI teams would focus on using those data assets to turn them into machine learning models either for business optimization or for user facing features.
And with the introduction of generative AI systems, it forces all of those teams to work much more closely together, because the cycle time is much faster and experimentation can happen much more quickly. And I'm just wondering how that substrate of Temporal being able to work across those boundaries also factors into the organizational realities as we're bringing AI more into the inner loop of the business.
[00:15:43] Preeti Somal:
Yeah. Absolutely. I think, first and foremost, it is allowing that common durable execution platform to be used across multiple domains, which, as you were pointing out, were historically pretty segregated. And, you know, even the feedback loops there were weeks and months, as opposed to the need now to be as close to real time as possible. And especially with the AI use cases, what's definitely happening is a pattern around identifying the common data prep elements that are needed, those coming into place, and then the app teams iterating much more quickly and driving feedback into that data prep layer. And being able to do that in a way where you can actually break down the silos and have, you know, an RPC level, for lack of a better word, to reach across what were historically hugely firewalled and ring-fenced domains.
What we're seeing is that this is really expediting the time to deliver these capabilities.
[00:16:58] Tobias Macey:
And when data teams are faced with the technical decisions of how to implement a particular use case, particularly if they're dealing with a step-based workflow, what are some of the heuristics that you're seeing them use when they are trying to decide: okay, I have my orchestration engine, whether that's Airflow, Dagster, Prefect, what have you, and I have Temporal because the application team is using it, or maybe they're starting to use it for some specific data use cases. I'm just wondering how they figure out that decision point of which features they need and which tools they're going to use for them, and then, in particular, what are some of the hybrid opportunities for integrating Temporal with that orchestration engine?
[00:17:46] Preeti Somal:
Yeah. So what we are seeing is a couple of patterns. One pattern, as Temporal is getting more and more adopted, with the amount of community and developer love that we are seeing for Temporal, is that we have these champions for Temporal within organizations. So the first pattern we're seeing is just a sheer grassroots adoption pattern, where you've got an engineer who had a great experience using Temporal, is completely a fan of durable execution, and is going and helping other teams understand how to think about Temporal. Right? The second pattern we're seeing is within some organizations where there are senior architects, principal-level engineers, who are thinking ahead in terms of the dependencies and the hybrid nature of these applications.
They are the ones who are bringing Temporal and Nexus into the organization. And then I would say the final pattern that we are seeing quite a bit of is data teams that, you know, started with Airflow, for instance, and it's just not scaling for them. The schedules aren't running as expected. They're missing the reliability and scale, and they're looking for a solution that is proven at scale. So for a lot of our customers that are in the camp of migrating from one of the existing products to Temporal, the common theme is the scale and reliability that they need.
[00:19:33] Tobias Macey:
And moving into the AI use cases, as data teams are starting to come to grips with the actual requirements of the AI application, or they're trying to feed the appropriate data to the teams who are building maybe an agentic use case, what are some of the ways that Temporal simplifies the workflow of doing that iteration or decomposing the state requirements for the AI application, and just some of that interface between the data preparation and data curation stage and the actual activation stage in the context of an AI system?
[00:20:17] Preeti Somal:
Yeah. So one pattern here that we're seeing is the incremental data prep and signaling. We also have some use cases where the data prep needs a human in the loop. We have a customer that we were talking with recently that actually wants to have what they call accountability markers in the final stage of the data prep before it gets surfaced to the application. And that marker could be either a human or a validation system of some kind. So what we're seeing is that there's a multistage, complex flow here that brings in these requirements around accuracy and trust as well, which are really easy to implement with Temporal, again, because it is a general-purpose durable execution platform with a very powerful programming model around it. One other use case we haven't talked about yet is that we're also starting to get used for just the pure task management and scheduling on the data prep side of the house, and signaling to the application that a new batch of data has just come in is another really interesting example. So one of the case studies we've published is with a customer that uses us for medical records transcription and kind of real-time visit summary preparation as well.
And you can imagine, you know, there are a number of pieces in there that also relate to compliance. And so the core thing that I guess I'm trying to articulate is that the complexity and the requirements, as you look at how data feeds these applications, are growing pretty large, because these domains are coming closer together. And that's where needing a platform that can help you build that simply is really compelling.
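The human-in-the-loop "accountability marker" idea can be sketched in plain Python, with a `threading.Event` standing in for the durable signal a workflow would actually block on. All names here are hypothetical, and a real workflow would survive a process crash while waiting, which this sketch cannot.

```python
import threading

approval = threading.Event()   # stands in for a durable workflow signal
result = {}

def data_prep_workflow():
    prepped = {"rows": 128, "status": "prepped"}  # final data-prep stage done
    approval.wait(timeout=5)                      # block until the marker arrives
    prepped["status"] = "approved" if approval.is_set() else "timed_out"
    result.update(prepped)                        # only now surface to the app

worker = threading.Thread(target=data_prep_workflow)
worker.start()
approval.set()    # the human reviewer (or validation system) signs off
worker.join()
print(result["status"])  # → approved
```

The point of making the wait durable is that the approval can arrive hours or days later, after deploys and restarts, without losing the prepped state.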
[00:22:42] Tobias Macey:
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? Datafold's migration agent is the only AI-powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months-long migration nightmares into week-long success stories.
Another aspect of the system design and architecture that I'm curious about when you're building with Temporal: we've been talking about Temporal as a means of state management and durable execution, whereas data engineering as a discipline is entirely concerned with very stateful assets. And a lot of times, the scale of that state is the core of the problem, where you need to deal with terabytes, petabytes, exabytes of data writ large, and Temporal, being a primarily database-backed system, I imagine, is more focused on state at a much smaller scale, on the order of bytes, kilobytes, megabytes. And I'm curious how that factors into the ways that people are using Temporal's state management in juxtaposition with the much larger scale of state management that some of these broad data systems require to be able to operate.
[00:24:32] Preeti Somal:
Yeah. That's a great question. The main thing to note here is that the way the Temporal model works, the state that Temporal is managing is the state of where your workflow and activities are. The beauty, the elegance of our model is that you as the engineer are running what we call workers. Your code runs in your environment, and you don't have to ship all the data over to us. The data resides, you know, where it needs to reside. Your workers, the code that gets built using the Temporal SDK, can run in your environment. In fact, we want them to run in your environment so that they can sit as close to the data as you need them to. What we are managing is the orchestration of the tasks and the activities that the workers are running. And this is a really key point for a number of reasons.
One, it really enables a super elegant security model, where on the Temporal side, we don't see your data. We are just seeing input and output parameters to your workflows, or pointers to S3 buckets, or, you know, whatever you need in terms of the workflow execution context. And that, too, is encrypted by you, and you are the only one who has the keys to it. So what you're passing to us is very small and, as far as we're concerned, is garbage. And the second main reason this is really critical is that this is exactly why Temporal can scale and doesn't hit the limits that we see other systems hitting: because what we manage is purely the task and workflow execution state, not your business-specific or data-application-specific things. Right?
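A toy sketch of the "pass pointers, not payloads" idea described here. This is plain Python; the in-memory store, the URI, and both functions are made up to show how the orchestrator can hold only a tiny reference while the worker, running next to the data, does the heavy access.

```python
# Hypothetical in-place data store: the dataset never leaves your environment.
DATA_LAKE = {"s3://bucket/batch-42.parquet": [1, 2, 3, 4]}

def start_workflow(payload_ref: str) -> dict:
    """What the orchestrator sees: a tiny opaque reference, never the dataset."""
    return {"workflow_input": payload_ref,
            "size_bytes": len(payload_ref.encode())}

def worker_activity(workflow_input: str) -> int:
    """The worker runs next to the data and dereferences the pointer locally."""
    rows = DATA_LAKE[workflow_input]  # the heavy data access stays worker-side
    return sum(rows)

task = start_workflow("s3://bucket/batch-42.parquet")
print(task["size_bytes"] < 100)                  # orchestrator state stays tiny
print(worker_activity(task["workflow_input"]))   # → 10
```

In the real system the reference would additionally be encrypted client-side, so the server stores bytes it cannot read at all.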
[00:26:35] Tobias Macey:
And because you are using that worker to manage execution, it allows you to use your other tools that are accustomed to doing that heavy lifting. And I'm wondering, then, what are some of the key pieces of state or information that are useful to maintain in Temporal for being able to recover from failure, or just some of the ways that people are using that statefulness, maybe going back to our earlier conversation about things like checkpointing in Flink, to be able to handle that without necessarily reaching for some of those heavier-weight, more complicated engines, because you're able to use Temporal to maintain the smaller state for those executions?
[00:27:23] Preeti Somal:
Yeah. Absolutely. So the core of how Temporal works is that the worker runs in your infrastructure, and on the Temporal server side, we have the notion of a task queue. The worker is essentially long-polling the Temporal server: give me my next task, give me my next task. Right? And the "give me my next task" takes with it whatever bare-minimum context parameters you need to run that task. So what Temporal on the server side is doing is maintaining the state around where the worker is, what tasks it has executed, what's the next one to be dispatched, and so on. And so if a crash happens, we can pick up from exactly the last task that was dispatched, and we call this capability replay. It is literally not replaying the entire sequence of events. It is essentially just picking up from where it fell over and then running through the next set of things. And because it has enough of the context around what was executed, what's next, and what the inputs and outputs were, we can just tell your worker: here's where you were, get going. I know it's a hugely simplified description, but I'm hoping that helps with that question.
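A highly simplified sketch of that pick-up-where-you-left-off behavior. This is plain Python, not the Temporal server; the "event history" is just a list, and the step functions are invented stand-ins for dispatched tasks.

```python
def run_workflow(steps, history, crash_at=None):
    """Execute steps in order, skipping any already recorded in the history."""
    for i, step in enumerate(steps):
        if i < len(history):
            continue                  # already completed before the crash: skip
        if crash_at is not None and i == crash_at:
            raise RuntimeError("worker crashed")
        history.append(step())        # record completion in the durable history

executed = []
steps = [
    lambda: executed.append("extract") or "extract",
    lambda: executed.append("transform") or "transform",
    lambda: executed.append("load") or "load",
]

history = []                                  # server-side durable state
try:
    run_workflow(steps, history, crash_at=2)  # crash before the "load" step
except RuntimeError:
    pass
run_workflow(steps, history)                  # a new worker picks up at "load"
print(executed)  # → ['extract', 'transform', 'load']  (no step ran twice)
```

The key property being illustrated: after a crash, only the uncompleted step runs, because completion is recorded outside the worker's memory.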
[00:28:50] Tobias Macey:
Yeah. That's definitely useful. And then moving more into that AI system design and digging a bit deeper: one of the newer requirements that a lot of data teams are maybe less familiar with is the introduction of things such as vector databases, the corpus management for the knowledge bases that the LLMs rely on, and the requirement to maintain the freshness of that information to ensure that you're not feeding old data or bad information to the consumers of that AI system. What are some of the ways that having that durable execution pattern simplifies or enables those teams to be able to build and experiment and maintain that corpus?
[00:29:36] Preeti Somal:
Yeah. I think, again, you know, at the end of the day, I think the main thing durable execution brings here is the rigor around thinking about the state and separating out the sort of the steps of the workflow and activities. Right? And so you would, you know, you'd still run your VectorDB in your own sort of infrastructure, your own accounts, and, you know, you you can have, like, an activity wrapper that can invoke that and and get the context from there and do the checks on freshness, etcetera. One thing we haven't done yet, and this is this is a definitely a topic of conversation internally is this question of, you know, temporal is a general purpose durable execution platform. Should we be thinking about building data specific abstractions that really take some of these patterns we're seeing and helps developers do that more easily?
And, honestly, for us, this is a big discussion, because we have stayed a little agnostic around being very opinionated about things, and we feel like that really helps with a lot of the developer empowerment and creativity and control over the way they wanna implement their use case. But there's definitely a question on the table for us around these abstractions: should we be thinking about building some abstractions that make that data-prep pattern a little more in-product and out of the box, versus maybe a best practice or a sample or a demo?
[00:31:20] Tobias Macey:
And when I was preparing for this conversation, I was reading through some of the blog posts on the Temporal blog about the ways that using Temporal as the state management for agentic applications reduces some of the complexity of actually building those systems, because it can be the system of record for the various conversational flows, for the executions and tool uses of the models. And then, in the event of a failure of an agent to successfully execute a tool call, you have that complete state, whereas a number of the agentic frameworks want to be the owner of that information, or maybe it is, by default, all going to be in process or in memory. And I'm curious how, with the introduction of durable execution, those frameworks can be used more effectively, or just some of the ways that the introduction of this pattern is shifting the ways that people are thinking about the design of those types of systems.
[00:32:40] Preeti Somal:
Yeah. Absolutely. You know, to someone who is spending their day and night thinking about durable execution, the fact that it's all in memory and a crash might mean the user has to start all over again, that sends shivers down my spine for sure. So what we're seeing is that the frameworks don't have durability in place, and this is where we're doing integration. We have a first-party integration with the OpenAI Agents SDK, for instance, that brings durability into the picture. We are also, honestly, seeing customers building agents without needing to use frameworks.
Right? And so customers are building these agents just on top of the durable execution abstractions and foundation. In particular, the interesting thing is we are still early, in the sense that we believe these agents are going to need to be even longer lived. For instance, there's no reason why an agent I use has to be an interactive agent. What I wanna be able to do is give the agent some work, go away, and come back after whatever amount of time and get my results. And so the agent patterns are very much the asynchronous, long-running, durable execution patterns that Temporal has been solving for a very long time, and we're starting to see that value creation coming through now. So, for the set of developers that do wanna use frameworks, we are integrating with frameworks and can bring durable execution to them. But we're also seeing the use of Temporal in just building agents.
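The durable agent loop described here can be sketched in plain Python: each model or tool call is a discrete step whose result is journaled, so a crash resumes mid-conversation instead of starting over. This is an invented illustration of the pattern, not the Temporal SDK; the fake "LLM" and tools exist only for the example:

```python
# Illustrative durable agent loop: every step's (action, result) pair is
# persisted in a journal before moving on, so a restart replays completed
# steps from the journal rather than re-executing them.

def agent_loop(llm, tools, goal, journal):
    """Run the agent until the LLM decides 'done', journaling each step."""
    step = 0
    while True:
        key = f"step-{step}"
        if key in journal:                   # completed before a crash:
            action, result = journal[key]    # replay from the journal
        else:
            action = llm(goal, step)         # decide the next tool call
            result = tools[action](goal) if action != "done" else None
            journal[key] = (action, result)  # persist before advancing
        if action == "done":
            return [journal[f"step-{i}"] for i in range(step + 1)]
        step += 1

# --- usage ---
calls = []

def fake_llm(goal, step):
    # Scripted "model" that plans: search, then summarize, then stop.
    return ["search", "summarize", "done"][step]

tools = {
    "search": lambda g: calls.append("search") or "3 results",
    "summarize": lambda g: calls.append("summarize") or "summary",
}

# First run "crashes" right after the search step was journaled:
journal = {"step-0": ("search", tools["search"]("q"))}

# Resume with the same journal: search is not re-executed.
trace = agent_loop(fake_llm, tools, "q", journal)
print(trace)  # [('search', '3 results'), ('summarize', 'summary'), ('done', None)]
print(calls)  # ['search', 'summarize'] -- the search tool ran only once
```

The journal doubles as the audit trail Preeti mentions next: the full sequence of tool calls and results survives the process that produced them.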
And the piece that you were referring to, which I haven't talked much about, but which is a really compelling part of Temporal, is that you can go into Temporal and look at the execution of your workflow, and you can see exactly what services were called, what LLM calls were made. That visual observability piece is a big part of the value. And you can also export this history, and we have customers that are using that for their audit and compliance needs as well.
[00:35:04] Tobias Macey:
One of the other patterns that I've seen for a similar use case is to proxy through an LLM gateway to be the system of record for all of the interactions between your application and the LLM API. TensorZero is one in particular that I'm familiar with that actually uses a ClickHouse database as that state store and will then use that information as a means to execute reinforcement learning, to do fine tuning of the model that you're using, and to improve efficiency and effectiveness. And I'm curious how Temporal, or teams that are using Temporal, are maybe addressing similar use cases, or what you see as some of the trade-offs between those two approaches of either using Temporal as that state store versus proxying through the LLM gateway and using something like a dedicated database as that state store?
[00:35:59] Preeti Somal:
Yeah. I think it depends on what you're trying to achieve here. That use case around the TensorZero piece, you could build that pretty easily on Temporal, and the value of building it on Temporal would mainly be getting the resiliency and the error handling, all of these pieces that durable execution brings forward. How you make the decision on whether you use what we would call a more opinionated offering versus a general-purpose platform is really dependent on what the end goal is that you're trying to achieve.
But we do see this pattern where customers are using Temporal to call the LLM. And, again, the benefit really is that the actual call gets done from your worker that runs in your environment. So if you've got the complexity of use cases around privately hosted models or data privacy pieces, etcetera, the Temporal model lets you have full control over the output of what the LLM is bringing back, where you store it, and how you compare it and check it and validate it, and so on.
[00:37:18] Tobias Macey:
I think one of the key aspects of Temporal as the state store is that it is also a programmatic substrate, versus something that is maybe more opinionated or constrained in terms of what it is expecting to do. And so it gives you broader flexibility in how to take advantage of that state without necessarily having to do additional integration work to reuse it, so you can actually use the data in situ rather than having to do an extraction, a transformation, a reimport, or some more roundabout means of using it. And I'm curious how that changes the ways that teams are selecting the supplemental tools that they're relying on once they do start using Temporal.
[00:38:06] Preeti Somal:
Yeah. It's a great point, because I think the core message here is that Temporal is a code-first developer tool. Right? And so, as I was saying about the programming languages, our goal is to meet the developer where they are, in the language of their choice. And so the value here is that you can fit your Temporal usage into your existing engineering practices, your CI/CD pipelines, how you do testing. We wanna be able to fit into your processes without inducing more overhead for you. And I think that is a big part of the decision-making criteria that goes in here, because we aren't going to come in and mandate the use of a specific programming language, and we're gonna give you the power of the SDK.
Now the flip side of that conversation always is, well, is an opinionated system faster to use, or is that better for me? And in a lot of cases it might be, and that's totally fine. For us, again, what we are building is a platform where we believe that developers can use the power of the platform as they see appropriate and fit it in with how they are building their software today.
[00:39:29] Tobias Macey:
And for teams who are starting to adopt Temporal, or who are figuring out how best to design around it, what are some of the key primitives or key design patterns that you see teams either struggling to understand or maybe not using in the most idiomatic manner? And what are some of the useful references or pieces of advice that you have for teams who are starting to tackle that design phase of: okay, Temporal seems great, durable execution seems like it would be really useful, but what do I actually do to get started?
[00:40:08] Preeti Somal:
Yeah. So this is a really interesting topic, because the first reaction that people have is, like, it's almost magic, that kind of a reaction. Right? And what we see is that once someone understands Temporal, they cannot look back. They are essentially fully on board with the programming model. But it requires them to have an open mind, because as an engineer, you've been trained for years to be thinking about all of the error scenarios, and you've got all of this code that is gonna deal with all the reliability pieces. And then someone comes and tells you, oh, you don't need any of that anymore. Of course, there is some element of skepticism and some unlearning that needs to happen here. Our recommendation always is to just get your hands dirty: try it, run some samples, read the blogs. Inherently, you've gotta try it. And once you understand the power of the platform, then your design decisions become hugely simplified.
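The "you don't need that scaffolding anymore" point can be made concrete with a small sketch: instead of a hand-rolled try/except/backoff loop around every call, the step declares a retry policy and the runtime re-invokes it. This is a plain-Python illustration with an invented decorator, not the Temporal SDK (whose activities accept a declarative retry policy in a broadly similar spirit):

```python
# Invented decorator standing in for a declarative retry policy: the business
# code stays a plain function, and retries live outside it.
import functools

def with_retry_policy(max_attempts):
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise           # exhausted: surface the failure
        return wrapper
    return deco

attempts = []

@with_retry_policy(max_attempts=3)
def flaky_charge(order_id):
    """Business logic only -- no retry code inside the function body."""
    attempts.append(order_id)
    if len(attempts) < 3:
        raise ConnectionError("payment gateway timeout")
    return f"charged-{order_id}"

result = flaky_charge("o-1")
print(result)         # charged-o-1  (succeeded after two transient failures)
print(len(attempts))  # 3
```

In a durable execution system the retries additionally survive process crashes, which is what the hand-rolled version cannot do.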
One other thing that we hear a bunch about, especially as we're talking to more of the VP of engineering, CIO-type audience: one of their first questions is, okay, what does this replace? And what they're trying to do is pattern match. Does this mean I can get rid of my queue here, or I can do this or that? And the answer is really "Temporal is yes, and more." You have to really start with the initial set of use cases and learn. And from there, what we find is that engineers are applying Temporal across all these domains that we didn't quite imagine they would think of use cases for.
[00:42:00] Tobias Macey:
And as teams are coming up to speed with temporal or maybe they're evolving in terms of their sophistication of its use and the level of integration into their systems, what are some of the most interesting or innovative or unexpected ways that you've seen temporal and the durable execution pattern used in these data and AI workloads?
[00:42:20] Preeti Somal:
So one of the ones that I find fascinating and interesting is a use case that came up where someone was intentionally killing their workers because they wanted to optimize the usage and the cost of the workers. They would just go in and kill the worker, knowing that Temporal could pick up wherever that worker left off. I thought that was fascinating. The other interesting thing that I'm seeing, and, you know, I have a tremendous amount of respect for folks in the data field, is just the sheer volume of the data processing that's happening and how having a production-ready, at-scale orchestration and durable execution platform is condensing the time it takes. We had a customer tell us that the pipelines they were running took, like, eighteen hours. They've been able to condense that down to five minutes, because Temporal has forced them to think about what all the various steps in the process are and how they can sequence those steps so that they can actually go faster.
And if something fails, they don't need to start all over again.
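The "decompose the steps, then sequence them" idea behind that eighteen-hours-to-five-minutes story can be sketched with plain asyncio: once a pipeline is split into independent steps, they can fan out concurrently instead of running back to back, and a failed step can retry alone. The step names and durations here are invented, and this is not the Temporal SDK:

```python
# Illustrative fan-out pipeline: extract first, three independent transforms
# concurrently, then load once all transforms finish.
import asyncio

async def run_step(name, duration, log):
    await asyncio.sleep(duration)   # stand-in for real work
    log.append(name)
    return f"{name}-ok"

async def pipeline():
    log = []
    await run_step("extract", 0.01, log)          # must finish first
    results = await asyncio.gather(               # independent steps fan out
        run_step("transform-a", 0.01, log),
        run_step("transform-b", 0.01, log),
        run_step("transform-c", 0.01, log),
    )
    await run_step("load", 0.01, log)             # waits for all transforms
    return log, results

log, results = asyncio.run(pipeline())
print(log[0], log[-1])   # extract load
print(sorted(results))   # ['transform-a-ok', 'transform-b-ok', 'transform-c-ok']
```

The wall-clock win comes from the middle stage taking one step's duration instead of three; in a durable execution setting, each step's result would also be persisted so a failure retries only that step.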
[00:43:37] Tobias Macey:
And as you have been working in this space and working with the company and the community around Temporal and understanding more deeply the capabilities that it provides and the use cases that it can be applied to, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:43:59] Preeti Somal:
Wow. So I think the main lesson for me is that, if you really think about it, what durable execution and Temporal do is put the onus of reliability on the Temporal server. And, of course, our business model is one where we run Temporal Cloud, and our responsibility there is really high, because the core of the promise we're making is: we will handle reliability for you. And the way we do that is by putting that problem in the Temporal server. So, of course, the Temporal server has to be incredibly reliable.
And I think that is really the main lesson. It's not surprising when you think about it, but living that with a cloud delivery model, especially with some of the outages that we've been seeing, and making sure that we live up to our promise there: that's something that we think about constantly.
[00:45:04] Tobias Macey:
And for people who are designing systems, thinking about how best to manage the reliability of their data workflows or their AI applications, what are the situations where you would advise against the use of temporal or durable execution as a pattern?
[00:45:22] Preeti Somal:
Well, I know you used the word "reliability" in framing the question, so it makes it harder for me to answer. I would say situations where a crash happens and you really don't care about it, you may not need Temporal at that point. But anytime you need to scale and be reliable, we do believe that you should be looking at Temporal and the value it brings there.
[00:45:54] Tobias Macey:
And I guess put another way, what are the situations where the incorporation of temporal adds excessive complexity or unnecessary coordination to an application design?
[00:46:08] Preeti Somal:
Again, I think if you're doing early prototyping and you have not established your business value yet, or you've just got some toy agents and you're trying to figure out what's really gonna be the core IP, maybe you don't need Temporal there. This notion of complexity is something that we are working towards doing more DevRel education on, because we really do believe that the complexity argument is a false one here, because the whole premise of Temporal is to make your life easier.
And so the question really becomes, where is that complexity coming from? Is it learning Temporal? Is it running Temporal? And we really believe that, with all the progress we've made, it should be a non-argument. But, of course, I work here, and I'm willing to be convinced otherwise and to continue working towards improving.
[00:47:19] Tobias Macey:
And you mentioned already that you are trying to combat the tendency to form strong opinions or build excessive integrations into temporal for specific use cases. But I'm wondering what are some of the things you have planned for the future of temporal and the durable execution pattern, in particular, with an eye towards how it can be used in these data and AI systems?
[00:47:43] Preeti Somal:
Yeah. So I think the main focus for us is to continue to improve the onboarding experience onto Temporal, and, wherever the durable execution constructs have any rough edges, to make sure that we keep sanding them down. All of this really ties to our core focus on the developer, and so what you will see from us is continuing engagement on developer pain points and scenarios. I'll give you a concrete example: versioning of workflows. How do you deploy new versions? How do you incrementally roll out traffic across those versions? Maybe there was a long-running workflow on a previous version, and you wanna make sure that you can execute multiple versions and wait till that workflow has completed. We're doing a lot of product work around simplifying that whole space for the developer community.
So that's the kind of thing that you'll see from us: just listening to our community and the developers, and working hard to improve the overall experience.
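The workflow-versioning problem Preeti describes has a well-known shape: executions that started on old code must keep taking the old branch when replayed, while new executions take the new one. The following is a plain-Python illustration of that idea using an invented patch-marker check, not the Temporal SDK (whose workflow APIs expose a patch mechanism in a broadly similar spirit):

```python
# Invented sketch of deterministic branch versioning: each execution records
# which patches it has seen, and replays follow that record.

def process_order(seen_patches, is_replay):
    """Workflow body with a patch marker around a changed branch."""
    def patched(patch_id):
        if not is_replay:                # fresh execution: take the new path
            seen_patches.add(patch_id)   # and record that this execution saw it
            return True
        return patch_id in seen_patches  # replay: follow what it saw before

    if patched("use-new-tax-calc"):
        return "total-with-new-tax"
    return "total-with-old-tax"

# --- usage ---
seen = set()
result_new = process_order(seen, is_replay=False)    # new execution
result_replay = process_order(seen, is_replay=True)  # its later replay
result_old = process_order(set(), is_replay=True)    # pre-patch execution

print(result_new, result_replay, result_old)
# total-with-new-tax total-with-new-tax total-with-old-tax
```

The property being illustrated is determinism under code change: the same execution always takes the same branch, no matter which version of the code replays it.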
[00:49:01] Tobias Macey:
Are there any other aspects of durable execution as a pattern, temporal as an implementation of that, and the application of those capabilities to data and AI systems that we didn't discuss yet that you would like to cover before we close out the show?
[00:49:17] Preeti Somal:
Really quickly: we did touch on Nexus a little bit during the conversation, but if there are folks listening to this that haven't checked out Nexus, I definitely recommend taking a look there. Nexus is also in open source, and this is how we believe organizations that have teams working on different parts of the stack can actually build applications that call across boundaries. So that would be my only call-out here.
[00:49:51] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the temporal team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:09] Preeti Somal:
Gosh. Not living in the space, I'm not sure I have a great answer for you. One of the things that does come to mind is around effective use of resources, whether those are costly GPUs, etcetera. More tooling around helping manage those seems like a pattern. We're seeing people use Temporal to solve that, and maybe someday there will be more opinionated tooling around it.
[00:50:43] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences on the ways that temporal and durable execution can be used to simplify the design and implementation of these data intensive systems and ways to reduce the complexity of managing both the business logic and the resilience and reliability. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[00:51:12] Preeti Somal:
Thank you so much. It was a pleasure to chat with you today.
[00:51:16] Tobias Macey:
Thank you for listening. Don't forget to check out our other shows. The Data Engineering podcast covers the latest on modern data management, and podcast.init covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at AI engineering podcast dot com with your story.
Guest intro: Preeti Somal and Temporal Technologies
What is durable execution and why it matters
Temporal's programming model: workflows vs. activities
Impact on data teams: pipelines, state, and retries
From data checkpoints to Temporal primitives
Workflows vs. DAGs and migrating from Airflow
Bridging app and data teams with Nexus and signals
AI engineering brings faster cycles and tighter loops
Choosing tools: orchestration engines and hybrid patterns
AI data prep: human-in-the-loop and accountability markers
Temporal state vs. big data state: scaling and security model
Replay and recovery: how Temporal resumes work
Vector DBs, freshness, and possible data abstractions
Durable agentic systems and framework integrations
LLM gateways vs. Temporal as programmatic state
Developer-first: languages, CI/CD, and flexibility
Getting started and thinking differently about reliability
Innovative uses: killing workers and 18 hours to 5 minutes
Operating Temporal Cloud: reliability lessons
When not to use Temporal and complexity concerns
Roadmap: smoothing edges, versioning, and rollouts
Nexus recap and cross-boundary calls
Biggest gaps in data tooling and resource efficiency
Closing thoughts and appreciations