In this episode of the Data Engineering Podcast Omri Lifshitz (CTO) and Ido Bronstein (CEO) of Upriver talk about the growing gap between AI's demand for high-quality data and organizations' current data practices. They discuss why AI accelerates both the supply and demand sides of data, highlighting that the bottleneck lies in the "middle layer" of curation, semantics, and serving. Omri and Ido outline a three-part framework for making data usable by LLMs and agents (collect, curate, serve) and share the challenges of scaling from POCs to production, including compounding error rates and reliability concerns. They also explore organizational shifts, patterns for managing context windows, pragmatic views on schema choices, and Upriver's approach to building autonomous data workflows using determinism and LLMs at the right boundaries. The conversation concludes with a look ahead to AI-first data platforms where engineers supervise business semantics while automation stitches technical details end-to-end.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Omri Lifshitz and Ido Bronstein about the challenges of keeping up with the demand for data when supporting AI systems
- Introduction
- How did you get involved in the area of data management?
- We're here to talk about "The Growing Gap Between Data & AI". From your perspective, what is this gap, and why do you think it's widening so rapidly right now?
- How does this gap relate to the founding story of Upriver? What problems were you and your co-founders experiencing that led you to build this?
- The core premise of new AI tools, from RAG pipelines to LLM agents, is that they are only as good as the data they're given. How does this "garbage in, garbage out" problem change when the "in" is not a static file but a complex, high-velocity, and constantly changing data pipeline?
- Upriver is described as an "intelligent agent system" and an "autonomous data engineer." This is a fascinating "AI to solve for AI" approach. Can you describe this agent-based architecture and how it specifically works to bridge that data-AI gap?
- Your website mentions a "Data Context Layer" that turns "tribal knowledge" into a "machine-usable mode." This sounds critical for AI. How do you capture that context, and how does it make data "AI-ready" in a way that a traditional data catalog or quality tool doesn't?
- What are the most innovative or unexpected ways you've seen companies trying to make their data "AI-ready"? And where are the biggest points of failure you observe?
- What has been the most challenging or unexpected lesson you've learned while building an AI system (Upriver) that is designed to fix the data foundation for other AI systems?
- When is an autonomous, agent-based approach not the right solution for a team's data quality problems? What organizational or technical maturity is required to even start closing this data-AI gap?
- What do you have planned for the future of Upriver? And looking more broadly, how do you see this gap between data and AI evolving over the next few years?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. WHOOP and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud. Your host is Tobias Macey, and today I'm interviewing Omri Lifshitz and Ido Bronstein about the challenges of keeping up with the demand for data when supporting AI systems. So, Omri, can you start by introducing yourself?
[00:02:09] Omri Lifshitz:
Hi. Glad to be here. I'm Omri, the cofounder and CTO of Upriver.
[00:02:13] Tobias Macey:
And, Ido, how about yourself?
[00:02:15] Ido Bronstein:
Yeah. So hello. I'm Ido, and I'm the CEO and cofounder of Upriver.
[00:02:21] Tobias Macey:
And, Omri, do you remember how you first got started working in data?
[00:02:25] Omri Lifshitz:
Yeah. So my journey started about fifteen years back. It started in the military. I was working in cybersecurity, and I was fortunate enough to work across the entire value chain of cybersecurity operations from doing really low level things, reverse engineering, knowing how we're able to get things running wherever we need them, all the way to building the data pipelines, collecting data from these cybersecurity operations. And a big part of what we had to do is essentially making sure that we're able to bring the right data at the right time, make this usable to all of these people on top of our data platform. So there, we had to deal daily with challenges of how do you maintain huge scale data pipelines and make this accessible to intelligence officers.
[00:03:07] Tobias Macey:
And, Ido, do you remember how you got started in data?
[00:03:10] Ido Bronstein:
Yeah. So in a similar way, I worked in the intelligence unit on cybersecurity operations and then on data infrastructure. I was lucky to lead our internal data platform, the platform that collected all the different data sources that we gathered, human and image intelligence and, like, the variety of data that we have there, and I was in charge of all the layers of the stack, from the infrastructure to the data pipeline management and orchestration and then to the application. And our main goal was to bring data, highly reliable and at a high pace, to the intelligence officers in a way that they could use.
And all of those experiences are what led Omri and me to start Upriver.
[00:03:55] Tobias Macey:
And so in terms of the overall space of data and the growing demands of AI systems, obviously, there is a lot of additional complexity that's getting layered on top of the inherent complexity of dealing with data systems that we've been struggling with for several decades now. But as we add AI to the set of consumers for these various data platforms and data streams, what are some of the ways that you're seeing that introduce a gap, either in terms of capabilities or structure, or just some of the points of friction that we're dealing with as we try to feed this new set of requirements into these new consuming systems that are now increasingly dealing with a broader range of consumers?
[00:04:47] Ido Bronstein:
Amazing. From a high level perspective, AI really accelerates the demand for data in organizations. And you can see it from both sides of the pipelines. First of all, AI enables us to extract data from any digital asset: images, PDFs, conversations. And today, businesses really understand that they can actually use data in every piece of their business. So you have an enormous amount of data the organization wants to use. And it also helps the organization understand that now it is time to use this data because, otherwise, it will fall behind. And on the other side, it helps us make this data much more accessible, because now any person in the organization can take a CSV and put it in ChatGPT, ask questions, and get, like, an amazing analysis. I'm sure that you have had the chance to, like, try to do some finance or marketing analysis using ChatGPT.
Its answers are very sophisticated and really help you understand how to use this data. But the middle layer, the one that takes this vast amount of data that you have in the sources that AI lets you collect and gets it to the point where this data is really usable for the AI, this is still a bottleneck as I see it today in the market. And when we are talking with CIOs and CDOs, this is something that is not solved yet by AI.
[00:06:22] Tobias Macey:
So as you pointed out, the introduction of AI adds new capabilities to the processing and production of data because of the fact that we can bring in more of these unstructured assets that have typically been very difficult to operationalize. And so that's one side of the equation, the other side of the equation being things like RAG pipelines or the whole chat-with-your-documents capability. But as we move more into agentic systems where we need to be able to do things like manage memory state, provide up-to-date information to those agents to make sure that the decisions that they're making or the information that they're providing is actually accurate given the context of the problem that they're trying to solve for, that's two different sides of the coin, where AI can help us accelerate our production of data assets, but it also means that we have a higher demand for those data assets. And so if you're not using AI in that production pipeline, there's a good chance that you're just going to be drowning in requests, or that you're going to be building agents that are constantly failing to provide useful capabilities because they don't have the context that they need. And I'm just wondering how you're seeing teams deal with that challenge. And in particular, when you're talking about using AI in the production phase, how do you prevent that from just causing costs to run rampant?
[00:07:47] Ido Bronstein:
Yeah. So I think that you are on point in how you interpreted what I said. And the way that I see it, these two parts of the pipeline are disconnected, and this is not a new problem. Like, the ability to make sense of even structured data still requires, like, manually curating the data, putting in the right context, joining between different sources, understanding what is true and what is not correct data. And only after you do this process and curate the data can someone really use it. And if it is this time for agents or RAG systems or, like, very complex LLM flows, someone needs to successfully connect between the, like, hundreds of data assets that we collect and the semantics of the business, the right standards for the business. So when you connect this to agents, they will work with the right context and in a smart and correct way. And this is exactly the work of, like, managing the data. This is what we've done in, like, the last twenty years, just focusing on structured data. Of course, you have the equivalent for unstructured data or any other use of data that you want to make. And today, like, this is the point where the agents break, and it's what causes them not to move from the POC stage to the production stage, because the data in production is a mess in its raw form.
And without working on ordering this mess in production, the agents won't work. And by now everyone knows the phrase garbage in, garbage out. And production data in its raw form, and I think this is correct for most organizations, is garbage. Someone needs to curate the data and make sense out of it.
[00:09:43] Tobias Macey:
And the other problem with using AI in that production phase is that it can be very straightforward in some cases to build a proof of concept and say, hey, I threw AI at my data pipeline, and now I've got this text document giving me structured output. But as with any AI-driven system, there is a lot of potential for things to go wrong as you try to scale that and operationalize it and actually start to depend on it for feeding downstream systems. Because as you introduce error rates and you feed them together, those error rates compound. So where maybe at the introduction of AI in doing document processing you're okay with a 2% error rate, where maybe it misinterpreted some of the phrasing of a document that you're processing, as you then do more analysis of that data, particularly in an agentic context, that 2% compounds to 5% or 10%, and then all of a sudden you're playing the worst game of telephone ever, and you're getting complete garbage output to your end users, who are then starting to lose faith in your capability to give them the data that they need. And I'm wondering how you're seeing some of the skills gap manifest in terms of people who are very adept at building production data pipelines, but they don't necessarily understand the requirements of building production AI systems, where maybe they need to manage things like evaluation or model selection, or maybe they need to do fine tuning to be able to increase the efficiency of the model that they're using and drive down cost for something that requires high-uptime usage. And I'm just curious how that's manifesting in teams who are just being told, hey, I need you to be able to process this data now because we have AI, and it can do it. But there are a lot more supporting systems that need to be put in place as well.
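To make the compounding Tobias describes concrete, here is a small back-of-the-envelope sketch. The stage error rates are made-up illustrative numbers; the only point is that independent per-stage errors multiply into a noticeably larger end-to-end error.

```python
# Illustrative arithmetic only: small per-stage error rates compound across a
# multi-stage pipeline, assuming errors are independent.

def end_to_end_error(per_stage_error_rates):
    """Return the overall error rate after chaining the stages together."""
    survival = 1.0
    for rate in per_stage_error_rates:
        survival *= (1.0 - rate)  # fraction of records still correct after this stage
    return 1.0 - survival

# A 2% error at document extraction, 3% at enrichment, 2% at the agentic analysis step.
stages = [0.02, 0.03, 0.02]
print(f"compounded error rate: {end_to_end_error(stages):.1%}")  # roughly 6.8%
```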
[00:11:39] Omri Lifshitz:
Yeah. So I think you nailed it, like, with the difference between doing a POC and saying, okay, I can now process tons of PDFs, and then actually making this production ready. And, like, that's a big thing that teams need to overcome, and one of the things we've needed to deal with when we've built our system, which is based on AI as well. Like, how do you actually make sure that you know how to do these evaluations, how to make sure you're building out the right process? And I think one critical thing that has stayed the same is that good engineering is still good engineering. Like, how do you build out a system that knows how to deal with faults? How do you build out the right process to make sure these things are done correctly? And I think there is still a gap, and even the market is still trying to understand exactly what the right way of doing this is. Doing these evals, using agents and LLMs to check agents. There are a lot of things coming into these systems that we now see, like, in the industry. But I think in general, a lot of teams are moving to this phase where they're starting to understand it requires a different mindset. You need to understand how you productionize these things and not just have POCs on top of them.
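A tiny sketch of what such an eval harness can look like, mixing deterministic assertions with a second model acting as judge. Everything here is hypothetical: the golden set, the agent, and the judge are stubs, not anyone's actual implementation.

```python
# Hypothetical eval harness: score an agent's answers against a small golden set,
# combining plain assertions with an LLM-as-judge check (stubbed out here).
GOLDEN_SET = [
    {"question": "total orders last month", "must_mention": "orders", "expected_table": "gold.orders"},
]

def agent_answer(question: str) -> str:
    """Stand-in for the agent under test."""
    return f"SELECT count(*) FROM gold.orders  -- answers: {question}"

def judge(question: str, answer: str) -> bool:
    """Stub for a second-model judgment of answer quality."""
    return "SELECT" in answer

def run_evals() -> float:
    passed = 0
    for case in GOLDEN_SET:
        answer = agent_answer(case["question"])
        deterministic_ok = case["expected_table"] in answer and case["must_mention"] in answer
        if deterministic_ok and judge(case["question"], answer):
            passed += 1
    return passed / len(GOLDEN_SET)

print(f"eval pass rate: {run_evals():.0%}")
```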
[00:12:47] Tobias Macey:
And so beyond the challenges of using AI and productionizing that to manage your data feeds, there's also the bigger question, and something that I think is broader than just being able to process unstructured data assets, of how do you determine what data is actually going to be useful for an AI agent to perform a given task? And I'm wondering how teams are dealing with that side of the equation as well, of identifying, one, what data assets they have if they don't already have a decent catalog of them, but also, how do they ensure that they're wrapping them in the appropriate semantics for the agent to be able to understand where, when, and how to actually apply them to the problems that they're given?
[00:13:31] Omri Lifshitz:
So that's a great question, and I think this is exactly, like, one of the premises that we had when starting Upriver. If you want to get these models working correctly, you need to give them the right context and the right data. And, oftentimes, organizations might have a data catalog. Usually, that's not updated correctly, I think. Like, a lot of companies have this catalog phobia or fatigue by now. They don't know exactly which assets they have. They don't know what the semantics of the data exactly are when they're trying to use this, especially now pushing this to an LLM to actually take on these tasks. So I think one of the key things that we've done in Upriver, and this now goes to how we've also built our system, is you need to do three different things in order to actually be able to use models effectively and agents correctly. One is collecting the right data. The second, and that's a critical piece, is curating this into something that actually encapsulates and captures the ontology and semantics of what you actually have in your system. And the third is how you serve this to the models, right? So that way you can actually make this usable. And I think if you skip any of these steps along the way, you're just going to get something that doesn't meet your expectations, and then you're probably going to be disappointed with the results you're getting. Because just writing an SQL query is quite easy. If you know exactly how to structure it, what you want to get from the data, that's quite easy to do. Being able to do these fully agentic flows, where we're saying, I want to build a pipeline that now allows me to check so and so, requires you to understand exactly what you have, what this means, how this relates between the different entities in your system, and then how you serve this to the model in the right way. And these are the three components that are critical for actually being able to use AI here and making data available to AI.
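As a rough illustration of the collect, curate, and serve framing Omri describes, the three stages can be pictured as separate functions with their own contracts. This is a hypothetical sketch with invented names, not Upriver's implementation.

```python
# Hypothetical sketch of collect -> curate -> serve. The semantics attached in the
# curate step are what the serve step hands to a model alongside the task.
from dataclasses import dataclass, field

@dataclass
class CuratedAsset:
    name: str
    description: str         # business semantics a model can read
    columns: dict[str, str]   # column -> meaning, units, caveats
    rows: list[dict] = field(default_factory=list)

def collect(sources: list[str]) -> list[dict]:
    """Stage 1: pull raw records from each source (stubbed)."""
    return [{"source": s, "payload": {}} for s in sources]

def curate(raw: list[dict]) -> CuratedAsset:
    """Stage 2: attach the ontology and semantics that raw records lack."""
    return CuratedAsset(
        name="orders",
        description="One row per completed customer order, deduplicated.",
        columns={"order_id": "unique order id", "amount_usd": "order total in US dollars"},
        rows=[r["payload"] for r in raw],
    )

def serve(asset: CuratedAsset, task: str) -> str:
    """Stage 3: render only the context a model needs for this task."""
    schema = "\n".join(f"- {col}: {meaning}" for col, meaning in asset.columns.items())
    return f"Task: {task}\nDataset: {asset.name} ({asset.description})\n{schema}"

print(serve(curate(collect(["billing_db", "crm_export"])), "monthly revenue by region"))
```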
[00:15:15] Tobias Macey:
The other interesting piece is that, generally speaking, as you provide more semantics and more context to the assets that you're producing, it's also beneficial to humans. The main difference being that AIs can process much faster and at a broader scale than an individual human in terms of doing the discovery and the interpretation of which data assets to apply for certain use cases. And as we do open up the doors for these AI systems to do a broader analysis and broader consumption of those data assets, how does that shift the visibility or highlight the gaps in the quality or reliability of data assets that have maybe been not necessarily neglected, but at least not used as actively by humans, who have a much narrower focus of, oh, I'm going to check this dashboard, or I use this data feed for performing this particular task, where now we can actually take advantage of the broader set of data assets that we've been collecting and maybe not paying as close attention to?
[00:16:19] Ido Bronstein:
So I think that you phrased it exactly as I see it. Like, agents and AI need the same things that humans need when they're accessing the data. They need the data to be without mistakes. They need to have the right semantics with a connection to the business context. And without this, agents, AI, and humans cannot really use the data correctly. The thing that changes with AI is, first of all, as you said, it can process much, much more data. And the second thing, it doesn't have, like, external context. It has only the context that it sees when it accesses the data. It doesn't know, like, what some other person told it in the kitchen over a coffee before it comes to do the job. It just uses the context that someone attached to the data.
So in a sense, the problem of how we manage the data, how we create the right context, cataloging, semantics, and create, like, high quality datasets does not really change, but you need to do it at scale and fast. The importance of doing so just accelerates. Now you need to have proper semantics and proper standards for all your data, not only for the tables that the analysts use right now. And I think in that sense, the importance of managing the data and being able to structure it correctly is just more important than ever before.
[00:17:58] Tobias Macey:
Another interesting aspect of bringing AI to bear is that, particularly when we're talking about a data warehouse context, there are established patterns for the structural semantics of the data, whether that's a star schema or a data vault or whichever tribe you are a member of. And I'm curious, with AI as the access layer versus a human analyst who is handcrafting these SQL queries, or a BI tool that is using these dimensional structures to do a visual navigation, are those still beneficial? Are those still the best way for us to be thinking about structuring the overall data assets, or do the semantics and capabilities of these AI systems and AI agents change the ways that we need to be thinking about the foundational structure and the foundational semantics of the data assets that we're producing?
[00:18:57] Ido Bronstein:
Great question. So first of all, I think that nobody has figured it out yet, what the right architecture is for doing an effective data warehouse for AI. I'm sure that there are things that will be preserved. For example, you need to create some kind of intermediate tables in order to be able to create this data efficiently. And I'm sure that the ability to curate the data, like bronze, silver, gold, in a sense, will stay, because we want to clean the data and then put the right semantics on it. But whether a star schema exactly is the correct architecture for AI, I'm not sure.
I think it's something that we see every organization solve differently today, especially trying to stitch AI onto its current architecture, and it works. Like, you don't need to change all of your architecture in order for AI to be able to process data. You just need to make the data reliable and high quality with the right semantics, and we will see how it will evolve.
[00:20:06] Omri Lifshitz:
Yeah. And I will add to that one of the things, like, you did talk about the tribes you have, each one believing they have the right schema and the right way to structure it, snowflake schema, star schema, and so on. I think one of the interesting things we saw while working with companies is that the models are very good at capturing the way in which you've structured your data and working on top of that once you've put a good data model in place. So the need to do it, and to define what your data model is and how you want it to look, that's still something we'll need to do. I think the difference between whether we're doing it this way or that way, the discrepancy will probably become much smaller over time because the models are able to understand it. So there, it's kind of a preference. And as you just said, we're still not sure what the best schema will be and what the best data model will be going forward. But we do see that even when you're using different things, the LLMs are able to capture the essence of the data.
[00:21:03] Tobias Macey:
Another interesting aspect of the fact that we do have these AI systems consuming the data and we need more data to be able to make them as useful as possible is the challenge of also not wanting to flood their context window. Because if you give it too much data, then it goes from being your smart sidekick to your dumb sidekick. And the contrary is also true, where it could be your dumb sidekick if you don't give it enough data. And I'm wondering how you're seeing teams think about that balance as well, of either not feeding it too much data, because it's not able to differentiate the useful stuff from the not useful stuff, or making sure that you're not starving it of data, and just figuring out what that balance is. And, obviously, I'm sure the answer is it depends, but I'm wondering how you're seeing teams go through that discovery process and then maintain that balance as their systems evolve and as their data feeds evolve.
[00:21:59] Omri Lifshitz:
Yeah. So I think one of the critical things here, and I talked a bit about this earlier, is the fact that you have to both curate the context for the kind of task you want to do, and then there is a critical aspect of how you serve this context out. So how do you make sure that you're delivering the right context at the right time? And maybe sometimes that means you want to minimize the context you already have, summarize that, and then move it to a kind of sub-agent architecture, which focuses on something specific. So there are a lot of things you need to deal with there in order to actually solve the issue. And that means, how do I map the context that I want to bring together for a certain task? Once I have that, how do I want to continue maintaining this over a conversation and a task for an agent? There are a lot of different nuances there. And, again, I don't want to give the answer you said, but it depends on exactly what you want to do. But the curation and the serving aspects of this context, I think, are two of the most critical things we're seeing today with how people are engaging with agents and LLMs, and this will be something going forward as well.
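One way to picture the serving side Omri mentions, summarizing the context you already have before handing a narrower slice to a sub-agent, is sketched below. The summarization call is stubbed out; in a real system it would be a model call, and all of the names here are hypothetical.

```python
# Hypothetical sketch: keep a rough context budget and hand a compacted summary
# plus a small set of curated documents to a focused sub-agent.

MAX_CONTEXT_CHARS = 4_000  # stand-in for a real token budget

def summarize(history: list[str]) -> str:
    """Stub for an LLM summarization call; here we just keep the last few turns."""
    return " | ".join(history[-3:])

def build_subagent_prompt(history: list[str], task: str, context_docs: list[str]) -> str:
    """Compose a prompt for a sub-agent, compacting the history if it is too large."""
    raw_history = "\n".join(history)
    if len(raw_history) > MAX_CONTEXT_CHARS:
        raw_history = summarize(history)   # compact before handing off
    docs = "\n".join(context_docs[:5])     # cap how many curated documents we attach
    return f"{raw_history}\n\nRelevant context:\n{docs}\n\nYour task: {task}"

prompt = build_subagent_prompt(
    history=["user asked for churn numbers", "agent found 3 candidate tables"],
    task="validate which table is the source of truth for churn",
    context_docs=["candidate tables: churn_v1, churn_daily, churn_gold"],
)
print(prompt)
```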
[00:23:02] Tobias Macey:
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed time line and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories.
And the other piece of it too is that because we have access to AI and we can start pulling in these unstructured data assets, it also necessitates that we process those so that we can have as much data as we can to feed into these systems. And then, also, it opens up the possibility because of the potential for increased developer productivity to start ingesting external and third party data sources and not only relying on organizational internal data. And I'm curious how you're seeing teams tackle that as well of even seeing what data is available outside and then evaluating and identifying which assets are going to be useful for their particular problem space.
[00:24:28] Ido Bronstein:
Yeah. So I think that it's, like, a two-sided sword. On one hand, teams have, like, access to much more data than they had before. Like, scraping the web is easier, and connecting to a new system and creating the connector is much easier. So they have much more data, and then they need to understand how to make use of this data. And this has two main problems. First of all, they need to understand how to connect the data that they gather to their existing enterprise data, their first-party data. This is a problem that takes the data engineering team a lot of time. And secondly, they need to understand how to make all of this flow all the time in a very reliable way. And this is also hard because, as you know, the output of the LLM is not always deterministic. And if you're scraping the web, the website can change and it can affect the data.
So after you have already succeeded in understanding how to use this data, making this reliable is the second challenge. But the amazing thing is that it opens up many more use cases. Right now, you can do fraud detection with external data and not just based on the activity of the payments that you see in your system. You can do targeted marketing using, like, reviews on Facebook or other social networks, and we see all these use cases. So the possibilities to do things with data just increase significantly. It's good, and it's another challenge for the data engineering team, because the bottleneck is not in the collection of the data.
It's in the ability to process and really use it.
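A very small illustration of the reliability problem Ido raises with scraped or third-party feeds is a schema drift check run before each load: if the source silently renames or drops a field, the load halts instead of quietly corrupting downstream tables. The field names and records below are invented for the example.

```python
# Hypothetical drift check for an external feed: compare incoming fields against
# the schema we last accepted, and stop the load if they no longer match.
EXPECTED_FIELDS = {"review_id", "rating", "text", "posted_at"}

def check_feed_schema(records: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the batch looks consistent."""
    problems = []
    for i, record in enumerate(records):
        missing = EXPECTED_FIELDS - record.keys()
        extra = record.keys() - EXPECTED_FIELDS
        if missing or extra:
            problems.append(f"record {i}: missing={sorted(missing)} extra={sorted(extra)}")
    return problems

batch = [
    {"review_id": 1, "rating": 5, "text": "great", "posted_at": "2024-01-01"},
    {"review_id": 2, "stars": 4, "text": "ok", "posted_at": "2024-01-02"},  # source renamed a field
]
issues = check_feed_schema(batch)
print("halt load:" if issues else "load ok", issues)
```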
[00:26:22] Tobias Macey:
And beyond just the data, and circling back around to the skills gap for teams who are responsible for providing that data, what are some of the ways that they should be thinking about either bringing on additional talent, or working more closely with other teams in the organization, or investing in new skills development, to help them figure out how best to actually work with or understand the problem space so that they're not just left flailing or, you know, trying to bail water out of a sinking ship?
[00:26:57] Ido Bronstein:
Yeah. And I think that what you described now is the feeling of a lot of data engineering teams that we are talking with routinely. I think the most important thing is to understand how they can use AI in order to accelerate their work, because AI opens the opportunity for the organization and for companies to change and to innovate, but it also opens the same opportunity for the data engineering teams themselves. And the nice thing that AI agents allow us to do is to stitch a lot of different pieces together in a smart way. And as you know, like, a lot of the technical difficulty of managing data pipelines is the fact that you have data spread across a lot of different tools, across different stages, and AI can really help you with that. It can really understand which tool you are using, understand the data, connect between the pieces, and in a sense, make all the tedious work of data engineers much smaller. So what we see in advanced data engineering teams, for example, teams at Netflix or Wix, is that they really leverage AI in a smart way in order to abstract a lot of the technical details in building their pipelines, so data engineers can focus on understanding the business, understanding what the next business use case is that we can deliver, which data we need to bring in in order to improve the fraud detection or the conversion of our product. So all the technical work that data engineers are doing today is done on those teams with AI.
[00:28:39] Tobias Macey:
I think another interesting aspect of the era that we're in right now, and something that I've seen in my own work, is that the introduction of AI as an end user facing capability in particular is forcing more of this combining of different teams that maybe have worked in their own particular areas, and reinforcing more of that cross-team collaboration, because there is no longer as clean of a handoff and those lines are blurring. And so you have to move from application development to data management because you need the data in the application to power the AI, and you also need to bring in the machine learning or data science teams because they're the ones who understand how to deal with the experimentation and deal in that more probabilistic space.
And how are you seeing that shift maybe the organizational structures as well where maybe we've gone from having these dedicated teams to everybody is one team, or maybe you're doing more of the, embedding strategy where you maybe group clusters of people who are in separate teams into their own little operational units to be able to have more of that end to end context and capability without having to have as much of a hard handoff between those stages?
[00:29:56] Ido Bronstein:
Yeah. So the first impact that we see in organizations, and you can really see it in every big organization, is that the number of data people that they're trying to bring in, the people that really understand data, is much larger. So you see organizations trying to bring in the best data scientists, the best data engineers, the best data analysts so they can really use their data. This is the first thing. The second thing is that the whole organization starts to talk in the data language. Suddenly, the software engineers start to understand that the data they produce really brings value to the organization.
So you see a lot of organizations starting to implement data contracts and the ability to manage the data across different parts of the application. And the last thing that we see is squads of, like, application engineers, data engineers, and data scientists working together on the same task. In the past, it was much more separate. It was like the application engineer worked, threw data somewhere, then the data engineer ordered it, and then the data scientist did something with it. Now we see much more squad-style teams that take a specific task and work on it together in order to solve it.
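As a loose illustration of the data contracts Ido mentions, a contract can be as simple as an agreed record shape plus guarantees that both the producing application team and the data team can check automatically. The names and fields below are hypothetical, not a specific standard.

```python
# Hypothetical data contract: the producing team agrees on the shape and guarantees
# of an event, and the same check can run on both the producer and consumer side.
ORDER_EVENT_CONTRACT = {
    "name": "order_completed",
    "owner": "checkout-team",
    "fields": {"order_id": str, "amount_usd": float, "completed_at": str},
    "guarantees": {"delivery": "at-least-once", "max_latency_minutes": 15},
}

def validate_event(event: dict, contract: dict) -> list[str]:
    """Return contract violations for a single event."""
    errors = []
    for field_name, field_type in contract["fields"].items():
        if field_name not in event:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(event[field_name], field_type):
            errors.append(f"{field_name} should be {field_type.__name__}")
    return errors

print(validate_event({"order_id": "o-1", "amount_usd": "12.50"}, ORDER_EVENT_CONTRACT))
# -> ['amount_usd should be float', 'missing field: completed_at']
```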
[00:31:13] Tobias Macey:
And another interesting approach that I can see being feasible is maybe, rather than restructuring all of those teams, taking some of the top performers from each of them and having them be an enablement squad, maybe defining some useful templates or context structures for feeding into agents that those teams can actually use to do their own work or act as an automated reviewer of the work that they're doing, or for that team to take more of an architectural consulting approach of evaluating what the current systems are and helping to identify some of the new capabilities or new workflows that need to be developed to make sure that everybody can operate at a higher level.
[00:31:58] Ido Bronstein:
Yeah. So I think we're at the beginning of the journey. Right? So in the beginning, you are putting the best and most talented people on the innovation task. And I think this is what you were trying to say: taking the best group of people that you have and saying, we'll, like, understand how to take those POCs into production first so all the organization can follow after.
[00:32:20] Tobias Macey:
And in your own work at Upriver, where you are trying to be more of that enablement layer and help teams automate some of the data work that needs to be in place to actually enable them to explore this new AI era, what are some of the core engineering challenges that you faced in terms of being able to actually build a system that moves more of these data workflows into an autonomous layer, or works with those teams to be able to reduce some of the drudgery or automate discovery or those various capabilities?
[00:32:59] Omri Lifshitz:
Yes. So I think one of the most interesting things we saw is that when we started out, we expected the models to be kind of like a magic black box that you can just throw whatever you want at. And I think a lot of people had that illusion at the beginning, where you can say, okay, I'll throw everything into the new GPT model, and I'll be able to get things working. And that obviously is not the way to go about it. And we've had to solve a lot of very big engineering tasks in order to make our system actually usable and build, like, the right things around it: how do we understand which context we need to collect from the environment we're connecting to, which also means collecting code information; how do we profile data and bring that context in to make sure that what the system is doing is reliable; how do we put in place the right validations in the system; and when do we need to go and get the human in the loop to get all of these things working correctly? I think being able to do all of these things and understanding how to put them in place was one of the biggest challenges we had to face here. And I think one of the most interesting things we did was to kind of reverse engineer how Claude Code and Cline and all of those coding agents are working. And what you see is that essentially not only are they very good with the models, they know how to make the model and the full system work like a software engineer does. And you need to have that kind of mentality. You need to understand how somebody doing this task would work and then know how to tie in all of the relevant things from there. And again, as I said earlier, the idea of how you build and engineer things and the system architecture you have to do it in, I think that is still one of the most critical things that we have today. And that's where, like, you see the really good software and data engineers being able to elevate how things are being done. I think that's a critical piece that is still going to go forward with us as we automate more things and as we bring in new tools that can take a lot of the busy work away from data and software engineering teams.
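A loose sketch of the loop Omri is describing, profile the data, run deterministic validations on what the system is about to do, and escalate to a human only when a check fails, might look like the following. All of the names are hypothetical and this is not Upriver's actual architecture.

```python
# Hypothetical loop: profile the data, validate deterministically, and pull a
# human in only when a validation fails.

def profile_table(rows: list[dict]) -> dict:
    """Collect simple facts about the data the agent will reason over."""
    return {
        "row_count": len(rows),
        "null_amounts": sum(1 for r in rows if r.get("amount_usd") is None),
    }

def validate(profile: dict) -> list[str]:
    """Deterministic checks; no LLM involved at this boundary."""
    problems = []
    if profile["row_count"] == 0:
        problems.append("table is empty")
    if profile["null_amounts"] > 0:
        problems.append(f"{profile['null_amounts']} rows missing amount_usd")
    return problems

def run_with_human_in_the_loop(rows: list[dict]) -> str:
    issues = validate(profile_table(rows))
    if issues:
        # In a real system this would open a review task for an engineer.
        return "escalate to engineer: " + "; ".join(issues)
    return "apply change automatically"

print(run_with_human_in_the_loop([{"amount_usd": 10.0}, {"amount_usd": None}]))
```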
[00:34:50] Tobias Macey:
Another aspect of this overall space is as is the case with software engineering, but I think more critically in, like, a data engineering context, the fact that every team has their own opinions of what tooling they want to use, what platforms they're building on top of. Obviously, there are broad level abstractions that we can use as commonalities. But when you get into the implementation specifics, there are wide variances in terms of how those technologies operate, especially when you move between batch and streaming. And I'm curious how you're seeing that play out as far as what are maybe some of the core primitives that are most essential to be able to actually feed into those AI systems where maybe batch has been working well enough for human driven analytics, but maybe AI is a forcing function for a broader migration to streaming or maybe vice versa. I'm just curious what you're seeing there.
[00:35:42] Omri Lifshitz:
Yeah. So I think, again, it goes to how we curate this context. One of the key things that's relevant is what abstractions you need when you come to approach data engineering tasks. So a batch pipeline is a batch pipeline no matter what underlying technology somebody's using for it. And with the data model, you have all of the semantics within it, but you still need to understand how things relate together and what the dependencies between them are. And I think that's one of the critical aspects. Understanding the ontology of the platform and what abstractions you want to put in place for it is one of the critical things you need to do in order to actually make this usable across different data teams. And I think that's also one of the things you see in software. Like, you might have people writing code in Python and other people writing code in Go. Still being able to understand how you need to look at this modularity is one of the critical things that you need to do in order to get these systems working. And, definitely, I think we will see more use cases for streaming as people are pushing towards more real-time things and the on-the-go analytics that we're seeing now. But the idea of how you abstract away these different things so you can make the model work correctly, I think that's a core that's going to stay across nearly every aspect of agents working.
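The abstraction point, that a batch pipeline is a batch pipeline no matter what runs underneath, can be pictured as a thin spec that different engines implement, so the layer above (human or agent) reasons about inputs, outputs, and dependencies rather than tools. This is a hypothetical sketch, not any specific product's API.

```python
# Hypothetical abstraction: describe a pipeline by its inputs, outputs, and schedule,
# independent of whether dbt, Spark, or plain SQL executes it underneath.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class PipelineSpec:
    name: str
    inputs: list[str]    # upstream datasets this pipeline depends on
    outputs: list[str]   # datasets it produces
    schedule: str        # e.g. "hourly", "daily"

class Engine(Protocol):
    def run(self, spec: PipelineSpec) -> None: ...

class DbtEngine:
    def run(self, spec: PipelineSpec) -> None:
        print(f"dbt run --select {spec.name}")   # stand-in for a real invocation

class SparkEngine:
    def run(self, spec: PipelineSpec) -> None:
        print(f"spark-submit {spec.name}.py")    # stand-in for a real invocation

spec = PipelineSpec("orders_daily", inputs=["raw.orders"], outputs=["gold.orders"], schedule="daily")
for engine in (DbtEngine(), SparkEngine()):
    engine.run(spec)  # same spec, different underlying technology
```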
[00:36:49] Tobias Macey:
And what are some of the key struggles that you're seeing teams deal with as they are going through all of this effort of building these AI powered systems, getting their data up to par, figuring out what data assets they have and which ones are actually useful?
[00:37:05] Omri Lifshitz:
I think one of the biggest challenges, and I think this goes to, like, the data maturity that companies have, is that a lot of companies are now trying to say, I want to use AI to enable my data teams, but they're still not clear on what they want to do with the data. And I think that's, like, the first principles and fundamentals: know what you want to do, and then you are able to build things on top of that to actually make this usable. So no matter how much AI and tooling you put in place, if you don't know what you want to achieve, I think that's not going to work. Once companies do know how to do that, we see them trying to use the existing software engineering tools, which are amazing. We're also using them, but they're just not the right tools for data engineering tasks because they lack this context. So we see teams trying to manually prompt all of the context that's relevant for data into Cursor before going on each task they need to do. We've even seen teams taking lineage screenshots from other systems and putting those in as part of the prompt to ChatGPT to then get it to write the right SQL. So there are a lot of issues there where you still need to understand how you bring the knowledge that the data engineer has into the system to actually get it to do the right task that you want it to.
[00:38:14] Tobias Macey:
And as you have been working with some of these teams and onboarding them onto Upriver and helping them understand the overall problem space, what are some of the most interesting or innovative or unexpected ways that you've seen companies trying to make their data AI ready?
[00:38:30] Ido Bronstein:
So I think that we see companies trying to tackle this across all the maturity levels, from companies that are hiring more data quality analysts who now need to check all the data and all the records and create the semantic layer, like, manually, just from speaking with the business and trying to dynamically update it. We see companies that try to use, as is, the automation that software engineers use, especially, like, Claude Code and Cursor, but then they are trying to understand how to gather context about the data using, like, MCPs and screenshots and some recipes for, like, data systems. And we see very mature companies, for example, Netflix and, like, Wix, that built their own automation tools in order to understand what this context is. For example, we heard how they're, like, connecting to Slack and Notion and creating all the data documentation automatically based on the unstructured organizational knowledge that they have in other systems. This is, like, I think, a very cool example.
[00:39:43] Tobias Macey:
And in your own work of building the system to be more of that autonomous data engineer and exploring this overall space of how we actually use AI to build these systems, what do the AI systems that are consuming from this actually need, and how do we make sure that they're speaking the same semantics? What are some of the most challenging or unexpected lessons that you've learned in that process?
[00:40:10] Ido Bronstein:
So as we said, there is no magic. Like, AI is not magic. You still need to properly build the context and the knowledge that you want the AI to use. So for us, the biggest challenge is to understand how to create the right ontologies from the context and how to create the right connections between different entities. For example, what is a pipeline? What is a table? What is an entity inside the tables? And then to help the AI understand those ontologies and use the specific context that we gather from our customer's environment in order to enable the AI for its use case, for the specific task that it is being asked to do right now. And I think this is the challenge that we solve and craft. Building the right context is the thing that makes our product really work for data engineers.
[00:41:11] Omri Lifshitz:
Yeah. And I will add one thing to that, because I think it's interesting in the way, like, you build these agentic systems. Not everything has to be done directly with the LLM. Right? There are things that you know are going to be deterministic in your system, and you want them to be deterministic. And those are things that regular software engineering can still do. And being able to play on that fine line of what is deterministic and where I can use the LLM to enrich my capabilities by miles, I mean, like, that gives you the ability to look at code and understand it, but then maybe I need to go back to a validation module to make sure the pipeline was done correctly. Being able to ping-pong between these two modes of working, I think, is what gives the edge to really production-ready AI systems, rather than just somebody throwing things into an LLM and expecting it to work correctly.
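Omri's point about ping-ponging between deterministic code and the model can be sketched like this: the LLM is asked only to do the part that needs language understanding, and its output is immediately re-checked by plain code before anything is applied. The LLM call is stubbed and every name here is hypothetical.

```python
# Hypothetical sketch: use an LLM only at the boundary that needs it (proposing a
# fix from code and an error message) and keep the acceptance check deterministic.
import json

def llm_suggest_fix(pipeline_sql: str, error_message: str) -> str:
    """Stub for an LLM call that proposes a corrected query as JSON."""
    return json.dumps({"fixed_sql": pipeline_sql.replace("ammount", "amount")})

def deterministic_accept(fixed_sql: str, required_columns: list[str]) -> bool:
    """Plain-code gate: the proposal must still reference every required column."""
    return all(column in fixed_sql for column in required_columns)

proposal = json.loads(llm_suggest_fix("select ammount from orders", "column not found"))
if deterministic_accept(proposal["fixed_sql"], required_columns=["amount"]):
    print("apply:", proposal["fixed_sql"])
else:
    print("reject the proposal and loop back to the model")
```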
[00:42:02] Tobias Macey:
And for teams who are excited about the potential for AI to automate their data engineering work and let them focus on more of the business oriented aspects, what are the cases where an agent based or autonomous approach is actually not the right way for these teams to be dealing with their data pipelines and data issues?
[00:42:25] Ido Bronstein:
So it depends on the maturity of the team. The point where a team will need AI in order to manage its data is when it knows what the exact business problem is that it wants to solve, and it now wants to deliver that. Teams that are just now understanding what the business needs, trying to map their data and understand which data they have in the organization, their work is more focused on the business and the ability to understand how they can enable the business. But once the focus moves to the technical part of building the data, structuring it, and managing it, this is where AI can really kick in. Because the business understanding will always, or I think will, stay with the human for a long time. But the technical part will be abstracted over time, and this is where AI can really enable data engineering teams.
[00:43:20] Tobias Macey:
And as you continue to invest in this problem area and as the overall industry progresses further into our understanding of how and when to use AI and what the requirements are of the models as they continue to evolve in terms of their capabilities, what are some of the trends that you're paying particularly close attention to or any problem areas or new engineering patterns that you are investing in or doing exploration of?
[00:43:51] Omri Lifshitz:
Yes. So I think one of the major things that came up, I think, like, in the last few months that we're really focusing on is sub-agents and the ability to kind of say, okay, what context do I need to derive for specific tasks within my overall, ultimate task? We've started using this. It is giving a lot of value right now, and we see this going forward. And how do you manage these things? I think that's one thing that will change. The second thing is how we manage context as models are able to gain bigger context windows and they're starting to gain memory as part of the model as well. That will also kind of change the way in which we need to manage the context we're giving them. So that is a big thing we're paying attention to and seeing how it will affect the entire industry. And, again, I think, like, in essence, the idea of how we store the context, that will probably change in the next year or so. We see things moving so fast. But the idea of how we curate it and understand what context we need, and what it means to actually know your data estate, that is essentially what a data engineer does best. Right? Knowing what is actually happening in their data environment, what the semantics and metrics are, what they're trying to do and what they're trying to calculate. That will stay. And I think we're trying to understand what the best way is, as the models and the paradigms change, to be able to deliver this value.
[00:45:07] Tobias Macey:
Are there any other aspects of this overall situation that we're in of the gap between the needs of these AI systems and the ability for data teams to be able to fulfill those requirements that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:25] Ido Bronstein:
I think yes. And I would love to, like, paint the picture of how I see this space a couple of years from now. So if I'm looking at this space today, today most of the data management work is done by data engineers, and you have, like, tools that help you, from data quality, data catalog, to semantic layer. Four years from now, I see it completely changed. I think that you will have an AI-based platform that helps you really automate how data is managed from end to end, from the minute that data lands in your platform, to the quality, to the catalog, to the semantics, to building the pipelines themselves. And data engineers will oversee those AI engines and will define what the business needs. They will define which data they want to enter the platform. And they will, of course, control it and, like, create the right ontologies and the right metrics for the business.
And this layer will be much more focused on what the business needs and much less focused on how to tactically create the pipeline and manage it. And we see the first steps of that right now in the industry, especially in the very advanced companies, and in four years, this will change how data management is done.
[00:46:42] Tobias Macey:
Yeah. I definitely agree with that, and I'm seeing that in my own work, where I am very knowledgeable about how these systems play together, but I don't want to get bogged down in the details of figuring out, oh, well, I missed this character in this batch call, or, oh, I need to make sure that I add this particular function call or this particular annotation to this model to make sure that it gets processed downstream. I just want to solve the overarching problem and let the AI deal with those granular details. But I am there as the supervisor to make sure that it's not making foolish mistakes or going down the wrong rabbit hole, which is, I think, a skill that people who have been very focused on the individual contributor track are going to need to really level up into, because they haven't gotten used to that aspect of being more of that management role and overseeing other people's work, where, in this case, the, quote, unquote, people are the AI, and just moving beyond maybe whatever ego they may have attached to the craft of their specific software practices or architectural patterns and focusing more on those broader objectives that they maybe don't have the time to actually think a lot about because they're so busy with those granular details.
Exactly. And I think that the industry will grow more and more in that area. Alright. Well, for anybody who wants to get in touch with you both, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:19] Ido Bronstein:
So, again, the biggest gap today is the ability to stitch the pieces together. Still today, the ability to, on one hand, understand the data and understand what it means, and on the other hand, understand orchestration, like dbt transformations, partitioning in storage, cataloging the data, and data quality, and to stitch all of that together, is a hard problem. Needing to hold all of these, like, tools and orchestrate everything in the correct way is just too complicated today.
[00:48:53] Tobias Macey:
Yeah. Absolutely. We are continuing to add more layers of complexity, and we're componentizing so you don't have to necessarily have the entire vertical complexity of some of the tools that we've dealt with over the time. But it just means that that knowledge has to be more diffuse.
[00:49:09] Ido Bronstein:
Yeah. Yes. And from what we see, AI helps you to really solve that, to solve the connection between the parts, and AI can use these tools and learn once and do it, like, in every company.
[00:49:22] Tobias Macey:
Alright. Well, thank you both very much for taking the time today to join me and share your experiences of working in this space and your thoughts on the demands and shortcomings of the data systems that we have as we start to bridge into these more AI driven capabilities. It's definitely a very necessary field to be studying and focusing on. So I appreciate all of the work that you're both doing to help make that a more tractable problem for a broader set of engineers.
[00:49:49] Ido Bronstein:
Thank you very much. We really enjoyed it. Yeah. Thank you for having us. Thank you.
[00:50:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.
Intro and episode setup
Guest introductions and backgrounds
AI amplifies data demand and pipeline bottlenecks
Productionizing AI: costs, error compounding, and reliability
What data do agents need? Catalogs, semantics, and serving
Do classic warehouse schemas still fit AI?
Right-sizing context windows and sub‑agent strategies
Using external data: integration, quality, and reliability
Closing the skills gap and leveraging AI for data engineering
Org structures shift: squads, data contracts, and enablement
Engineering Upriver: context, validation, and human-in-the-loop
Batch vs. streaming and core abstractions for agents
Common struggles and creative stopgaps
How leading teams make data AI‑ready
Hard lessons: ontologies, determinism, and agent design
When not to use autonomous agents
Trends to watch: sub‑agents, context, and memory
Future vision: AI‑driven end‑to‑end data management
Biggest tooling gap: stitching data and orchestration
Closing thanks and outro