In this episode of the Data Engineering Podcast Omri Lifshitz (CTO) and Ido Bronstein (CEO) of Upriver talk about the growing gap between AI's demand for high-quality data and organizations' current data practices. They discuss why AI accelerates both the supply and demand sides of data, highlighting that the bottleneck lies in the "middle layer" of curation, semantics, and serving. Omri and Ido outline a three-part framework for making data usable by LLMs and agents (collect, curate, serve) and share the challenges of scaling from POCs to production, including compounding error rates and reliability concerns. They also explore organizational shifts, patterns for managing context windows, pragmatic views on schema choices, and Upriver's approach to building autonomous data workflows using determinism and LLMs at the right boundaries. The conversation concludes with a look ahead to AI-first data platforms where engineers supervise business semantics while automation stitches technical details end-to-end.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Omri Lifshitz and Ido Bronstein about the challenges of keeping up with the demand for data when supporting AI systems
- Introduction
- How did you get involved in the area of data management?
- We're here to talk about "The Growing Gap Between Data & AI". From your perspective, what is this gap, and why do you think it's widening so rapidly right now?
- How does this gap relate to the founding story of Upriver? What problems were you and your co-founders experiencing that led you to build this?
- The core premise of new AI tools, from RAG pipelines to LLM agents, is that they are only as good as the data they're given. How does this "garbage in, garbage out" problem change when the "in" is not a static file but a complex, high-velocity, and constantly changing data pipeline?
- Upriver is described as an "intelligent agent system" and an "autonomous data engineer." This is a fascinating "AI to solve for AI" approach. Can you describe this agent-based architecture and how it specifically works to bridge that data-AI gap?
- Your website mentions a "Data Context Layer" that turns "tribal knowledge" into a "machine-usable mode." This sounds critical for AI. How do you capture that context, and how does it make data "AI-ready" in a way that a traditional data catalog or quality tool doesn't?
- What are the most innovative or unexpected ways you've seen companies trying to make their data "AI-ready"? And where are the biggest points of failure you observe?
- What has been the most challenging or unexpected lesson you've learned while building an AI system (Upriver) that is designed to fix the data foundation for other AI systems?
- When is an autonomous, agent-based approach not the right solution for a team's data quality problems? What organizational or technical maturity is required to even start closing this data-AI gap?
- What do you have planned for the future of Upriver? And looking more broadly, how do you see this gap between data and AI evolving over the next few years?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. WHOOP and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud. Your host is Tobias Macey, and today I'm interviewing Omri Lifshitz and Ido Bronstein about the challenges of keeping up with the demand for data when supporting AI systems. So, Omri, can you start by introducing yourself?
[00:02:09] Omri Lifshitz:
Hi. Glad to be here. I'm Omri, the cofounder and CTO of Upriver.
[00:02:13] Tobias Macey:
And, Ido, how about yourself?
[00:02:15] Ido Bronstein:
Yeah. So hello. I'm Ido, and I'm the CEO and cofounder of Upriver.
[00:02:21] Tobias Macey:
And, Omri, do you remember how you first got started working in data?
[00:02:25] Omri Lifshitz:
Yeah. So my journey started about fifteen years back. It started in the military. I was working in cybersecurity, and I was fortunate enough to work across the entire value chain of cybersecurity operations from doing really low level things, reverse engineering, knowing how we're able to get things running wherever we need them, all the way to building the data pipelines, collecting data from these cybersecurity operations. And a big part of what we had to do is essentially making sure that we're able to bring the right data at the right time, make this usable to all of these people on top of our data platform. So there, we had to deal daily with challenges of how do you maintain huge scale data pipelines and make this accessible to intelligence officers.
[00:03:07] Tobias Macey:
And, Ido, do you remember how you got started in data?
[00:03:10] Ido Bronstein:
Yeah. So in a similar way, I worked in the intelligence unit on cybersecurity operations and then on data infrastructure. I was lucky to lead our internal data platform, the platform that collected all the different data sources that we gathered, human and image intelligence and, like, the variety of data that we have there, and I was in charge of all the layers of the stack, from the infrastructure to the data pipeline management and orchestration and then to the application. And our main goal was to bring data, highly reliable and at a high pace, to the intelligence officers in a way that they could use.
And all of those experiences are what led Omri and me to start Upriver.
[00:03:55] Tobias Macey:
And so in terms of the overall space of data and the growing demands of AI systems, obviously, there is a lot of additional complexity that's getting layered on top of the inherent complexity of dealing with data systems that we've been struggling with for several decades now. But as we add AI to the set of consumers for these various data platforms and data streams, what are some of the ways that you're seeing that introduce a gap, either in terms of capabilities or structure, or just some of the points of friction that we're dealing with as we try to feed this new set of requirements into these new consuming systems that are now increasingly dealing with a broader range of consumers?
[00:04:47] Ido Bronstein:
Amazing. From a high level perspective, AI really accelerates the demand for data in organizations. And you can see it from both sides of the pipelines. First of all, AI enables us to extract data from any digital asset: images, PDFs, conversations. And today, businesses really understand that they can actually use data in every piece of their business. So you have an enormous amount of data the organization wants to use. And it also helps the organization understand that now it is time to use this data because, otherwise, it will fall behind. And on the other side, it helps us make this data much more accessible, because now any person in the organization can take a CSV and put it in ChatGPT, ask questions, and get, like, an amazing analysis. I'm sure that you have had the chance to, like, try to do some finance or marketing analysis using ChatGPT.
Its answers are very sophisticated and really help you understand how to use this data. But the middle layer, the one that takes this vast amount of data that you have in the sources that AI lets you collect and gets it to the point where this data is really usable for the AI, this is still a bottleneck as I see it today in the market. And when we are talking with CIOs and CDOs, this is something that is not solved yet by AI.
[00:06:22] Tobias Macey:
So as you pointed out, the introduction of AI adds new capabilities to the processing and production of data because of the fact that we can bring in more of these unstructured assets that have typically been very difficult to operationalize. And so that's one side of the equation, the other side of the equation being things like RAG pipelines or the whole chat-with-your-documents capability. But as we move more into agentic systems where we need to be able to do things like manage memory state, provide up-to-date information to those agents to make sure that the decisions that they're making or the information that they're providing is actually accurate given the context of the problem that they're trying to solve for, that's two different sides of the coin, where AI can help us accelerate our production of data assets, but it also means that we have a higher demand for those data assets. And so if you're not using AI in that production pipeline, there's a good chance that you're just going to be drowning in requests, or that you're going to be building agents that are constantly failing to provide useful capabilities because they don't have the context that they need. And I'm just wondering how you're seeing teams deal with that challenge. And in particular, when you're talking about using AI in the production phase, how do you prevent that from just causing costs to run rampant?
[00:07:47] Ido Bronstein:
Yeah. So I think that you are on point in how you interpreted what I said. And the way that I see it, these two parts of the pipeline are disconnected, and this is not a new problem. Like, the ability to make sense of even structured data still requires, like, manually curating the data, putting in the right context, joining between different sources, understanding what is true and what is not correct data. And only after you do this process and curate the data can someone really use it. And if it is this time for agents or RAG systems or, like, very complex LLM flows, someone needs to successfully connect between the, like, hundreds of data assets that we collect and the semantics of the business, the right standards for the business. So when you connect this to agents, they will work with the right context and in a smart and correct way. And this is exactly the work of, like, managing the data. This is what we've done in, like, the last twenty years, just focusing on structured data. Of course, you have the equivalent for unstructured data or any other use of data that you want to make. And today, like, this is the point where the agents break, and it's what causes them not to move from the POC stage to the production stage, because the data in production is a mess in its raw form.
And without working on ordering this mess in production, the agents won't work. And by now everyone knows the phrase garbage in, garbage out. And production data in its raw form, and I think this is correct for most organizations, is garbage. Someone needs to curate the data and make sense out of it.
[00:09:43] Tobias Macey:
And the other problem with using AI in that production phase is that it can be very straightforward in some cases to build a proof of concept and say, hey, I threw AI at my data pipeline, and now I've got this text document giving me structured output. But as with any AI-driven system, there is a lot of potential for things to go wrong as you try to scale that and operationalize it and actually start to depend on it for feeding downstream systems. Because as you introduce error rates and you feed them together, those error rates compound. So where maybe at the introduction of AI in doing document processing you're okay with a 2% error rate, where maybe it misinterpreted some of the phrasing of a document that you're processing, as you then do more analysis of that data, particularly in an agentic context, that 2% compounds to 5% or 10%, and then all of a sudden you're playing the worst game of telephone ever, and you're getting complete garbage output to your end users, who are then starting to lose faith in your capability to give them the data that they need. And I'm wondering how you're seeing some of the skills gap manifest in terms of people who are very adept at building production data pipelines, but they don't necessarily understand the requirements of building production AI systems, where maybe they need to manage things like evaluation or model selection, or maybe they need to do fine tuning to be able to increase the efficiency of the model that they're using and drive down cost for something that requires high-uptime usage. And I'm just curious how that's manifesting in teams who are just being told, hey, I need you to be able to process this data now because we have AI, and it can do it. But there are a lot more supporting systems that need to be put in place as well.
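To make the compounding Tobias describes concrete, here is a small back-of-the-envelope sketch. The stage error rates are made-up illustrative numbers; the only point is that independent per-stage errors multiply into a noticeably larger end-to-end error.

```python
# Illustrative arithmetic only: small per-stage error rates compound across a
# multi-stage pipeline, assuming errors are independent.

def end_to_end_error(per_stage_error_rates):
    """Return the overall error rate after chaining the stages together."""
    survival = 1.0
    for rate in per_stage_error_rates:
        survival *= (1.0 - rate)  # fraction of records still correct after this stage
    return 1.0 - survival

# A 2% error at document extraction, 3% at enrichment, 2% at the agentic analysis step.
stages = [0.02, 0.03, 0.02]
print(f"compounded error rate: {end_to_end_error(stages):.1%}")  # roughly 6.8%
```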
[00:11:39] Omri Lifshitz:
Yeah. So I think you nailed it, like, with the difference between doing a POC and saying, okay, I can now process tons of PDFs, and then actually making this production ready. And, like, that's a big thing that teams need to overcome, and one of the things we've needed to deal with when we've built our system, which is based on AI as well. Like, how do you actually make sure that you know how to do these evaluations, how to make sure you're building out the right process? And I think one critical thing that has stayed the same is that good engineering is still good engineering. Like, how do you build out a system that knows how to deal with faults? How do you build out the right process to make sure these things are done correctly? And I think there is still a gap, and even the market is still trying to understand exactly what the right way of doing this is. Doing these evals, using agents and LLMs to check agents. There are a lot of things coming into these systems that we now see, like, in the industry. But I think in general, a lot of teams are moving to this phase where they're starting to understand it requires a different mindset. You need to understand how you productionize these things and not just have POCs on top of them.
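A tiny sketch of what such an eval harness can look like, mixing deterministic assertions with a second model acting as judge. Everything here is hypothetical: the golden set, the agent, and the judge are stubs, not anyone's actual implementation.

```python
# Hypothetical eval harness: score an agent's answers against a small golden set,
# combining plain assertions with an LLM-as-judge check (stubbed out here).
GOLDEN_SET = [
    {"question": "total orders last month", "must_mention": "orders", "expected_table": "gold.orders"},
]

def agent_answer(question: str) -> str:
    """Stand-in for the agent under test."""
    return f"SELECT count(*) FROM gold.orders  -- answers: {question}"

def judge(question: str, answer: str) -> bool:
    """Stub for a second-model judgment of answer quality."""
    return "SELECT" in answer

def run_evals() -> float:
    passed = 0
    for case in GOLDEN_SET:
        answer = agent_answer(case["question"])
        deterministic_ok = case["expected_table"] in answer and case["must_mention"] in answer
        if deterministic_ok and judge(case["question"], answer):
            passed += 1
    return passed / len(GOLDEN_SET)

print(f"eval pass rate: {run_evals():.0%}")
```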
[00:12:47] Tobias Macey:
And so beyond the challenges of using AI and productionizing that to manage your data feeds, there's also the bigger question, and something that I think is broader than just being able to process unstructured data assets, of how do you determine what data is actually going to be useful for an AI agent to perform a given task? And I'm wondering how teams are dealing with that side of the equation as well, of identifying, one, what data assets they have if they don't already have a decent catalog of them, but also, how do they ensure that they're wrapping them in the appropriate semantics for the agent to be able to understand where, when, and how to actually apply them to the problems that they're given?
[00:13:31] Omri Lifshitz:
So that's a great question, and I think this is exactly, like, one of the premises that we had when starting Upriver. If you want to get these models working correctly, you need to give them the right context and the right data. And, oftentimes, organizations might have a data catalog. Usually, that's not updated correctly, I think. Like, a lot of companies have this catalog phobia or fatigue by now. They don't know exactly which assets they have. They don't know what the semantics of the data exactly are when they're trying to use this, especially now pushing this to an LLM to actually take on these tasks. So I think one of the key things that we've done in Upriver, and this now goes to how we've also built our system, is you need to do three different things in order to actually be able to use models effectively and agents correctly. One is collecting the right data. The second, and that's a critical piece, is curating this into something that actually encapsulates and captures the ontology and semantics of what you actually have in your system. And the third is how you serve this to the models, right? So that way you can actually make this usable. And I think if you skip any of these steps along the way, you're just going to get something that doesn't meet your expectations, and then you're probably going to be disappointed with the results you're getting. Because just writing an SQL query is quite easy. If you know exactly how to structure it, what you want to get from the data, that's quite easy to do. Being able to do these fully agentic flows, where we're saying, I want to build a pipeline that now allows me to check so and so, requires you to understand exactly what you have, what this means, how this relates between the different entities in your system, and then how you serve this to the model in the right way. And these are the three components that are critical for actually being able to use AI here and making data available to AI.
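As a rough illustration of the collect, curate, and serve framing Omri describes, the three stages can be pictured as separate functions with their own contracts. This is a hypothetical sketch with invented names, not Upriver's implementation.

```python
# Hypothetical sketch of collect -> curate -> serve. The semantics attached in the
# curate step are what the serve step hands to a model alongside the task.
from dataclasses import dataclass, field

@dataclass
class CuratedAsset:
    name: str
    description: str         # business semantics a model can read
    columns: dict[str, str]   # column -> meaning, units, caveats
    rows: list[dict] = field(default_factory=list)

def collect(sources: list[str]) -> list[dict]:
    """Stage 1: pull raw records from each source (stubbed)."""
    return [{"source": s, "payload": {}} for s in sources]

def curate(raw: list[dict]) -> CuratedAsset:
    """Stage 2: attach the ontology and semantics that raw records lack."""
    return CuratedAsset(
        name="orders",
        description="One row per completed customer order, deduplicated.",
        columns={"order_id": "unique order id", "amount_usd": "order total in US dollars"},
        rows=[r["payload"] for r in raw],
    )

def serve(asset: CuratedAsset, task: str) -> str:
    """Stage 3: render only the context a model needs for this task."""
    schema = "\n".join(f"- {col}: {meaning}" for col, meaning in asset.columns.items())
    return f"Task: {task}\nDataset: {asset.name} ({asset.description})\n{schema}"

print(serve(curate(collect(["billing_db", "crm_export"])), "monthly revenue by region"))
```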
[00:15:15] Tobias Macey:
The other interesting piece is that, generally speaking, as you provide more semantics and more context to the assets that you're producing, it's also beneficial to humans. The main difference being that AIs can process much faster and at a broader scale than an individual human in terms of doing the discovery and the interpretation of which data assets to apply for certain use cases. And as we do open up the doors for these AI systems to do a broader analysis and broader consumption of those data assets, how does that shift the visibility or highlight the gaps in the quality or reliability of data assets that have maybe been not necessarily neglected, but at least not used as actively by humans, who have a much narrower focus of, oh, I'm going to check this dashboard, or I use this data feed for performing this particular task, where now we can actually take advantage of the broader set of data assets that we've been collecting and maybe not paying as close attention to?
[00:16:19] Ido Bronstein:
So I think that you phrased it exactly as I see it. Like, agents and AI need the same things that humans need when they're accessing the data. They need the data to be without mistakes. They need to have the right semantics with a connection to the business context. And without this, agents, AI, and humans cannot really use the data correctly. The thing that changes with AI is, first of all, as you said, it can process much, much more data. And the second thing, it doesn't have, like, external context. It has only the context that it sees when it accesses the data. It doesn't know, like, what some other person told it in the kitchen over a coffee before it comes to do the job. It just uses the context that someone attached to the data.
So in a sense, the problem of how we manage the data, how we create the right context, cataloging, semantics, and create, like, high quality datasets does not really change, but you need to do it at scale and fast. The importance of doing so just accelerates. Now you need to have proper semantics and proper standards for all your data, not only for the tables that the analysts use right now. And I think in that sense, the importance of managing the data and being able to structure it correctly is just more important than ever before.
[00:17:58] Tobias Macey:
Another interesting aspect of bringing AI to bear is that, particularly when we're talking about a data warehouse context, there are established patterns for the structural semantics of the data, whether that's a star schema or a data vault or whichever tribe you are a member of. And I'm curious, with AI as the access layer versus a human analyst who is handcrafting these SQL queries, or a BI tool that is using these dimensional structures to do a visual navigation, are those still beneficial? Are those still the best way for us to be thinking about structuring the overall data assets, or do the semantics and capabilities of these AI systems and AI agents change the ways that we need to be thinking about the foundational structure and the foundational semantics of the data assets that we're producing?
[00:18:57] Ido Bronstein:
Great question. So first of all, I think that nobody has figured it out yet, what the right architecture is for doing an effective data warehouse for AI. I'm sure that there are things that will be preserved. For example, you need to create some kind of intermediate tables in order to be able to create this data efficiently. And I'm sure that the ability to curate the data, like bronze, silver, gold, in a sense, will stay, because we want to clean the data and then put the right semantics on it. But whether a star schema exactly is the correct architecture for AI, I'm not sure.
I think it's something that we see every organization solve differently today, especially trying to stitch AI onto its current architecture, and it works. Like, you don't need to change all of your architecture in order for AI to be able to process data. You just need to make the data reliable and high quality with the right semantics, and we will see how it will evolve.
[00:20:06] Omri Lifshitz:
Yeah. And I will add to that one of the things, like, you did talk about the tribes you have, each one believing they have the right schema and the right way to structure it, snowflake schema, star schema, and so on. I think one of the interesting things we saw while working with companies is that the models are very good at capturing the way in which you've structured your data and working on top of that once you've put a good data model in place. So the need to do it, and to define what your data model is and how you want it to look, that's still something we'll need to do. I think the difference between whether we're doing it this way or that way, the discrepancy will probably become much smaller over time because the models are able to understand it. So there, it's kind of a preference. And as you just said, we're still not sure what the best schema will be and what the best data model will be going forward. But we do see that even when you're using different things, the LLMs are able to capture the essence of the data.
[00:21:03] Tobias Macey:
Another interesting aspect of the fact that we do have these AI systems consuming the data and we need more data to be able to make them as useful as possible is the challenge of also not wanting to flood their context window. Because if you give it too much data, then it goes from being your smart sidekick to your dumb sidekick. And the contrary is also true, where it could be your dumb sidekick if you don't give it enough data. And I'm wondering how you're seeing teams think about that balance as well, of either not feeding it too much data, because it's not able to differentiate the useful stuff from the not useful stuff, or making sure that you're not starving it of data, and just figuring out what that balance is. And, obviously, I'm sure the answer is it depends, but I'm wondering how you're seeing teams go through that discovery process and then maintain that balance as their systems evolve and as their data feeds evolve.
[00:21:59] Omri Lifshitz:
Yeah. So I think one of the critical things here, and I talked a bit about this earlier, is the fact that you have to both curate the context for the kind of task you want to do, and then there is a critical aspect of how you serve this context out. So how do you make sure that you're delivering the right context at the right time? And maybe sometimes that means you want to minimize the context you already have, summarize that, and then move it to a kind of sub-agent architecture, which focuses on something specific. So there are a lot of things you need to deal with there in order to actually solve the issue. And that means, how do I map the context that I want to bring together for a certain task? Once I have that, how do I want to continue maintaining this over a conversation and a task for an agent? There are a lot of different nuances there. And, again, I don't want to give the answer you said, but it depends on exactly what you want to do. But the curation and the serving aspects of this context, I think, are two of the most critical things we're seeing today with how people are engaging with agents and LLMs, and this will be something going forward as well.
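One way to picture the serving side Omri mentions, summarizing the context you already have before handing a narrower slice to a sub-agent, is sketched below. The summarization call is stubbed out; in a real system it would be a model call, and all of the names here are hypothetical.

```python
# Hypothetical sketch: keep a rough context budget and hand a compacted summary
# plus a small set of curated documents to a focused sub-agent.

MAX_CONTEXT_CHARS = 4_000  # stand-in for a real token budget

def summarize(history: list[str]) -> str:
    """Stub for an LLM summarization call; here we just keep the last few turns."""
    return " | ".join(history[-3:])

def build_subagent_prompt(history: list[str], task: str, context_docs: list[str]) -> str:
    """Compose a prompt for a sub-agent, compacting the history if it is too large."""
    raw_history = "\n".join(history)
    if len(raw_history) > MAX_CONTEXT_CHARS:
        raw_history = summarize(history)   # compact before handing off
    docs = "\n".join(context_docs[:5])     # cap how many curated documents we attach
    return f"{raw_history}\n\nRelevant context:\n{docs}\n\nYour task: {task}"

prompt = build_subagent_prompt(
    history=["user asked for churn numbers", "agent found 3 candidate tables"],
    task="validate which table is the source of truth for churn",
    context_docs=["candidate tables: churn_v1, churn_daily, churn_gold"],
)
print(prompt)
```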
[00:23:02] Tobias Macey:
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed time line and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories.
And the other piece of it too is that because we have access to AI and we can start pulling in these unstructured data assets, it also necessitates that we process those so that we can have as much data as we can to feed into these systems. And then, also, it opens up the possibility because of the potential for increased developer productivity to start ingesting external and third party data sources and not only relying on organizational internal data. And I'm curious how you're seeing teams tackle that as well of even seeing what data is available outside and then evaluating and identifying which assets are going to be useful for their particular problem space.
[00:24:28] Ido Bronstein:
Yeah. So I think that it's, like, a two-sided sword. On one hand, teams have, like, access to much more data than they had before. Like, scraping the web is easier, and connecting to a new system and creating the connector is much easier. So they have much more data, and then they need to understand how to make use of this data. And this has two main problems. First of all, they need to understand how to connect the data that they gather to their existing enterprise data, their first-party data. This is a problem that takes the data engineering team a lot of time. And secondly, they need to understand how to make all of this flow all the time in a very reliable way. And this is also hard because, as you know, the output of the LLM is not always deterministic. And if you're scraping the web, the website can change and it can affect the data.
So after you have already succeeded in understanding how to use this data, making this reliable is the second challenge. But the amazing thing is that it opens up many more use cases. Right now, you can do fraud detection with external data and not just based on the activity of the payments that you see in your system. You can do targeted marketing using, like, reviews on Facebook or other social networks, and we see all these use cases. So the possibilities to do things with data just increase significantly. It's good, and it's another challenge for the data engineering team, because the bottleneck is not in the collection of the data.
It's in the ability to process and really use it.
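A very small illustration of the reliability problem Ido raises with scraped or third-party feeds is a schema drift check run before each load: if the source silently renames or drops a field, the load halts instead of quietly corrupting downstream tables. The field names and records below are invented for the example.

```python
# Hypothetical drift check for an external feed: compare incoming fields against
# the schema we last accepted, and stop the load if they no longer match.
EXPECTED_FIELDS = {"review_id", "rating", "text", "posted_at"}

def check_feed_schema(records: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the batch looks consistent."""
    problems = []
    for i, record in enumerate(records):
        missing = EXPECTED_FIELDS - record.keys()
        extra = record.keys() - EXPECTED_FIELDS
        if missing or extra:
            problems.append(f"record {i}: missing={sorted(missing)} extra={sorted(extra)}")
    return problems

batch = [
    {"review_id": 1, "rating": 5, "text": "great", "posted_at": "2024-01-01"},
    {"review_id": 2, "stars": 4, "text": "ok", "posted_at": "2024-01-02"},  # source renamed a field
]
issues = check_feed_schema(batch)
print("halt load:" if issues else "load ok", issues)
```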
[00:26:22] Tobias Macey:
And beyond just the data, and circling back around to the skills gap for teams who are responsible for providing that data, what are some of the ways that they should be thinking about either bringing on additional talent, or working more closely with other teams in the organization, or investing in new skills development, to help them figure out how best to actually work with or understand the problem space so that they're not just left flailing or, you know, trying to bail water out of a sinking ship?
[00:26:57] Ido Bronstein:
Yeah. And I think that what you described now is the feeling of a lot of data engineering teams that we are talking with routinely. I think the most important thing is to understand how they can use AI in order to accelerate their work, because AI opens the opportunity for the organization and for companies to change and to innovate, but it also opens the same opportunity for the data engineering teams themselves. And the nice thing that AI agents allow us to do is to stitch a lot of different pieces together in a smart way. And as you know, like, a lot of the technical difficulty of managing data pipelines is the fact that you have data spread across a lot of different tools, across different stages, and AI can really help you with that. It can really understand which tool you are using, understand the data, connect between the pieces, and in a sense, make all the tedious work of data engineers much smaller. So what we see in advanced data engineering teams, for example, teams at Netflix or Wix, is that they really leverage AI in a smart way in order to abstract a lot of the technical details in building their pipelines, so data engineers can focus on understanding the business, understanding what the next business use case is that we can deliver, which data we need to bring in in order to improve the fraud detection or the conversion of our product. So all the technical work that data engineers are doing today is done on those teams with AI.
[00:28:39] Tobias Macey:
I think another interesting aspect of the era that we're in right now, and something that I've seen in my own work, is that the introduction of AI as an end user facing capability in particular is forcing more of this combining of different teams that maybe have worked in their own particular areas, and reinforcing more of that cross-team collaboration, because there is no longer as clean of a handoff and those lines are blurring. And so you have to move from application development to data management because you need the data in the application to power the AI, and you also need to bring in the machine learning or data science teams because they're the ones who understand how to deal with the experimentation and deal in that more probabilistic space.
And how are you seeing that shift maybe the organizational structures as well where maybe we've gone from having these dedicated teams to everybody is one team, or maybe you're doing more of the, embedding strategy where you maybe group clusters of people who are in separate teams into their own little operational units to be able to have more of that end to end context and capability without having to have as much of a hard handoff between those stages?
[00:29:56] Ido Bronstein:
Yeah. So the first impact that we see in organizations, and you can really see it in every big organization, is that the number of data people that they're trying to bring in, the people that really understand data, is much larger. So you see organizations trying to bring in the best data scientists, the best data engineers, the best data analysts so they can really use their data. This is the first thing. The second thing is that the whole organization starts to talk in the data language. Suddenly, the software engineers start to understand that the data they produce really brings value to the organization.
So you see a lot of organizations starting to implement data contracts and the ability to manage the data across different parts of the application. And the last thing that we see is squads of, like, application engineers, data engineers, and data scientists working together on the same task. In the past, it was much more separate. It was like the application engineer worked, threw data somewhere, then the data engineer ordered it, and then the data scientist did something with it. Now we see much more squad-style teams that take a specific task and work on it together in order to solve it.
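As a loose illustration of the data contracts Ido mentions, a contract can be as simple as an agreed record shape plus guarantees that both the producing application team and the data team can check automatically. The names and fields below are hypothetical, not a specific standard.

```python
# Hypothetical data contract: the producing team agrees on the shape and guarantees
# of an event, and the same check can run on both the producer and consumer side.
ORDER_EVENT_CONTRACT = {
    "name": "order_completed",
    "owner": "checkout-team",
    "fields": {"order_id": str, "amount_usd": float, "completed_at": str},
    "guarantees": {"delivery": "at-least-once", "max_latency_minutes": 15},
}

def validate_event(event: dict, contract: dict) -> list[str]:
    """Return contract violations for a single event."""
    errors = []
    for field_name, field_type in contract["fields"].items():
        if field_name not in event:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(event[field_name], field_type):
            errors.append(f"{field_name} should be {field_type.__name__}")
    return errors

print(validate_event({"order_id": "o-1", "amount_usd": "12.50"}, ORDER_EVENT_CONTRACT))
# -> ['amount_usd should be float', 'missing field: completed_at']
```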
[00:31:13] Tobias Macey:
And another interesting approach that I can see being feasible is maybe, rather than restructuring all of those teams, taking some of the top performers from each of them and having them be an enablement squad, maybe defining some useful templates or context structures for feeding into agents that those teams can actually use to do their own work or act as an automated reviewer of the work that they're doing, or for that team to take more of an architectural consulting approach of evaluating what the current systems are and helping to identify some of the new capabilities or new workflows that need to be developed to make sure that everybody can operate at a higher level.
[00:31:58] Ido Bronstein:
Yeah. So I think we're at the beginning of the journey. Right? So in the beginning, you are putting the best and most talented people on the innovation task. And I think this is what you were trying to say: taking the best group of people that you have and saying, we'll, like, understand how to take those POCs into production first so all the organization can follow after.
[00:32:20] Tobias Macey:
And in your own work at Upriver, where you are trying to be more of that enablement layer and help teams automate some of the data work that needs to be in place to actually enable them to explore this new AI era, what are some of the core engineering challenges that you faced in terms of being able to actually build a system that moves more of these data workflows into an autonomous layer, or works with those teams to be able to reduce some of the drudgery or automate discovery or those various capabilities?
[00:32:59] Omri Lifshitz:
Yes. So I think one of the most interesting things we saw is that when we started out, we expected the models to be kind of like a magic black box that you can just throw whatever you want at. And I think a lot of people had that illusion at the beginning, where you can say, okay, I'll throw everything into the new GPT model, and I'll be able to get things working. And that obviously is not the way to go about it. And we've had to solve a lot of very big engineering tasks in order to make our system actually usable and build, like, the right things around it: how do we understand which context we need to collect from the environment we're connecting to, which also means collecting code information; how do we profile data and bring that context in to make sure that what the system is doing is reliable; how do we put in place the right validations in the system; and when do we need to go and get the human in the loop to get all of these things working correctly? I think being able to do all of these things and understanding how to put them in place was one of the biggest challenges we had to face here. And I think one of the most interesting things we did was to kind of reverse engineer how Claude Code and Cline and all of those coding agents are working. And what you see is that essentially not only are they very good with the models, they know how to make the model and the full system work like a software engineer does. And you need to have that kind of mentality. You need to understand how somebody doing this task would work and then know how to tie in all of the relevant things from there. And again, as I said earlier, the idea of how you build and engineer things and the system architecture you have to do it in, I think that is still one of the most critical things that we have today. And that's where, like, you see the really good software and data engineers being able to elevate how things are being done. I think that's a critical piece that is still going to go forward with us as we automate more things and as we bring in new tools that can take a lot of the busy work away from data and software engineering teams.
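A loose sketch of the loop Omri is describing, profile the data, run deterministic validations on what the system is about to do, and escalate to a human only when a check fails, might look like the following. All of the names are hypothetical and this is not Upriver's actual architecture.

```python
# Hypothetical loop: profile the data, validate deterministically, and pull a
# human in only when a validation fails.

def profile_table(rows: list[dict]) -> dict:
    """Collect simple facts about the data the agent will reason over."""
    return {
        "row_count": len(rows),
        "null_amounts": sum(1 for r in rows if r.get("amount_usd") is None),
    }

def validate(profile: dict) -> list[str]:
    """Deterministic checks; no LLM involved at this boundary."""
    problems = []
    if profile["row_count"] == 0:
        problems.append("table is empty")
    if profile["null_amounts"] > 0:
        problems.append(f"{profile['null_amounts']} rows missing amount_usd")
    return problems

def run_with_human_in_the_loop(rows: list[dict]) -> str:
    issues = validate(profile_table(rows))
    if issues:
        # In a real system this would open a review task for an engineer.
        return "escalate to engineer: " + "; ".join(issues)
    return "apply change automatically"

print(run_with_human_in_the_loop([{"amount_usd": 10.0}, {"amount_usd": None}]))
```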
[00:34:50] Tobias Macey:
Another aspect of this overall space is as is the case with software engineering, but I think more critically in, like, a data engineering context, the fact that every team has their own opinions of what tooling they want to use, what platforms they're building on top of. Obviously, there are broad level abstractions that we can use as commonalities. But when you get into the implementation specifics, there are wide variances in terms of how those technologies operate, especially when you move between batch and streaming. And I'm curious how you're seeing that play out as far as what are maybe some of the core primitives that are most essential to be able to actually feed into those AI systems where maybe batch has been working well enough for human driven analytics, but maybe AI is a forcing function for a broader migration to streaming or maybe vice versa. I'm just curious what you're seeing there.
[00:35:42] Omri Lifshitz:
Yeah. So I think, again, it goes to how we curate this context. One of the key things that's relevant is what abstractions you need when you come to approach data engineering tasks. So a batch pipeline is a batch pipeline no matter what underlying technology somebody's using for it. And with the data model, you have all of the semantics within it, but you still need to understand how things relate together and what the dependencies between them are. And I think that's one of the critical aspects. Understanding the ontology of the platform and what abstractions you want to put in place for it is one of the critical things you need to do in order to actually make this usable across different data teams. And I think that's also one of the things you see in software. Like, you might have people writing code in Python and other people writing code in Go. Still being able to understand how you need to look at this modularity is one of the critical things that you need to do in order to get these systems working. And, definitely, I think we will see more use cases for streaming as people are pushing towards more real-time things and the on-the-go analytics that we're seeing now. But the idea of how you abstract away these different things so you can make the model work correctly, I think that's a core that's going to stay across nearly every aspect of agents working.
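The abstraction point, that a batch pipeline is a batch pipeline no matter what runs underneath, can be pictured as a thin spec that different engines implement, so the layer above (human or agent) reasons about inputs, outputs, and dependencies rather than tools. This is a hypothetical sketch, not any specific product's API.

```python
# Hypothetical abstraction: describe a pipeline by its inputs, outputs, and schedule,
# independent of whether dbt, Spark, or plain SQL executes it underneath.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class PipelineSpec:
    name: str
    inputs: list[str]    # upstream datasets this pipeline depends on
    outputs: list[str]   # datasets it produces
    schedule: str        # e.g. "hourly", "daily"

class Engine(Protocol):
    def run(self, spec: PipelineSpec) -> None: ...

class DbtEngine:
    def run(self, spec: PipelineSpec) -> None:
        print(f"dbt run --select {spec.name}")   # stand-in for a real invocation

class SparkEngine:
    def run(self, spec: PipelineSpec) -> None:
        print(f"spark-submit {spec.name}.py")    # stand-in for a real invocation

spec = PipelineSpec("orders_daily", inputs=["raw.orders"], outputs=["gold.orders"], schedule="daily")
for engine in (DbtEngine(), SparkEngine()):
    engine.run(spec)  # same spec, different underlying technology
```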
[00:36:49] Tobias Macey:
And what are some of the key struggles that you're seeing teams deal with as they are going through all of this effort of building these AI powered systems, getting their data up to par, figuring out what data assets they have and which ones are actually useful?
[00:37:05] Omri Lifshitz:
I think one of the biggest challenges, and I think this goes to, like, the data maturity that companies have, is that a lot of companies are now trying to say, I want to use AI to enable my data teams, but they're still not clear on what they want to do with the data. And I think that's, like, the first principles and fundamentals: know what you want to do, and then you are able to build things on top of that to actually make this usable. So no matter how much AI and tooling you put in place, if you don't know what you want to achieve, I think that's not going to work. Once companies do know how to do that, we see them trying to use the existing software engineering tools, which are amazing. We're also using them, but they're just not the right tools for data engineering tasks because they lack this context. So we see teams trying to manually prompt all of the context that's relevant for data into Cursor before going on each task they need to do. We've even seen teams taking lineage screenshots from other systems and putting those in as part of the prompt to ChatGPT to then get it to write the right SQL. So there are a lot of issues there where you still need to understand how you bring the knowledge that the data engineer has into the system to actually get it to do the right task that you want it to.
[00:38:14] Tobias Macey:
And as you have been working with some of these teams and onboarding them onto Upriver and helping them understand the overall problem space, what are some of the most interesting or innovative or unexpected ways that you've seen companies trying to make their data AI ready?
[00:38:30] Ido Bronstein:
So I think that we see companies trying to tackle this across all the maturity levels, from companies that are hiring more data quality analysts who now need to check all the data and all the records and create the semantic layer, like, manually, just from speaking with the business and trying to dynamically update it. We see companies that try to use, as is, the automation that software engineers use, especially, like, Claude Code and Cursor, but then they are trying to understand how to gather context about the data using, like, MCPs and screenshots and some recipes for, like, data systems. And we see very mature companies, for example, Netflix and, like, Wix, that built their own automation tools in order to understand what this context is. For example, we heard how they're, like, connecting to Slack and Notion and creating all the data documentation automatically based on the unstructured organizational knowledge that they have in other systems. This is, like, I think, a very cool example.
[00:39:43] Tobias Macey:
And in your own work of building the system to be more of that autonomous data engineer and exploring this overall space of how we actually use AI to build these systems, what do the AI systems that are consuming from this actually need, and how do we make sure that they're speaking the same semantics? What are some of the most challenging or unexpected lessons that you've learned in that process?
[00:40:10] Ido Bronstein:
So as we said, there is no magic. Like, AI is not magic. You still need to properly build the context and the knowledge that you want the AI to use. So for us, the biggest challenge is to understand how to create the right ontologies from the context and how to create the right connections between different entities. For example, what is a pipeline? What is a table? What is an entity inside the tables? And then to help the AI understand those ontologies and use the specific context that we gather from our customer's environment in order to enable the AI for its use case, for the specific task that it is being asked to do right now. And I think this is the challenge that we solve and craft. Building the right context is the thing that makes our product really work for data engineers.
[00:41:11] Omri Lifshitz:
Yeah. And I will add one thing to that, because I think it's interesting in the way, like, you build these agentic systems. Not everything has to be done directly with the LLM. Right? There are things that you know are going to be deterministic in your system, and you want them to be deterministic. And those are things that regular software engineering can still do. And being able to play on that fine line of what is deterministic and where I can use the LLM to enrich my capabilities by miles, I mean, like, that gives you the ability to look at code and understand it, but then maybe I need to go back to a validation module to make sure the pipeline was done correctly. Being able to ping-pong between these two modes of working, I think, is what gives the edge to really production-ready AI systems, rather than just somebody throwing things into an LLM and expecting it to work correctly.
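Omri's point about ping-ponging between deterministic code and the model can be sketched like this: the LLM is asked only to do the part that needs language understanding, and its output is immediately re-checked by plain code before anything is applied. The LLM call is stubbed and every name here is hypothetical.

```python
# Hypothetical sketch: use an LLM only at the boundary that needs it (proposing a
# fix from code and an error message) and keep the acceptance check deterministic.
import json

def llm_suggest_fix(pipeline_sql: str, error_message: str) -> str:
    """Stub for an LLM call that proposes a corrected query as JSON."""
    return json.dumps({"fixed_sql": pipeline_sql.replace("ammount", "amount")})

def deterministic_accept(fixed_sql: str, required_columns: list[str]) -> bool:
    """Plain-code gate: the proposal must still reference every required column."""
    return all(column in fixed_sql for column in required_columns)

proposal = json.loads(llm_suggest_fix("select ammount from orders", "column not found"))
if deterministic_accept(proposal["fixed_sql"], required_columns=["amount"]):
    print("apply:", proposal["fixed_sql"])
else:
    print("reject the proposal and loop back to the model")
```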
[00:42:02] Tobias Macey:
And for teams who are excited about the potential for AI to automate their data engineering work and let them focus on more of the business oriented aspects, what are the cases where an agent based or autonomous approach is actually not the right way for these teams to be dealing with their data pipelines and data issues?
[00:42:25] Ido Bronstein:
So it depends on the maturity of the team. The point where a team will need AI in order to manage its data is when it knows what the exact business problem is that it wants to solve, and it now wants to deliver that. Teams that are just now understanding what the business needs, trying to map their data and understand which data they have in the organization, their work is more focused on the business and the ability to understand how they can enable the business. But once the focus moves to the technical part of building the data, structuring it, and managing it, this is where AI can really kick in. Because the business understanding will always, or I think will, stay with the human for a long time. But the technical part will be abstracted over time, and this is where AI can really enable data engineering teams.
[00:43:20] Tobias Macey:
And as you continue to invest in this problem area and as the overall industry progresses further into our understanding of how and when to use AI and what the requirements are of the models as they continue to evolve in terms of their capabilities, what are some of the trends that you're paying particularly close attention to or any problem areas or new engineering patterns that you are investing in or doing exploration of?
[00:43:51] Omri Lifshitz:
Yes. So I think one of the major things that came up, I think, like, in the last few months that we're really focusing on is sub-agents and the ability to kind of say, okay, what context do I need to derive for specific tasks within my overall, ultimate task? We've started using this. It is giving a lot of value right now, and we see this going forward. And how do you manage these things? I think that's one thing that will change. The second thing is how we manage context as models are able to gain bigger context windows and they're starting to gain memory as part of the model as well. That will also kind of change the way in which we need to manage the context we're giving them. So that is a big thing we're paying attention to and seeing how it will affect the entire industry. And, again, I think, like, in essence, the idea of how we store the context, that will probably change in the next year or so. We see things moving so fast. But the idea of how we curate it and understand what context we need, and what it means to actually know your data estate, that is essentially what a data engineer does best. Right? Knowing what is actually happening in their data environment, what the semantics and metrics are, what they're trying to do and what they're trying to calculate. That will stay. And I think we're trying to understand what the best way is, as the models and the paradigms change, to be able to deliver this value.
[00:45:07] Tobias Macey:
Are there any other aspects of this overall situation that we're in of the gap between the needs of these AI systems and the ability for data teams to be able to fulfill those requirements that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:25] Ido Bronstein:
I think yes. And I would love to, like, paint the picture of how I see this space a couple of years from now. So if I'm looking at this space today, today most of the data management work is done by data engineers, and you have, like, tools that help you, from data quality, data catalog, to semantic layer. Four years from now, I see it completely changed. I think that you will have an AI-based platform that helps you really automate how data is managed from end to end, from the minute that data lands in your platform, to the quality, to the catalog, to the semantics, to building the pipelines themselves. And data engineers will oversee those AI engines and will define what the business needs. They will define which data they want to enter the platform. And they will, of course, control it and, like, create the right ontologies and the right metrics for the business.
And this layer will be much more focused on what the business needs and much less focused on how to tactically create the pipeline and manage it. And we see the first steps of that right now in the industry, especially in the very advanced companies, and in four years, this will change how data management is done.
[00:46:42] Tobias Macey:
Yeah. I definitely agree with that, and I'm seeing that in my own work, where I am very knowledgeable about how these systems play together, but I don't want to get bogged down in the details of figuring out, oh, well, I missed this character in this batch call, or, oh, I need to make sure that I add this particular function call or this particular annotation to this model to make sure that it gets processed downstream. I just want to solve the overarching problem and let the AI deal with those granular details. But I am there as the supervisor to make sure that it's not making foolish mistakes or going down the wrong rabbit hole, which is, I think, a skill that people who have been very focused on the individual contributor track are going to need to really level up into, because they haven't gotten used to that aspect of being more of that management role and overseeing other people's work, where, in this case, the, quote, unquote, people are the AI, and just moving beyond maybe whatever ego they may have attached to the craft of their specific software practices or architectural patterns and focusing more on those broader objectives that they maybe don't have the time to actually think a lot about because they're so busy with those granular details.
Exactly. And I think that the industry will grow more and more in that area. Alright. Well, for anybody who wants to get in touch with you both, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:19] Ido Bronstein:
So, again, the biggest gap today is the ability to stitch the pieces together. Still today, the ability to, on one hand, understand the data and understand what it means, and on the other hand, understand orchestration, like dbt transformations, partitioning in storage, cataloging the data, and data quality, and to stitch all of that together, is a hard problem. Needing to hold all of these, like, tools and orchestrate everything in the correct way is just too complicated today.
[00:48:53] Tobias Macey:
Yeah. Absolutely. We are continuing to add more layers of complexity, and we're componentizing so you don't have to necessarily have the entire vertical complexity of some of the tools that we've dealt with over the time. But it just means that that knowledge has to be more diffuse.
[00:49:09] Ido Bronstein:
Yeah. Yes. And from what we see, AI helps you to really solve that, to solve the connection between the parts, and AI can use these tools and learn once and do it, like, in every company.
[00:49:22] Tobias Macey:
Alright. Well, thank you both very much for taking the time today to join me and share your experiences of working in this space and your thoughts on the demands and shortcomings of the data systems that we have as we start to bridge into these more AI driven capabilities. It's definitely a very necessary field to be studying and focusing on. So I appreciate all of the work that you're both doing to help make that a more tractable problem for a broader set of engineers.
[00:49:49] Ido Bronstein:
Thank you very much. We really enjoyed it. Yeah. Thank you for having us. Thank you.
[00:50:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.
Intro and episode setup
Guest introductions and backgrounds
AI amplifies data demand and pipeline bottlenecks
Productionizing AI: costs, error compounding, and reliability
What data do agents need? Catalogs, semantics, and serving
Do classic warehouse schemas still fit AI?
Right-sizing context windows and sub‑agent strategies
Using external data: integration, quality, and reliability
Closing the skills gap and leveraging AI for data engineering
Org structures shift: squads, data contracts, and enablement
Engineering Upriver: context, validation, and human-in-the-loop
Batch vs. streaming and core abstractions for agents
Common struggles and creative stopgaps
How leading teams make data AI‑ready
Hard lessons: ontologies, determinism, and agent design
When not to use autonomous agents
Trends to watch: sub‑agents, context, and memory
Future vision: AI‑driven end‑to‑end data management
Biggest tooling gap: stitching data and orchestration
Closing thanks and outro