In this episode of the Data Engineering Podcast Ariel Pohoryles, head of product marketing for Boomi's data management offerings, talks about a recent survey of 300 data leaders on how organizations are investing in data to scale AI. He shares a paradox uncovered in the research: while 77% of leaders trust the data feeding their AI systems, only 50% trust their organization's data overall. Ariel explains why truly productionizing AI demands broader, continuously refreshed data with stronger automation and governance, and highlights the challenges posed by unstructured data and vector stores. The conversation covers the need to shift from manual reviews to automated pipelines, the resurgence of metadata and master data management, and the importance of guardrails, traceability, and agent governance. Ariel also predicts a growing convergence between data teams and application integration teams and advises leaders to focus on high-value use cases, aggressive pipeline automation, and cataloging and governing the coming sprawl of AI agents, all while using AI to accelerate data engineering itself.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about data management investments that organizations are making to enable them to scale AI implementations
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the motivation and scope of your recent survey on data management investments for AI across your respondents?
- What are the key takeaways that were most significant to you?
- The survey reveals a fascinating paradox: 77% of leaders trust the data used by their AI systems, yet only half trust their organization's overall data quality. For our data engineering audience, what does this suggest about how companies are currently sourcing data for AI?
- Does it imply they are using narrow, manually-curated "golden datasets," and what are the technical challenges and risks of that approach as they try to scale?
- The report highlights a heavy reliance on manual data quality processes, with one expert noting companies feel it's "not reliable to fully automate validation" for external or customer data. At the same time, maturity in "Automated tools for data integration and cleansing" is low, at only 42%. What specific technical hurdles or organizational inertia are preventing teams from adopting more automation in their data quality and integration pipelines?
- There was a significant point made that with generative AI, "biases can scale much faster," making automated governance essential. From a data engineering perspective, how does the data management strategy need to evolve to support generative AI versus traditional ML models?
- What new types of data quality checks, lineage tracking, or monitoring for feedback loops are required when the model itself is generating new content based on its own outputs?
- The report champions a "centralized data management platform" as the "connective tissue" for reliable AI. How do you see the scale and data maturity impacting the realities of that effort?
- How do architectural patterns in the shape of cloud warehouses, lakehouses, data mesh, data products, etc. factor into that need for centralized/unified platforms?
- A surprising finding was that a third of respondents have not fully grasped the risk of significant inaccuracies in their AI models if they fail to prioritize data management. In your experience, what are the biggest blind spots for data and analytics leaders?
- Looking at the maturity charts, companies rate themselves highly on "Developing a data management strategy" (65%) but lag significantly in areas like "Automated tools for data integration and cleansing" (42%) and "Conducting bias-detection audits" (24%). If you were advising a data engineering team lead based on these findings, what would you tell them to prioritize in the next 6-12 months to bridge the gap between strategy and a truly scalable, trustworthy data foundation for AI?
- The report states that 83% of companies expect to integrate more data sources for their AI in the next year. For a data engineer on the ground, what is the most important capability they need to build into their platform to handle this influx?
- What are the most interesting, innovative, or unexpected ways that you have seen teams addressing the new and accelerated data needs for AI applications?
- What are some of the noteworthy trends or predictions that you have for the near-term future of the impact that AI is having or will have on data teams and systems?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macey, and today I'm interviewing Ariel Pohoryles about data management investments that organizations are making to enable them to scale AI implementations. So, Ariel, can you start by introducing yourself?
[00:02:06] Ariel Pohoryles:
Of course. Hi, Tobias. Nice to be here again. So my name is Ariel Pohoryles. As you mentioned, I'm with Boomi. I lead product marketing for the data management offerings of the Boomi platform. I've been in the data engineering space for more than twenty years now, and I'm happy to talk about data.
[00:02:24] Tobias Macey:
And do you remember how you first got started working in data?
[00:02:28] Ariel Pohoryles:
Yeah. I actually learned data engineering at school, at university, though it wasn't called data engineering back then. It was called information systems engineering, but it was essentially the same idea: learning about databases, concepts, SQL, coding, and so on. So I was always around data. My first job out of school was being part of a data team in a large semiconductor company. I then moved on to be part of a data analytics vendor, and that's where I first found my love for combining my data experience with product marketing. I went on to manage a data consulting firm, where I ran all kinds of data projects with customers of different sizes and industries. And then, for the past three years, I've been back on the product marketing side for the data offerings of the Boomi platform.
[00:03:16] Tobias Macey:
And so we're here to talk about the recent survey that you and the rest of the Boomi team conducted about the current state of investments that various organizations are making in the data that they need for the AI initiatives they're trying to power with it. I'm wondering if you can start by giving a bit of an overview of what you were trying to achieve by conducting this survey.
[00:03:42] Ariel Pohoryles:
Yeah. Absolutely. You know, in product marketing, a lot of what we try to do is create thought leadership content. And to me, the best way to create that content is really to base it on data. Not surprising, right? So running this type of survey usually provides the best evidence and the best understanding of what's really happening in the market. Obviously, not just looking at our own customer base, but in general, across the community. And, of course, AI is top of mind for all of us. But I think specifically for the data engineering world, the impact of AI is not just about giving us more tools for productivity or efficiency gains, as we see for most other jobs. In data engineering specifically, we see data engineers also being involved in the motion of creating and building AI solutions. So we wanted to better understand how that is going to affect the workloads of data engineers and what that means. And specifically at Boomi, we're heavily investing not only in providing the foundation you need to leverage AI (everything you need to integrate your data, integrate your applications, manage your APIs, and now MCP management as AI is consumed in different ways), but also in actual agent management, being able to build and govern agents. And so with that in mind, we were really curious to see how data leaders are seeing the data work changing as they prepare to enable AI across the business.
[00:05:11] Tobias Macey:
And as far as the role of data engineers in this new era of generative AI and the high demand for data of varying shapes, I'm just wondering before we dig too much into the details, what are some of the key takeaways that you found most significant, surprising, interesting, whatever adjective you want to apply to it?
[00:05:34] Ariel Pohoryles:
One thing I should probably add before I answer the question about the surprising or significant takeaways is that we did that research with 300 different data leaders working across eight different markets. It was done recently, earlier this year. Most of the respondents were in the US and the UK, with good representation out of Europe, Canada, Australia, and New Zealand. And we chose four popular industries: tech, financial services, retail, and manufacturing. And, of course, we came into it with assumptions in mind, some specific ideas that would probably hold true.
And indeed, that was validated by a lot of the data points we saw. But then there were definitely some surprising or significant takeaways. I think one of those was the fact that many data leaders consider their organizations, or themselves, to be fairly mature around regular data monitoring and quality checks, but only 50% trust their data. And so that was quite surprising. You have maturity around data monitoring and quality, but you still haven't achieved the level of trust that you would need. And that's especially impactful in the age of AI, because AI can go in different directions. But even for standard analytics, if you only have 50% trust in your data, that's to me quite a red flag, indicative of the bigger challenges that we have here. And I think that's not new. I mentioned I've been doing data for a lot of years now, and I think that's always something we've been fighting to a certain extent. Even though when I started my career, data engineering was much simpler: far fewer sources back then, and everything was on prem.
Refresh rates were lower, data volumes were smaller, but still, managing data quality was always a challenge. And it's obviously just getting more and more complex. And now with AI, that's probably the biggest takeaway that we take out of this: that low trust in data. One of the stats we saw in the research that may explain a bit of that was the low maturity level of automation in data pipelines. That's something we'll probably talk about a bit more later on, but maybe that's the reason there are still only 50% of organizations really trusting their data.
[00:07:55] Tobias Macey:
To the point of that lack of trust, I think it's possibly also just a matter of the level of attention that's being paid to it, where if you don't have any thought of adding data quality monitoring, you just say, oh, I've got lots of data, of course I trust it. But if you're actually doing that work of measuring the quality, measuring the reliability, then you might be a little bit more skeptical by nature and say, well, I have these numbers that say the data is good, but I'm not totally sure I believe it. So maybe it's that healthy dose of skepticism that is biasing some of those responses.
[00:08:28] Ariel Pohoryles:
That is a very good point. I definitely agree with that.
[00:08:32] Tobias Macey:
And so digging now into some of the survey results, I think one of the key takeaways is that a lot of the data leaders trust the data that they're using for their AI systems, but they don't necessarily trust the overall data quality, to your earlier point. And I'm wondering how that reflects the ways that those people are sourcing data, where maybe they have a higher degree of confidence in the data they're using for AI but low overall confidence, and how we can reconcile those two. Where is the data for their AI systems coming from? Is it that they have two different answers even though they're maybe relying on the same underlying sources, or are they curating only their most trusted sources to feed into their AI systems, which gives them higher confidence in that particular area of data versus their overall impression of data at the organizational level?
[00:09:25] Ariel Pohoryles:
Yeah, I think you're definitely diving deeper into that key takeaway I mentioned, that paradox of 77% of leaders trusting the data in their AI systems but only 50% trusting data overall. I think it mainly suggests that most AI systems are currently built on very small, static datasets. You talked about the golden dataset, if you will. That really makes sense when it comes to feeding AI systems with unstructured documents and specifically curated datasets. But I think it also suggests there's white space there; there's still work to be done when it comes to feeding our AI systems with continuous data feeds of both unstructured and structured data. The other assumption that comes to mind is that some AI systems are built directly on top of source application datasets instead of on top of centralized data systems such as data warehouses and lakehouses. As a way to make sure that we trust the system, we plug it directly into a certain application and only run the AI functionality on top of that source system, which obviously makes it a bit easier to ensure that the data in that system is accurate. But then that means the solution is also a bit more narrow in scope. It's a bit more isolated from the rest of the universe of your data, and it can probably reason and do less for you because it's created on top of that smaller subset of data. If we look at the technical challenges that are limiting data teams from deploying AI across, say, all their datasets: I think most data engineers these days have, to a certain extent, been exposed to building RAG data pipelines. They understand how to feed AI systems, vector databases, and all of that. Vectors are becoming almost a standard data type within the databases they've been using so far; in many cases they don't even have to go to a dedicated vector database. But there are still some barriers when it comes to the setup. And I think the bigger problem we face, when we look at how to scale AI systems' usage of larger datasets while making sure the data is trusted, is that what used to be good enough for an analytical system is not good enough for an AI system. What I mean by that is: prior to AI, if we built an MVP that, for the sake of the example, was just a simple dashboard, even if the dashboard was 80% of what the users asked for, it was usually a great starting point. With AI, delivering a solution that's only 80% of what the users asked for actually exposes us to a great risk, because AI can do so much reasoning on its own and produce very different outcomes than what we would expect, and the consequences of that can be quite scary. I'll give an example from our own deployments. We added to the Boomi data integration platform an AI assistant that helps our users ask different questions about the way the platform works. We connected it to our documentation site and to a few other wikis and data assets that we had, to help shape responses to the questions users would typically ask about how to operate the platform. For example, how to build a change data capture pipeline, or how to orchestrate a certain process.
And what we saw is that many users were asking those types of questions, but we also had a sizable population of users asking pricing questions, which we obviously did not plan for. And this is where we saw the risk of the assistant not answering correctly, or providing information that might make the platform sound complex or position it incorrectly. We were lucky in that case that the answers given were relatively fine, and we quickly mitigated it as we saw it. But that kind of risk was really clear to us immediately, because that's not a scenario that we had considered. AI's ability to do its own thinking really pushes you to the point where you have to make sure that all the data you provide to the AI system is really trusted and that you have the right guardrails around it. So if I take it back to the technical challenges: we do expect, and see in the research as well, that the need to connect AI systems to more data sources is growing. But at the same time, data teams are already quite busy. To make sure they can provide a wider dataset, and not just a small golden dataset, to a certain AI system, there's a lot of automation that will be required in the data pipeline, moving away from some of the manual processes they've been doing so far to ensure data quality. That's going to be a big part of how we can scale that across larger datasets.
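To make the idea of a continuously refreshed, governed feed (rather than a one-off golden dataset) a bit more concrete, here is a minimal Python sketch of a RAG ingestion step that validates documents before they are embedded and indexed. The document shape, the allowed-source rule, and the toy embed() function are illustrative assumptions, not part of any specific platform.

```python
# Sketch: validate curated documents before indexing them for retrieval.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    source: str
    last_updated: str  # ISO date, could be used to reject stale content

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model; here just a toy bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def validate(doc: Document, allowed_sources: set[str], min_length: int = 50) -> bool:
    # Simple guardrails: only index curated sources, skip empty or near-empty text.
    return doc.source in allowed_sources and len(doc.text.strip()) >= min_length

def ingest(docs: list[Document], allowed_sources: set[str]) -> list[tuple[str, list[float]]]:
    indexed = []
    for doc in docs:
        if not validate(doc, allowed_sources):
            continue  # quarantine rather than silently indexing untrusted content
        indexed.append((doc.doc_id, embed(doc.text)))
    return indexed
```

In a production pipeline the same check would run on every refresh, so the AI system keeps getting broader data without losing the curation that made the original golden dataset trustworthy.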
[00:13:40] Tobias Macey:
Another aspect of the source data that we're relying on for these AI systems is that for air quotes traditional analytical use cases, we've largely been dealing with structured data assets. We have a lot of tooling and experience working with those styles of sources, dealing with data warehouse architectures, figuring out how we build quality checks into our pipelines for mutating and populating those data warehouses and delivering them downstream. We have sophisticated lineage tracking. But as we bring data into these vector stores, as we're relying more on unstructured data assets, whether that's just free text or semi structured JSON blobs from various sources, and maybe we're not doing all of the same transformations on those to be able to route them through our structured data pipelines.
How does that complicate the level of trust and level of curation needed to actually provide that information to the AI systems, and maybe what are some of the additional technologies or practices that we need to develop to be able to build the same types of data quality checks when we're dealing with n-dimensional vectors, where trying to actually visualize them makes your head explode?
[00:15:03] Ariel Pohoryles:
Yeah, absolutely. And you're right, I think that's part of it. You know, it's almost too easy to just rely on AI: let's feed AI the unstructured data and AI will figure it out. That works nicely in a pilot, but it doesn't work well in production, and that's part of why I think most organizations are still stuck in pilot mode rather than really productionizing AI solutions. So to your point, what are the new types of data quality checks we can do? How can we actually go about ensuring quality, and do so not just with humans in the loop, which we all know we need? I think a great part of it is applying a lot of the concepts we had around monitoring data quality at our ingestion layer and data storage layer, but doing so at the AI solution layer. We want to make sure that the AI responses are accurate and relevant, and we need visibility not just into the outputs that the AI system provides to the users, but also into the workflow that the AI process went through as it produced them. So getting full access to the trace, having a good understanding of what's happening in each step of the process, the activity log. These are things we've actually invested in quite a bit in our own agent management system. We give users a lot of governance tools, around different guardrails they can define or anomaly detection monitors, that give them full visibility into what is happening so they can ensure quality accordingly. And I think that's why we sometimes see the speed of moving a solution from pilot to production really slowing down on the AI side: because of that uncertainty about what's going to happen from an output perspective. Getting that visibility is really what helps us quickly iterate and tune the models and the solutions to produce the right results, so we can deploy them with minimal risk.
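As a rough illustration of output-side guardrails plus trace logging, here is a minimal Python sketch inspired by the pricing-question example above. The blocked-topic list, the trace structure, and the fallback message are assumptions for illustration, not any vendor's API.

```python
# Sketch: record each workflow step and apply a guardrail to the final answer.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TraceStep:
    name: str    # e.g. "retrieve", "generate", "guardrail"
    detail: str
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

BLOCKED_TOPICS = {"pricing", "discount", "contract"}  # assumed policy, tune per use case

def guardrail(answer: str, trace: list[TraceStep]) -> str:
    if any(topic in answer.lower() for topic in BLOCKED_TOPICS):
        trace.append(TraceStep("guardrail", "blocked: out-of-scope topic"))
        return "I can't help with pricing questions; please contact your account team."
    trace.append(TraceStep("guardrail", "passed"))
    return answer

trace: list[TraceStep] = []
trace.append(TraceStep("retrieve", "3 documentation chunks"))
raw_answer = "Pricing starts at ..."  # pretend model output
final = guardrail(raw_answer, trace)
for step in trace:
    print(step.at, step.name, step.detail)  # a real system would persist this trace
```

The point is that the full trace, not just the final answer, is what gets stored and monitored, so anomalies can be caught and the solution tuned before it reaches production.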
[00:16:47] Tobias Macey:
Another relevant and accompanying takeaway from the report is that, in terms of actually determining data quality and doing the validation work, a lot of organizations are still relying on manual processes for that review, which obviously isn't sustainable even without the requirements of AI, because of the influx of new data sources and the integration requirements across them. And I'm wondering how you're seeing organizations think about that bridge from "I have these manual processes" to "how do I actually build enough consistency in them to turn them into an automated process?" And maybe what are some of the ways that we can either bring existing large language models or fine-tuned models to learn some of those manual processes and patterns and automate them, so that they can scale beyond what a single human or a team of humans can actually process and comprehend given the influx of new sources and new requirements?
[00:17:50] Ariel Pohoryles:
Yeah. I mean, this was one of the most surprising stats for me. We saw that only 42% of organizations have automated data pipelines. You know, I've been working on modern data integration tools for at least the past five years, so I sometimes forget that not all organizations are at the point where their connectors are fully managed. I know that many organizations still don't have automated CDC data pipelines for efficient incremental data extracts, but I'm still sometimes forgetful that there are a lot of manual workflows out there, or at least ETL and data pipelines that are not automated from the point of view of detecting source API changes and updating for them, handling schema drift, or even just automated incremental loads into your warehouse or lakehouse. So that number was higher than I expected. And I think the main hurdle for organizations is that, as much as there are all those shiny LLMs and other options out there to speed up their migration, there's still a lot of legacy, especially with larger enterprises, a lot of legacy platforms out there. A lot of the code and the logic in those legacy platforms is hidden behind GUIs, graphical user interfaces, that are sometimes very hard to understand. Someone built it over ten, fifteen years, and just going into all the business logic behind it, understanding what's happening there, and moving it to plain SQL that you can easily move around is a big task. So I think the fear is real. I understand why the cost of migration is a concern for many organizations. My advice to them is that they need to analyze the ROI of such a migration, and do so not just by looking at the potential value of the new system, but also at the time they spend today maintaining the current platforms and the time they lose when it comes to building new solutions. So I think in an ideal world, what these organizations should do to increase data quality rather quickly is, as always, to pick the high value use cases and start automating those data integration processes. And then from there, hopefully, they can start to see a clear path to planning a full migration off of those legacy systems. But yes, there are still a lot of those legacy systems around, and it's not easy to introduce LLMs on top of them as they stand. So you have to find a way to modernize them and then be able to leverage those options.
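To illustrate one of the "solved problems" mentioned here, schema drift detection, this is a minimal Python sketch of the kind of check an automated pipeline would run on each sync. The column names and types are invented for the example; a real pipeline would read them from the source's information schema or API metadata.

```python
# Sketch: compare the expected source schema to what the source currently exposes.
EXPECTED = {"customer_id": "int", "email": "str", "created_at": "timestamp"}

def detect_drift(observed: dict[str, str]) -> dict[str, list[str]]:
    added = [c for c in observed if c not in EXPECTED]
    removed = [c for c in EXPECTED if c not in observed]
    retyped = [c for c in EXPECTED if c in observed and observed[c] != EXPECTED[c]]
    return {"added": added, "removed": removed, "retyped": retyped}

drift = detect_drift(
    {"customer_id": "int", "email": "str", "created_at": "str", "segment": "str"}
)
if any(drift.values()):
    # A real pipeline might auto-evolve additive changes and alert on breaking ones.
    print("schema drift detected:", drift)
```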
[00:20:04] Tobias Macey:
Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud. Given the introduction of the legacy systems conversation, I think it might also be interesting, and I don't think we illuminated this at the beginning of the conversation, to talk about the composition of the survey respondents in terms of geographies and scales of organization, because talking about legacy systems always makes me think enterprise, which is not a fair association. So I'm wondering if you can talk through some of the ways that you selected the respondents, some of the feedback that you got, and what that distribution looks like in terms of what data you're pulling from.
[00:21:23] Ariel Pohoryles:
I think I mentioned it quickly when I talked about the key takeaway that surprised me, but I'll go a little bit deeper. So we surveyed 300 respondents. They were all data leaders, either senior executives working in data and analytics roles or people actually familiar with the data management and AI strategies within their organization. Those data leaders are the ones that provided the quantitative responses to the survey. We also had a few interviews that went deeper to provide some additional qualitative depth to the answers, but all the stats I've been quoting so far are from the quantitative numbers from those 300 data leaders. They come from the US, the UK, and several other European countries, and we have responses from Canada, Australia, and New Zealand. This was done across top industries like tech, financial services, retail, and manufacturing; those were the top four industries the responses came from. We collected the data back in April and May 2025, processed our findings, and I think we published the results about a month and a half ago.
[00:22:27] Tobias Macey:
And digging a bit more into some of the specifics of the actual AI use cases, AI as a term has largely been co-opted to be synonymous with generative AI and large language models. But in its technical definition, it also encompasses other machine learning systems, the deep learning models that preceded the current generative systems. And I'm wondering how the respondents are generally thinking about AI given that context, and maybe some of the differences in requirements when you're going from a regression-style machine learning model to a deep learning model to these transformer-based architectures that are largely on the generative side?
[00:23:09] Ariel Pohoryles:
I mean, I think the world is now generally associating AI with generative AI. That's where most of the excitement and opportunity is right now. And it's a good question you ask about how that differs from the AI that was produced with ML models, the regression or predictive analytics that we were able to produce using machine learning models. I was thinking about that just a few weeks ago as I was chatting with a friend and going back to some of the projects I ran with the data consultancy team I was managing, thinking about some predictive models that we built using machine learning and how we trained those models. We always used to have the best practice, which is not unique to my team, I think it's what all teams that deploy machine learning models typically do, of setting up your quality or validation process in the form of: I have my current champion model, and I have candidate models that I'm always testing against it to see how well they perform. Over time, I promote the next candidate that performs better than my current champion to be the next running model. That way of monitoring the quality of the machine learning output is much more similar to the way we would monitor the quality of a traditional data pipeline. But it's not the same way we can do it on top of generative AI, because of the ability of generative AI to go in so many different directions, which is why I think we need that extra visibility into what's happening. We can't just rely on ensuring data quality is high at the ingestion level or at the storage layer.
Even if we do master data management processes, and these are all things we have to do, especially because the black box of what's happening with AI output is so big, we still need to add another layer of quality on top of it. The other thing I'll say, which maybe ties back to one of the earlier questions you asked me about those small datasets that AI solutions are being built upon: my assumption is that we see a lot of those generative AI solutions being built on top of the source application systems, not a data warehouse or a lakehouse. And I think what that means for us from a data engineering perspective is that we can no longer look at data quality only from the perspective of the data that we pull into our own storage layer; we also have to think about data quality back at the source system. That's something that as data engineers we often didn't pick as a fight, and we let the business manage it on their own. We would make sure the data is accurate, as far as we can control it, once we ingest it, with master data management solutions and whatnot. But what we need to think about now is: okay, AI solutions could be built directly on top of the source system, so let's make sure that all the effort we've put into making our centralized data layer accurate, or at least our integration processes accurate, also contributes to and benefits the source system. So we need bidirectional syncs that push data back into the application systems as well.
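For readers less familiar with the champion/challenger pattern described above, here is a minimal Python sketch: score a candidate model against the current champion on a holdout set and promote it only if it does meaningfully better. The metric, the margin, and the toy rule-based "models" are assumptions for illustration.

```python
# Sketch: promote a challenger model only if it beats the champion on a holdout set.
from typing import Callable, Sequence

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def maybe_promote(champion: Callable, candidate: Callable,
                  X: Sequence, y: Sequence[int], margin: float = 0.01) -> Callable:
    champ_score = accuracy(y, [champion(x) for x in X])
    cand_score = accuracy(y, [candidate(x) for x in X])
    # Require a small margin so champions aren't churned on noise.
    return candidate if cand_score > champ_score + margin else champion

# Toy usage: two threshold "models" scored on a tiny holdout set.
holdout_X = [0.2, 0.7, 0.9, 0.4]
holdout_y = [0, 1, 1, 0]
champion = lambda x: int(x > 0.8)
candidate = lambda x: int(x > 0.5)
print(maybe_promote(champion, candidate, holdout_X, holdout_y) is candidate)  # True
```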
[00:25:59] Tobias Macey:
That is also an interesting point, where because these generative models can be much more forgiving of the source data that they're working with, you maybe don't need to do as much preparation to get to your proof of concept as you would when building a BI dashboard. That might then encourage non-data teams to be the first movers in terms of actually implementing these AI solutions. And so there's the question of how much oversight or even visibility the data team has into what AI workloads are being used and how they're being used. I'm wondering how that factors into some of the questions of data governance, and maybe some of the ways that the definition of governance needs to expand to incorporate these models, especially when they aren't necessarily ones that are built and deployed in house?
[00:26:51] Ariel Pohoryles:
Yeah, that's a great question. I love that question because, you know, I think it's interesting. Data teams almost naturally inherited the role of being the ones that build AI solutions in the organization. But to your point, everybody can build an agent these days. There are lots of tools out there. It doesn't have to be Boomi (obviously use Boomi if you're listening to this), but lots of tools offer the ability to build agents very quickly and easily, and you can vibe code it, right? You don't even need to be a data engineer to build a solution. So I think the shadow IT that we used to have as a challenge is now becoming a whole challenge around how we manage this shadow AI agent world. And I think we won't be able to fully control or prevent all users in the business from building their own agents. Essentially, what this represents for us is a new sprawl that we'll need to match. If so far the sprawl for data teams was matching different data sources and being able to connect to so many different systems the business adopted, because it's so easy to sign up for a new service that now contains important data you need to bring in to run your business correctly, the same thing will happen with agents. You know, we already hear top executives talking about the agent workforce augmenting a lot of the workers in the business; we're going to have thousands of agents running the business. And I think, similar to how data teams naturally inherited the role of building agents, at least from an organized corporate perspective, we're going to inherit the problem of managing the agents being built across the business, even if we were not the ones building them. That is an area we invest in a lot: not just how we can build agents accurately, but also how we can govern them, whether they were built in the Boomi platform or outside it, because we recognize that users will choose their own systems. And it's important for us to give the same level of visibility and governance around who's using the agents, what the agents are actioning, how I can disable agents, which agents are certified. All those things are important aspects of governing those agents, and we invest quite a bit in providing that visibility for data engineers, because it will ultimately fall on them. The other team it may fall on in the future is the application team, back to the point of agents being built not necessarily on top of the centralized data warehouse, but maybe on top of the application layer itself. And I think what we'll probably see in the coming years is a bit of a convergence between data teams and application integration teams coming closer together. Up until now, they were both operating from very similar data sources. They both had to extract data from NetSuite or from SAP or from other business systems. But the data team pulled that into a lakehouse or warehouse, and the application integration team pushed it into Salesforce or another system. Now, with AI agents, they're all going to start working together much more closely, and the bridge is essentially going to be the agent that helps activate the data and operationalize it, with the right data quality that hopefully the data teams can help ensure.
[00:29:41] Tobias Macey:
Another piece that factors into that question of governance is that, going back to our existing set of tooling for structured datasets and the lineage tracking that goes into them, we will also need to figure out, either by extending the tools that we have, adding new considerations, or adopting the various protocols being thrown out there, how to automatically register these different AI workloads into some sort of catalog. That gives visibility into not just the cost management but the data management: what data you're feeding into it, how often it's being refreshed, and how often it's being used, so that you can decide, okay, maybe this AI system is not worth the spend I'm committing to it because nobody actually cares, or the usage it's producing isn't having the desired effect. And so we need to factor that into the typical life cycle management of the data assets that we build, where we'll build an asset thinking that it's going to be useful, but if it doesn't get used, maybe we can cull it and save on some of those storage costs. How do we extend that to also include these different AI workloads in that overall ecosystem?
[00:30:53] Ariel Pohoryles:
Yeah, 100%. I can't agree more. I think, you know, metadata is having some of its best days right now. It's probably one of the best assets data teams can use to ensure high data quality, and not just high data quality, but also to feed better context to AI agents, so the AI agent output is more tuned and in line with the objectives. And to your point, expanding that metadata, expanding the catalog with agent usage, is essentially the next level of managing that agent sprawl. It's going to have some different characteristics than what we typically see in data catalogs. It's not just about data asset usage; it's also going to be about what those agents are doing, which is not a level that we typically capture today when we think about a data catalog of our dashboard inventory and table inventory across our different data storage layers. This is going to be a catalog that also lists what the agents are actually actioning on our behalf. That's going to be a critical component of it, but it certainly shares a lot of characteristics with what we do today with data catalogs.
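As a rough sketch of what extending a catalog to cover agents might look like, here is a minimal Python example that registers agents alongside the data assets they read and the actions they are allowed to take, and flags uncertified, unused ones. The field names and the disable policy are illustrative assumptions, not any vendor's schema.

```python
# Sketch: a catalog entry type for AI agents, with a simple governance sweep.
from dataclasses import dataclass

@dataclass
class AgentCatalogEntry:
    name: str
    owner: str
    data_sources: list[str]   # assets the agent reads from
    actions: list[str]        # what the agent is allowed to do
    certified: bool = False
    enabled: bool = True
    invocations_last_30d: int = 0

catalog: dict[str, AgentCatalogEntry] = {}

def register(entry: AgentCatalogEntry) -> None:
    catalog[entry.name] = entry

def disable_uncertified_and_unused(min_invocations: int = 1) -> list[str]:
    disabled = []
    for entry in catalog.values():
        if not entry.certified and entry.invocations_last_30d < min_invocations:
            entry.enabled = False
            disabled.append(entry.name)
    return disabled

register(AgentCatalogEntry("order-status-agent", "support-team",
                           ["warehouse.orders"], ["lookup_order"],
                           certified=True, invocations_last_30d=420))
register(AgentCatalogEntry("pricing-experiment", "unknown",
                           ["crm.accounts"], ["send_email"]))
print(disable_uncertified_and_unused())  # ['pricing-experiment']
```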
[00:31:58] Tobias Macey:
And the other piece of the question of data governance, to your point about a lot of these AI systems feeding off of application sources rather than a centralized data lake or data warehouse, is the fact that the whole reason we built data warehouses is that all of those different application systems have a different view of the world, different opinions, different semantics, and we wanted to consolidate them and conform them to the overall organizational view of things. And so by bringing these AI systems back to the edge, we're fighting that same battle again, where we need agreed upon, well-defined semantics for what it is that we're actually talking about when we say customer, when we say purchase, etcetera.
And how are you seeing, based on the results of the survey, organizations address that complexity of the need to move fast and get things into production, or at least into proof of concept, while also wanting to continue that investment in a shared understanding and shared semantics across the different agents that might be deployed across the various sectors of the organization?
[00:33:09] Ariel Pohoryles:
Yeah. And I think that's exactly the challenge we saw earlier on, that paradox we talked about between the high trust in the AI systems deployed so far and the low trust in the overall data quality in the organization. And I think that is exactly the way to close that gap as we think about more useful agentic AI solutions that we can build, that can reason better, that can take into account not just the operational data of one system but multiple systems. It's as if we were to look at a dashboard, but the agent does it itself and decides, okay, this is the best course of action going forward based on the intelligent insight that I've received. This is where we need to make sure that the metadata layer we talked about earlier is really serving us in a way that we can leverage to feed our agents with the right context. That would be the connective tissue to make sure that we don't lose the investment in storing our data in warehouses and lakehouses in a way that can be managed, while also leveraging it effectively via agentic solutions. By the way, I'm not sure that means we have to centralize everything in one place. I think what we've seen in the past few years is that data warehouse vendors also realize that they can't expect all the data to be stored in their own warehouse. So a few years ago we started seeing external tables interacting directly with data in S3 or other bucket storage layers, and now the warehouses are using formats like Iceberg to manage tables as well. I think warehouses are also realizing their role in being able to govern all our different data assets and allow us to work with that unified experience, with the idea of eventually being able to serve it effectively at scale for agentic solutions. So I think we're getting there. The key will be to make sure that metadata is fed across not just the warehouse but also to the agents, so they are able to leverage it.
[00:34:57] Tobias Macey:
And then to your point about having that federated data layer, where maybe you have one place where you go to actually start your journey of working with the data, but you're able to access more than just what is in the warehouse itself, whether it's an appliance or something like Snowflake. You're also able to access some of the more loosely structured or less refined data assets that live in, say, an Iceberg table in S3. How do the additional stresses and the need for more and more data that these AI systems bring along impact the ways that we're thinking about the overall architectural patterns that we depend on to produce and curate those data systems, and to signal the level of cleanliness and accuracy of that data, so we understand which sources to pull from, which ones to use for fine tuning and training, and which ones to use for retrieval that we're going to then summarize and surface back up to a customer, etcetera?
[00:36:05] Ariel Pohoryles:
Yeah. I think it's a challenge that started maybe even before AI demanded it, partially because the cost of storage is cheap and partially because there are new ways to manage data, as you mentioned, with Iceberg and others surfacing and really exposing all those options. And, you know, there has been a big motion around interoperability: how can we now run in a hybrid mode? For some organizations, my advice, and maybe this ties back to my consulting days, is to try to be very practical about it. It's very hard to define a data strategy and implement it without clear business outcomes that you can measure relatively quickly. We're no longer in the days where we can define a five year data strategy and execute on it slowly as we go. We need to see the payback fairly quickly. So if you're an organization that has a single data warehouse platform, like Snowflake as you mentioned, that serves all your needs, then great. I don't think you need to suddenly look at ways to branch out and leverage some of the new options just for the sake of saying you have them. That may actually make your life a bit more complex when you're trying to increase data quality. On the other end, we know that many larger enterprises run in hybrid environments for a reason, and they need to have Snowflake and something else next to it and Iceberg and whatnot. The dream would be to consolidate everything in Iceberg, for example, and have everything operate on top of that, but that's very hard to achieve. So I think it will have to be a journey: going above the level of just a golden dataset that we can serve to an AI system, but maybe not necessarily opening it up to sit directly on top of our entire storage layer. And of course, again, doubling down on our metadata investment and finding that catalog, or even that master data management solution. Metadata is getting a second life; it's something we stopped talking about in the past few years, but now it's coming back into fashion. Having a master data management layer enables us to feed a bigger dataset to our AI systems, but still in a way that's governed and managed so we can trust its quality. So we've seen a lot of customers going back now to deploying their own master data management solution and making sure that they have at least a level of curation that they can fully trust before they push that onto AI systems.
[00:38:15] Tobias Macey:
And the other aspect of the level of consistency and assurance that you have in the underlying data factors into the question of risk when you're putting these AI systems in front of end users in particular, but even when they're just used internally, and their potential for inaccuracies or misrepresentation of the underlying data. In that context, what are some of the biggest blind spots that you're seeing these data leaders encountering or trying to address, to help them understand the ways that these inaccuracies can arise and the ways that they are either introduced by or compounded by their underlying data quality or data investments?
[00:39:01] Ariel Pohoryles:
Yeah. I'm not sure how many episodes you get through without the garbage in, garbage out statement; I'm sure you hear it quite often. It's always been garbage in, garbage out. And to be honest, in most cases, even for AI, we're dealing with the same data quality challenges we've faced for the past twenty years. The only difference now is that AI can scale or expose the data issues so much faster. I think the biggest blind spot, to your question, is that we don't really know how AI solutions will behave in every situation, because the human input can be very different from what we expect and the AI output can be even more different. One of the data points that came out of the survey is that 13% of the data leaders that responded indicated that they had already suffered some kind of damage to their business from the AI systems they deployed. And that's assuming that most of those AI leaders are not in production yet, right? Because we know that 80 to 90% are still piloting AI, not really deploying it at scale. So to me, that's a big number. If 13% of data leaders have already seen some damage, there is obviously a big risk here. And because we can't really control the outputs and we do have a lot of blind spots, I think we really need to double down on what we can actually control: enriching our data correctly, removing duplicates. This is why I mentioned earlier that master data management solutions are getting a resurgence. It's not easy to do proper deduplication and enrichment, and then provide the right context and semantics, without having those types of tools in place. And that's why I think we see more and more interest now in MDM and metadata tools and so on, because that's becoming a critical part of how you make sure you can get trusted data fed in and reduce the blind spots that you have in your solution.
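To make the deduplication point a bit more concrete, here is a minimal Python sketch of the kind of survivorship pass an MDM layer performs before data is fed to an AI system: fuzzy-match customer names and emails and keep one record per entity. The matching fields, threshold, and sample records are assumptions to be tuned per dataset; a real MDM flow would merge attributes rather than simply dropping duplicates.

```python
# Sketch: naive customer deduplication with exact email match or fuzzy name match.
from difflib import SequenceMatcher

customers = [
    {"id": 1, "name": "Acme Corp",  "email": "ops@acme.com"},
    {"id": 2, "name": "ACME Corp.", "email": "ops@acme.com"},
    {"id": 3, "name": "Globex",     "email": "info@globex.io"},
]

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(records: list[dict], threshold: float = 0.85) -> list[dict]:
    survivors: list[dict] = []
    for rec in records:
        if any(rec["email"] == s["email"] or similar(rec["name"], s["name"]) >= threshold
               for s in survivors):
            continue  # a real MDM flow would merge attributes into the survivor here
        survivors.append(rec)
    return survivors

print(dedupe(customers))  # keeps Acme Corp and Globex
```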
[00:40:39] Tobias Macey:
And so given all of the information that you've gathered, and the qualitative information from your interviews, as you're talking to data leaders who are figuring out where to invest their time and resources and data engineers who are trying to figure out how to prepare for this massive influx and this new rate of change, what are some of the concrete steps that you're advising in terms of where to spend their focus and research? What are some of the technologies or tooling or best practices that they need to be building towards now so that they don't end up falling behind over the next six months, twelve months, five years?
[00:41:23] Ariel Pohoryles:
Yeah. Before we get into tooling (if I jump straight to tooling, that's probably the wrong answer), I think the main tactic I would recommend to accelerate the strategy is really to have that lens of the high value use cases. We've seen a lot of AI solutions being deployed out there, a lot of investment in AI solutions that are nice but not really affecting the bottom line. Think of tools that help you summarize notes, or chat bots, or whatnot, but nothing that really helps you activate your data in a way that's going to impact revenue, prevent churn, or improve your business in a way that's significant. I think it will be much easier for data teams to justify any investments they make in a strategy to build a foundation for AI by focusing on the higher value use cases that can then justify and prove the value of the work they do. So once we have that in place, and that's maybe a common recognition but often forgotten, especially with the FOMO there is around AI and all the excitement that's happening, we need to make sure we pick the right use cases to focus on. The second one is really around automation. We talked about that earlier. There are common benchmarks in the industry; I've seen multiple pieces of published research about data engineers spending more than 50% of their time on data pipeline maintenance. That really has to be reduced. Every repeated process they work on, especially solved problems, things like data connectivity, managing incremental loads, or doing deduplication of data, should be automated wherever possible. These are the areas I would automate the most, because this is what's going to, one, improve our data quality, and, two, free up our data engineers to tackle the higher order problems and really get to the point where they can build their AI solutions. So I think the skills and knowledge are there for the most part, and there's obviously more tooling that can always be added, but thinking about automation along the lines of the top value use cases is the starting point for me.
[00:43:20] Tobias Macey:
And as you're working with teams who are trying to tackle this new set of challenges and stay up to date with the rapidly shifting ground on which they're trying to run, what are some of the most interesting or innovative or unexpected ways that you're seeing teams address the data needs for AI applications?
[00:43:40] Ariel Pohoryles:
Yeah. I'm not sure it's an unexpected way, but I think the most interesting way, at least in terms of the value it can provide data teams, is obviously using AI itself, especially to address the need to accelerate the data work for your applications. Use AI for AI, and AI can really be used at every step of the process. Take the example I just talked about around automated data pipelines: one of the interesting data points that came out of the survey was that 83% of companies expect to integrate more data sources for their AI solutions next year. That means data engineers will now need to build new data connections, new data pipelines, against data sources they haven't connected to so far. Using AI for that, for example, could be a real accelerator for them. That's actually something we invested in quite a bit at Boomi. We built what we call a data connector agent, and instead of sifting through API documentation, coding against different pagination settings, defining how data is going to be loaded to the target, and then applying the whole framework around that solution, you can now use our data connector agent to just generate a connector for you. All you need to do is copy and paste the API documentation URL, and the data connector agent goes through it, reads the settings, and provides the connection settings for you. You can, of course, modify it to your own liking if you need to, but otherwise you can just keep going and get your pipelines up and running in a few minutes. The same approach can be applied to other steps in the pipeline building process, and, of course, to the way we monitor it and define its quality. So there are lots of opportunities in that process to introduce AI. I'm personally excited about taking that data connector agent I just mentioned to the next level and being able to not just build a connection very quickly, but also maintain data pipelines automatically with AI. So whenever your source API changes or your schema drifts, being able to automatically handle that with AI, I think that's going to be a really exciting advancement that AI can bring to help us accelerate as engineers.
And I'm looking forward to seeing how organizations use that.
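As a rough illustration of the connector boilerplate Ariel is describing, here is a hand-written sketch of a paginated REST extraction with an incremental filter. The endpoint URL, parameter names ("page", "per_page", "updated_since"), and response shape are hypothetical; a real connector would follow whatever the source API's documentation specifies.

```python
import requests

# Hypothetical endpoint and parameters, used only to illustrate the kind of
# pagination and incremental-load code a connector typically needs.
BASE_URL = "https://api.example.com/v1/customers"

def fetch_all(api_token, updated_since=None, page_size=100):
    """Yield records page by page until the API returns an empty page."""
    headers = {"Authorization": f"Bearer {api_token}"}
    page = 1
    while True:
        params = {"page": page, "per_page": page_size}
        if updated_since is not None:
            params["updated_since"] = updated_since  # incremental load filter
        resp = requests.get(BASE_URL, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        records = resp.json().get("data", [])
        if not records:
            break
        yield from records
        page += 1

# Usage (hypothetical):
# rows = list(fetch_all("secret-token", updated_since="2024-01-01"))
```

Writing, and then rewriting, this kind of code whenever a source API or schema changes is the work that Ariel describes the data connector agent generating and eventually maintaining automatically.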
[00:45:37] Tobias Macey:
And so as you continue to invest in either the work that you're doing at Boomi or the work that you're doing with customers and keeping abreast of the ecosystem, what are some of the predictions or trends that you have for the ways that AI is going to continue impacting the systems and teams that are responsible for data?
[00:46:01] Ariel Pohoryles:
Yeah. I think data teams have traditionally been focused on building data pipelines and the infrastructure to support analytics. And in the past year, even before the rise of AI, I think they realized that they needed to deliver more ROI with data. So they started getting closer to the concept of activating data. It mostly took the form and shape of what's called reverse ETL: how we can push data that we enrich in the warehouse back into application systems and processes. But it always remained somewhat siloed or limited to specific use cases. And again, as I mentioned earlier, I think most of the business process automation was happening in other teams, in the application integration team that sits under IT but is not really part of the data team. So the trend we're seeing is that AI pipeline development will naturally fall under the responsibility of the data team, while the knowledge and understanding of business process automation mostly sits with the IT integration team. What we'll start to see is more and more convergence, those two teams coming together. We've seen a lot of convergence around tooling, and I think now we'll see it around the way the IT organization is structured, bringing those two teams together and finding the higher-value use cases where data can be activated. So I think that's going to be one significant trend. The other one, which we touched on a little earlier, is the fact that agents are going to be spreading across the business in a not necessarily organized fashion, and they will all carry a lot of value. It's going to be a sprawl of AI agents being built into the business.
And I think we'll see more and more tooling and solutions that will help you govern and manage those AI agents, similar to the way we have tooling to manage your data. So these are two trends that, at least from a Boomi perspective, we've seen quite a bit, and that we're investing in from a solution perspective accordingly.
[00:47:51] Tobias Macey:
Are there any other aspects of the survey details and the insights that you gained from that or this overall space of data preparedness for AI that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:04] Ariel Pohoryles:
I think we covered a lot. You know, eventually one may think the challenges are the same challenges we've always had around data management, and maybe that's the biggest takeaway of this whole discussion. Yes, we've always faced similar data management challenges and tried to solve them to the best of our abilities. But I think now we realize that AI will just expose and scale those issues where we have bad data, in a way that's going to be so much more impactful for the business. That is why we really need to get ahead of it as quickly as we can. Because of that black box, we need to make sure we control what we can control and then deploy responsibly, with the right tracking and monitoring behind it. So that's the biggest takeaway. Workloads are going to expand for data teams, and using AI to facilitate that as much as possible is, I think, the only way out, because data teams are probably not going to expand much, and may if anything shrink. So they'll have to be really smart about the way they leverage AI tooling to be able to serve those AI solutions.
[00:49:04] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:20] Ariel Pohoryles:
Oh, wow. That's a good question. I think data teams have traditionally been focused on building data pipelines and trying to deliver data as quickly as possible in the most efficient way. And what we've seen in the past few months, maybe even a year or two, is that we're entering an era of consolidation. I think that's happening because data teams realized that finding the next best option for a specific task introduced so much more complexity into their overall platform. So they're now pivoting back to a consolidated motion: let's have a platform that can do more. Maybe it's not necessarily covering every area of the automated pipeline with the best possible solution, but it at least gives me the option to move forward without having to maintain too much process behind it or be buried in too much complexity. So as we continue to see that consolidation, what we'll need to watch for, and I think the tools that will emerge as the winners, are the tools that offer users the ease of use they were able to gain with point solutions that were very good at something specific, while still maintaining the ability to understand what the tool is doing, especially when AI introduces such a big blind spot or black box for us in many scenarios. At least understanding what we're doing on the data engineering side is going to be critical. So getting that full transparency along with that ease of use, that's something that not a lot of tools out there are able to offer today.
[00:50:56] Tobias Macey:
Thank you very much for taking the time today to join me and share the research that you and your team have done and some of the insights that you pulled out of it. It's definitely a very interesting and necessary place to be focusing, helping organizations come to grips with what they need to be doing looking forward, because the whole world is changing whether we like it or not. So thank you for helping folks get a little bit ahead of that, and I hope you enjoy the rest of your day. Likewise. Thank you. I enjoyed that conversation. Thank you.
Intro to guest and topic: AI readiness in data management
Ariel's background across engineering, consulting, and product marketing
Survey goals and Boomi's AI and data focus
Methodology and surprising trust gap in data quality
Reconciling AI data trust vs. overall data trust
From golden datasets to scaling AI on broader data
Unstructured data, vector stores, and quality checks at the AI layer
Manual vs. automated pipelines and the legacy hurdle
Who was surveyed: roles, regions, industries, timing
What AI means here: ML vs. generative and quality monitoring
Shadow AI and the need to govern agents across the business
Extending catalogs and metadata to cover AI agent actions
Semantics at the edge and shared definitions across agents
Architectures, federation, and practical paths to scale
Risk, blind spots, and the resurgence of MDM and metadata
Advice: focus on high value use cases and automation
Using AI to build and maintain data pipelines
Predictions: convergence of data and app integration; agent sprawl governance
Final reflections: AI amplifies data management stakes
Closing, contact, and biggest tooling gap: consolidation with transparency