In this episode Suresh Srinivas and Sriharsha Chintalapani explore how metadata platforms are evolving from human-centric catalogs into the foundational context layer for AI and agentic systems. They discuss the origins and growth of OpenMetadata and Collate, why “context” is necessary but “semantics” is critical for precise AI outcomes, and how a schema-first, API-first, unified platform enables discovery, observability, and governance in one workflow. They share how AI agents can now automate documentation, classification, data quality testing, and enforcement of policies, and why aligning governance with user identity and intent is essential as agentic access scales. They also dig into scalability strategies, MCP-based agent workflows, AI governance (including model/agent tracking), and the emerging convergence of big data with ontologies to deliver machine-understandable meaning.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
- Your host is Tobias Macey and today I'm interviewing Suresh Srinivas and Sriharsha Chintalapani about how metadata catalogs provide the context clues necessary to give meaning to your data for AI systems
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the roles that metadata catalogs are playing in the current state of the ecosystem?
- How has the OpenMetadata platform evolved over the past 4 years?
- How has the focus on LLMs/generative AI changed the trajectory of services like OpenMetadata?
- The initial set of use cases for data catalogs was to facilitate discovery and documentation of data assets for human consumption. What are the structural elements of that effort that have paid dividends for an AI audience?
- How does the AI audience change the requirements around the cataloging and presentation of metadata?
- One of the constant challenges in data infrastructure now is the tension of making data accessible to AI systems (agentic or otherwise) and incorporating AI into the inner loop of the service. What are the opportunities for bringing AI inside the boundaries of a system like OpenMetadata vs. as a client or consumer of the platform?
- The key phrase of the past ~2 years is "context engineering". What role does the metadata catalog play in that undertaking?
- What are the capabilities that the catalog needs to be able to effectively populate and curate that context?
- How much awareness does the LLM or agent need to have to be able to use the catalog effectively?
- What does a typical workflow/agent loop look like when it is using something like OpenMetadata in pursuit of knowledge that it needs to achieve an objective?
- How do agentic use cases strain the existing set of governance frameworks?
- What new considerations (procedural or technical) need to be factored into governance practices to balance velocity with security?
- What are the most interesting, innovative, or unexpected ways that you have seen OpenMetadata/Collate used in AI/agentic contexts?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata/Collate?
- When is OpenMetadata/Collate the wrong choice?
- What do you have planned for the future of OpenMetadata?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- OpenMetadata
- Hadoop
- Hortonworks
- Context Engineering
- MCP == Model Context Protocol
- JSON Schema
- dbt
- LangSmith
- OpenMetadata MCP Server
- API Gateway
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. WHOOP and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's Migration Agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into weeklong success stories. Your host is Tobias Macey, and today I'm welcoming back Suresh Srinivas and Sriharsha Chintalapani to talk about how metadata catalogs and platforms provide the context clues necessary to give meaning to your data for AI systems. So, Suresh, can you start by introducing yourself?
[00:02:10] Suresh Srinivas:
Hi. I'm Suresh Srinivas. I've been in the data space for a long time. I started my journey building Hadoop as part of the core team at Yahoo, and went on to take those systems that we had built to the enterprises to solve big data challenges around 2011. Towards the end of my journey at Hortonworks as a co-founder, I realized that data was actually very challenging for the enterprises. So I wanted to understand those challenges and joined Uber, a world class data company, to understand the user side of all the platforms that we had built. Some of the learnings from that are how OpenMetadata started. Now I'm a co-founder of the OpenMetadata open source project, and I'm also building a company called Collate around OpenMetadata. And, Tobias, we are super excited to be back again.
And hello to all the listeners of the podcast.
[00:03:09] Tobias Macey:
And, Harsha, if you can introduce yourself as well.
[00:03:13] Sriharsha Chintalapani:
Definitely. So I've been in the data infrastructure space for almost two decades. It's a similar kind of journey to Suresh's: I was at Yahoo and then joined Hortonworks with Suresh. A lot of data challenges and a lot of open source experience motivated us to start OpenMetadata, the metadata platform that we're building in open source. I've been a co-founder and CTO of Collate, the company behind OpenMetadata. And we are excited to share our story and all the things that we have done since we last met. So, yeah, thanks for having us.
[00:03:43] Tobias Macey:
And going back to you, Suresh, can you share again how you got started working in the overall data space and why it is that you've stayed there?
[00:03:52] Suresh Srinivas:
Yeah. So, you know, data is transformative, right? It can transform societies and create new innovations. Harsha and I have been in the data space for that reason. And you've seen what Hadoop brought to the data space. Before Hadoop, storing large amounts of data and processing large amounts of data was not possible, right? And through a solution based on commodity hardware, it made data accessible for storing and processing large amounts of data, to understand the world around us through data. And so the potential of data is what I'm super excited about.
[00:04:35] Tobias Macey:
And Harsha, how about you?
[00:04:37] Sriharsha Chintalapani:
Yeah, so data has been the core backbone pretty much all across the industries. That's what's driving much of the innovation that we are seeing, including the LLMs and the AI work that the industry is doing right now. And there are many facets to data itself, right? There are infrastructure problems: how do you scale, how do you process petabytes of data efficiently, and how do you store them? That's part one. And another part of that is how do you actually make it usable to your organization? How do you actually extract the business relevance and the insights efficiently, educate your users, and transform the business that is operating now? So there are many challenges. You solve the infrastructure problem, then there is an education problem, there is an insights problem. So there is no lack of innovation in data. That's what kept me going in data.
[00:05:28] Tobias Macey:
And so as you mentioned, you're both working on OpenMetadata, and you also have Collate as a managed offering with some additional features there. The overall space of metadata catalogs and metadata platforms really saw pretty substantial growth in the 2020 through 2022 time frame as this whole modern data architecture started to take shape and all of these different workflows were scattered across 10,000 different tools, so you needed a way to stitch that back together. There's been a decent amount of consolidation and rearchitecting post the modern data platform, and I'm just wondering if you can give your perspective on the way that metadata catalogs and metadata platforms have shifted in terms of their roles and use cases, especially as generative AI and agentic use cases have really exploded in the past two years.
[00:06:28] Suresh Srinivas:
Yeah. So let me step back a little bit. When we were solving the data challenges that we saw in many organizations as customers of Hortonworks and at Uber, we realized pretty early on, from first principles, that three things stood out for us. First, the context of metadata is important for people to understand their data and do things with it. The second thing we realized is that data got democratized and self-service, right? Nearly one third of people within an organization are, in some way, shape, or form, data practitioners. They are using data for their day to day decisions.
And we saw that these people were disconnected from each other and were not collaborating with each other, which was the cause of many other problems that we saw in the space of data. And then finally, the third thing that we saw is that people who are exceptional with data were spending a lot of time on mundane tasks, right, like cleaning up the data, documenting the data, things like that. We saw that this needs to be automated in order for people to make the best use of data and create outcomes with data. So that's the learning with which we started the OpenMetadata project. Now, at the heart of all of this is metadata, right? It's the context that it provides to people to enable full understanding. When I say full understanding: what data we have, which is the right data to use, whether it should be used in a certain use case or not from a governance perspective, is the data ready to be consumed to create an outcome, can you trust this data? All of this is, in our opinion, the complete context of data. The second thing is, once you have this full context of data, you can build a lot of tools and automation using this context. And so in order to showcase that, in OpenMetadata, we actually built unified discovery, observability, and governance as workflows that were enabled by metadata. So that's how we started. When we started, we said metadata is at the heart of solving all the problems. And now it has taken a new form, right? Context is king, and context is important for AI. So AI is just another tool as far as we are concerned. For four years, we've been saying that metadata is at the heart of solving many of the data challenges: managing the data, organizing the data, understanding the data, using the data to create the right outcomes. Everything starts with metadata. And because metadata is so important, we ended up four years ago building a project from scratch. Even though we had built many metadata platforms, we decided that we should build this metadata platform the right way, with the right controlled vocabularies and a schema-first, API-first approach to metadata.
And so we ended up starting the OpenMetadata project to build a metadata platform that enables not only the unified experience that we built within OpenMetadata, but also all other kinds of tools. Now AI is a big consumer of this context, this metadata context.
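To make the schema-first, API-first idea concrete, here is a minimal sketch of an entity contract expressed as JSON Schema (as a Python dict) and validated with the `jsonschema` library. The field names and enums are illustrative assumptions, not OpenMetadata's actual Table specification.

```python
# A minimal sketch of a schema-first entity definition with controlled
# vocabularies expressed in JSON Schema. Field names are hypothetical.
from jsonschema import validate  # pip install jsonschema

TABLE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "tier": {"enum": ["Tier1", "Tier2", "Tier3"]},  # controlled vocabulary
        "columns": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "dataType": {"enum": ["STRING", "INT", "TIMESTAMP"]},
                    "tags": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["name", "dataType"],
            },
        },
    },
    "required": ["name", "columns"],
}

# Every producer (ingestion job, UI, agent) validates against the same
# contract, so every consumer, human or LLM, sees one predictable shape.
validate(
    {
        "name": "dim_customers",
        "tier": "Tier1",
        "columns": [{"name": "customer_id", "dataType": "INT", "tags": []}],
    },
    TABLE_SCHEMA,
)
```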
[00:09:42] Tobias Macey:
And the last time that we spoke early on in the life cycle of open metadata, I think maybe even before you launched the Collate platform, was back in November 2021. And I'm wondering if you can just summarize some of the bigger changes that have occurred in that platform and in the offerings and applications of that system over that time.
[00:10:06] Sriharsha Chintalapani:
The community itself has been growing like crazy. I think we've reached around 12,000 members, and not just users per se, but also contributors from various different companies. One of the core principles from the very beginning was: hey, we're going to out-innovate in the metadata space in open source, right? And that velocity has continued even at this scale; what's surprising is actually keeping up that pace at this scale. You know, as platforms grow, they become slower, but we are continuing to keep up the velocity. In terms of the innovation here, I think our average contributions are around 10 PRs or more in a month, and the platform itself is among the top OSS projects across the data space. In terms of foundational themes and the features that we developed, what we started with as metadata specifications and APIs has now transformed into a unified metadata platform. When we say unified metadata platform, it's not just your discovery experience, but also data quality and observability and your governance. All of these features coming together, and all of these different personas coming together. The product has now been deployed across 3,000 organizations that we know of. Again, these are the community members that are coming and reaching out and telling us those stories. So the product itself became the metadata platform, and it's getting adopted worldwide in large organizations.
So that's been the growth that we have seen. And in terms of feature completeness and everything else, I don't think there is any other platform that even comes close to it. We offer around 120-plus connectors; pretty much any service you can name in your data infrastructure, we probably already have a connector for it. So that actually enabled our adoption as well.
[00:11:43] Suresh Srinivas:
Yeah, I think one comment I would make is that last time we met, Harsha was saying we have hundreds of community members and 30 contributors. You can see the project has had tremendous momentum, like Harsha was saying: from hundreds to now 12,000-plus community members, and from 30 to 300, close to 400, contributors now. And we have been very busy at work. This has been really amazing, the velocity of the project. We have had around 180 releases in the last four years. That's a lot of releases, and a lot of innovation.
[00:12:23] Tobias Macey:
One of the key aspects of what we're talking about now is the role that open metadata and these metadata platforms play in the challenges of context engineering for LLMs and agentic use cases. And I'm wondering how the introduction of those uses and their broad adoption with caveats have really shaped the trajectory of the work that you're doing on open metadata and the ways that other people in the space are thinking about the role of metadata systems for these use cases.
[00:12:58] Suresh Srinivas:
Yeah. So if you see, LLMs are powered by data, right? How do you power LLMs with data? Context becomes very important. The meaning that LLMs gather out of data becomes very important. So OpenMetadata, as the layer that provides the context of data to LLMs, is how we are empowering LLMs within enterprise organizations to use AI with the data inside the enterprise. It's a powerful enabler of LLM and AI use cases around the data within organizations. Second, LLMs themselves have been very interesting for us. If you remember, four years ago we were talking about how important automation is, and we were building automation frameworks of the previous generation. With LLMs, these automations are supercharged. You can now understand a governance policy document. You can actually generate documentation without needing human beings to do it.
I think LLMs as a way to manage and organize data are very interesting: AI getting data ready for AI, right? And LLMs as a way of supercharging the automation that we had in mind four years ago. We could not have imagined then that we would be at a place right now where LLMs can actually automate a lot of data organization and management. That is what we are super excited about.
[00:14:31] Tobias Macey:
I think it's also very interesting how, when these metadata systems were first in their growth phase in the early twenty twenties (feels weird to say that since we're now at about the midpoint of the twenty twenties), their primary consumer and audience was humans who were trying to find data across all of these scattered systems and understand where it came from and how it was manipulated on the way to get there. You mentioned the importance of governance and being able to colocate that with that discovery element to understand what data do I have and what are the access controls around it. What are some of the structural elements of the human consumption focus of these systems that have paid dividends as we onboard these LLMs and agentic use cases? And what are some of the ways that the context cues and structural semantics of those interfaces have needed to change to adapt to these newer access patterns?
[00:15:31] Suresh Srinivas:
Yeah. I think it's a question that we'll need to discuss across many other questions; there are a lot of details to cover here. One thing that I would say is that when we started, with metadata as the heart of this metadata platform, we wanted to bring different practitioners, be they data engineers, data scientists, business analysts, governance folks, or business users, together around the data. But each of them requires different kinds of context or information that they want to consume, and different kinds of workflows. A data engineer requires a different workflow in terms of metadata; for example, observability is very important for a data engineer, versus a business user, for whom understanding the data, when to use it, those kinds of things, and only seeing the information they need, not all the underlying plumbing, was important. So providing that information to the business user and the corresponding workflows is how we enable the collaboration layer in OpenMetadata: bring all these different people together, give them the right context and information and the right workflows. Now what has happened is very interesting: this information is also important for AI, and these workflows are also important for AI. So what was built for human beings is now being used within our own platform by these AI agents, to enable them with the right context and with the right tools and workflows. And a lot of things that human beings were doing can now be automated with AI. So that's the evolution that we are seeing and that we are excited about. That is where we are going.
[00:17:22] Tobias Macey:
One of the challenges too with these agentic systems is that the rate of interaction is much different than with a human. What are some of the scalability challenges that you have encountered as you're onboarding more of these workloads onto Open Metadata as a consumer?
[00:17:42] Sriharsha Chintalapani:
Yeah. I think, fundamentally, when we developed OpenMetadata, the architecture choices that we made inadvertently paid off really well when we adopted all of these LLM agents and whatnot. To start with, expressing the metadata in standard schemas, in JSON Schema, along with the relations, actually helps us not throw the entire knowledge graph at the LLM, but rather a very focused one. Say the user is asking a question like: I belong to the finance domain and I want sales data. Instead of throwing millions of data assets at the LLM, we focus on the fact that this person who is asking belongs to the finance domain, that relation, and, within that, what content satisfies that particular question. So that's where we're focusing in terms of how you scale all of these questions. And then there's the scaling question of OpenMetadata itself. You know, we're coming from Uber and Yahoo, and we worked on Apache Kafka, Storm, and all the distributed systems. A lot of that knowledge, a lot of those learnings, actually served us and helped us build a metadata platform that can scale, not just for human interactions but for LLM interactions. At the end of the day, whether it's a human asking, a bot asking, or an LLM asking, they are going to go to the APIs. So that part we scaled really well from day one.
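A minimal sketch of that domain-scoped retrieval, assuming a simple in-memory asset index; the asset fields and the naive ranking are hypothetical stand-ins for the knowledge graph and semantic search described here.

```python
# Sketch: scope the knowledge graph to the asker's domain before anything
# reaches the LLM, instead of serializing millions of assets into context.
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    domain: str
    description: str

ASSETS = [
    Asset("fct_sales", "Finance", "Daily sales facts by region"),
    Asset("dim_employee", "HR", "Employee master data"),
    Asset("fct_revenue", "Finance", "Recognized revenue by month"),
]

def context_for(question: str, asker_domain: str, budget: int = 2) -> list[Asset]:
    """Filter by the asker's domain, then rank by naive keyword overlap."""
    candidates = [a for a in ASSETS if a.domain == asker_domain]
    words = set(question.lower().split())
    candidates.sort(
        key=lambda a: len(words & set(a.description.lower().split())),
        reverse=True,
    )
    return candidates[:budget]  # only this slice is serialized into the prompt

print(context_for("where is our sales data", asker_domain="Finance"))
```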
[00:18:55] Suresh Srinivas:
The second challenge that I also see with the AI approach: you've seen it in analytics, right? People have optimized data for answering questions. There are so many derived tables that are tuned for answering a particular question. And other times the answers are already there: you already have a dashboard, you already have some report that has been generated. One of the challenges that I see is if we take the brute force approach, where you have your conversational interface and start by asking a question, and then that question turns into a query, and that query needs to be run on your data warehouse or any of your other data systems.
It's going to result in an explosion of workload on your database systems. Now, one of the things that we can do, because we have all this context in our unified knowledge graph of metadata, is that if you're asking a question, it can actually say: hey, by the way, for this question you don't need to build a dashboard from scratch; there already exists a dashboard. Would you like to see that? It's already been certified; some human in the loop has looked at it. So mapping to the artifacts you already have that answer your question is going to be very important. Otherwise, every query is going to run on your database, and your Snowflake bills are going to just explode. So mapping to what has already been created, understanding the question, and guiding the user to that is going to be very important. Those are the kinds of things that we are doing.
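A rough sketch of that routing idea, using naive string similarity as a stand-in for real semantic matching; the dashboard titles, URLs, and threshold are hypothetical.

```python
# Sketch: before generating a fresh query, check whether a certified asset
# already answers the question, so no warehouse query is issued at all.
from difflib import SequenceMatcher

CERTIFIED_DASHBOARDS = {
    "Quarterly revenue by region": "https://bi.example.com/dash/42",
    "Customer churn overview": "https://bi.example.com/dash/17",
}

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def answer(question: str, threshold: float = 0.5) -> str:
    best = max(CERTIFIED_DASHBOARDS, key=lambda title: similarity(question, title))
    if similarity(question, best) >= threshold:
        # A human-curated, certified answer already exists.
        return f"Already answered by certified dashboard: {CERTIFIED_DASHBOARDS[best]}"
    return "No certified match; fall back to text-to-SQL generation."

print(answer("show me revenue by region for the quarter"))
```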
[00:20:35] Tobias Macey:
One of the interesting aspects too of the application of data to these AI systems is that it's no longer a one way street; we're actually using a lot of these agentic capabilities and LLMs in pursuit of all of the data engineering work that we're doing. And one of the perennial challenges since we first started having data warehouses back in the eighties and nineties has been the documentation challenge of: what is this data? Where did it come from? Why do I care about it? And in particular, how do I make that accessible to a nontechnical audience? I'm wondering how you're seeing some of the boundaries shift as far as agents and LLMs as consumers, but then also bringing them inside the system to contribute to some of that documentation and context generation for those other consumers, be they human or agentic?
[00:21:31] Suresh Srinivas:
Yeah. So, Tobias, if you look at the last fifteen years, people working hard to get data ready for people has itself failed, right? You see constant data quality and data observability issues, lack of trust in data, and all of that. And access control today in many organizations is a lot of work. Giving the right access for the right use cases to the right people is itself a huge amount of work that is not automated well. Now imagine you have all these agents coming in that are going to be accessing the data. I'll talk about the security challenges of agents accessing your data at a later point. But when the way people use data moves to agents and AI applications, there's going to be an explosion of data consumption. Now, making sure of which agents should get access to the data, and what kind of context they need to do the job the right way, all of this is going to be super challenging.
And we believe that people getting data ready for people has itself been challenging and error prone. Now, people trying to get data ready for AI is not going to be scalable. So from that perspective, the only way to get data ready for AI is to use AI to make the data ready. It's not going to be humanly possible without that. And so we've been doing a lot of agentic work; we have an agentic framework now built on top of OpenMetadata to get the data ready for AI. Now, what does this look like? The moment you connect, let's say, your Snowflake, or any of the various tools that you use in your data ecosystem, what was not possible in the past is now possible with GenAI. We automatically not only get the technical metadata, we document it. We can document it because of the unified knowledge graph: what the table is, what the column name is, what the lineage is, who is using it, and what the social signals around the data are.
All of that is used for documenting the data. And that's how we get the documentation to much higher quality than just looking at your table or columns standalone. We look at all aspects of metadata. We are also doing a few things which are pretty unique to OpenMetadata. We have a concept of tiering. You know the eighty-twenty rule: 20% of your data is important, 80% is not important and probably should be deleted; nobody uses it, it's self-service created. So we also tier your data to indicate what is important data and what is not. That way the important data is used for important use cases. So when an LLM is searching for customer data and 100 tables come back as a result, it should be able to find the right customer table, and tiering plays a big role in that. A few other things that we are also doing to build trust: we automatically create data quality tests. This was not possible before. This was not the kind of automation we could have enabled four years ago. Now we can. And so probably months and maybe years of work in big organizations can now be automated. It's AI agents not only bringing in your metadata, organizing it, and enriching it so that people can use it; it even becomes context for AI, so that AI can use the data. So that is a huge change. Harsha, you wanna add anything?
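One way such auto-generated data quality tests might be bootstrapped from profiler statistics, sketched below; the test spec format and thresholds are hypothetical, and a human would still review the suggestions before they run.

```python
# Sketch: derive candidate data quality tests from column profile stats,
# the way an agent might bootstrap coverage for later human review.
def suggest_tests(column: str, profile: dict) -> list[dict]:
    tests = []
    if profile.get("null_ratio") == 0.0:
        # Column has never been null; lock that in as an expectation.
        tests.append({"column": column, "test": "valuesToBeNotNull"})
    if profile.get("distinct_ratio") == 1.0:
        tests.append({"column": column, "test": "valuesToBeUnique"})
    if "min" in profile and "max" in profile:
        tests.append({
            "column": column,
            "test": "valuesToBeBetween",
            "min": profile["min"],
            "max": profile["max"],
        })
    return tests

print(suggest_tests("order_total", {"null_ratio": 0.0, "min": 0, "max": 25000}))
```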
[00:24:58] Sriharsha Chintalapani:
Yeah, I think along the same lines is auto-classifying the data itself, right? That's part of the agents as well. And one of the unique things that we do is actually push any changes that are happening back to the source system as well. Now OpenMetadata becomes your entry point, with AI agents enabled: your data is organized well, your data is documented and classified. And the source systems are aware of the changes that you are making here. Now, if there is PII data that I'm tagging, or AI agents tag it as PII data, the source system also needs to know this tag so that you can enforce access control policies, retention policies, and whatnot. So the metadata platform of this generation, with the help of AI agents, is getting everything ready for consumption. A huge number of steps are removed from: I'm going to set up OpenMetadata today, then document and classify all of these things. All of those are shortened,
[00:25:51] Suresh Srinivas:
and your data is now ready to use. The second thing that I also see here, Tobias, is that in data organizations, I've always seen this disconnect where you have people who are building security policies and governance policies. They understand the domain, but they don't understand the data. And the data people, the data engineers, understand the data, but they don't understand the policies and governance aspects of it. And the success of data projects has always depended on the people in the middle who understand both of these: the policy aspects, the business side of things, and the data side of things. And they are not scalable; there are only very few people who can do this. In large enterprises, they became the bottleneck for translating policies into workflows that get acted upon, that get converted into how the process shapes up. Now what we have is agents that can actually take these policy documents and translate them into what needs to happen. I believe there will be a human in the loop here, but these human beings who bring this special expertise are now scalable, with the agents that we are building that can understand security policies and governance policies, translate them, and enforce them on the other side, on the data, as workflows and processes.
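A sketch of that human-in-the-loop policy translation, with the LLM extraction call stubbed out and a hypothetical rule schema; nothing is enforced until a reviewer approves.

```python
# Sketch: an agent drafts a structured rule from a prose governance policy,
# but the rule only takes effect after human sign-off.
REVIEW_QUEUE: list[dict] = []

def draft_rule_from_policy(policy_text: str) -> dict:
    # Stand-in for an LLM extraction call returning structured output.
    return {
        "resource_tag": "PII.Sensitive",
        "allowed_roles": ["DataSteward"],
        "action": "deny_read",
        "source_policy": policy_text,
        "status": "pending_review",  # nothing is enforced yet
    }

def submit_for_review(policy_text: str) -> None:
    REVIEW_QUEUE.append(draft_rule_from_policy(policy_text))

def approve(rule: dict) -> dict:
    # A domain expert confirms the translation before enforcement begins.
    rule["status"] = "active"
    return rule

submit_for_review("Only data stewards may read customer PII.")
print(approve(REVIEW_QUEUE[0]))
```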
[00:27:22] Tobias Macey:
Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a $1,000 credit to migrate to Bruin Cloud. The other aspect of using a metadata platform to expose context to these agentic systems is in the name there: context, where context engineering has become the phrase that I hear probably most often over the past few months. And it is, on the one hand, a very nebulous idea of, oh, you just need to make sure you give it the right information. But on the other hand, it's also a very hard engineering problem of, well, what is the right information? How do I determine what that is? How do I make sure that I'm being as parsimonious as possible while also making sure that I send enough information? And I'm wondering what are some of the ways that the metadata platform needs to be aware of the consumer of the information to manage some of that differentiation between: oh, it's a human, I'll give as much information as possible because they're able to review it and spend a little bit more time with it, versus: this is an agent, I need to compact and consolidate the information as much as possible so that the LLM can inference based off of that without completely blowing its context window.
[00:29:18] Suresh Srinivas:
Yeah. I mean, four years ago with OpenMetadata, we said context is very important. Context is very important for people and tools, and for us, AI is just another tool. Now, context has become a real buzzword, especially riding on top of MCP, the Model Context Protocol. What we feel is that, just like people need context to use the data the right way (what is the data, what does it mean, when to use it, is it ready, are there any problems with the data, and the security and governance policies associated with it), the same is important for AI and AI applications. Now, when it comes to human beings, a lot of context was provided through documentation and stuff like that. When human beings did not understand what something means, maybe because it was not precise, they would actually ask other people: hey, what does this mean? Am I understanding it the right way? That is not possible with AI agents. And if you look at just human beings, whenever they made assumptions without really understanding, the outcome they created was incorrect, right? It has caused a lot of data quality and reliability issues. Now, AI is going to make an assumption based on whatever context you have provided. If you say Apple, it might think it is a fruit versus a company. So to us, context is not sufficient. Four years ago we were saying context is important; what we have since realized is that in the world of AI, context is not sufficient. Context gives you the what and where of data, that kind of stuff. In the world of AI, semantics, or meaning, is going to be very important. And that meaning has to be precise if you are to have good outcomes with AI. Otherwise, there are going to be hallucinations, there are going to be assumptions, there are going to be a lot of wrong answers, and you're going to see a phase where people who approach the problem with context but without semantics have a lot of poor business outcomes. So our realization is that semantics becomes important. And now, just like context became a buzzword, you're also seeing in the last six months semantics becoming a buzzword: how you do semantics in a machine-understandable way. When we built the context of metadata, we used controlled vocabularies in JSON Schema. That is machine-parsable and understandable. Now, with AI coming in, even the context provided through documentation and structure is not going to be sufficient. It has to have some kind of ontological underpinning to give precise meaning to AI. That is where we feel the challenge of metadata is going to move. Context is not sufficient; semantics is going to be very, very important.
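A small sketch of the difference between context and semantics: two glossary senses of "Apple" carry typed ontology references, so a program (or an LLM tool call) can resolve the term precisely instead of guessing. All identifiers and URIs are hypothetical.

```python
# Sketch: free-text descriptions are context; typed ontology references are
# semantics that a machine can resolve unambiguously.
GLOSSARY = {
    "apple#fruit": {
        "label": "Apple",
        "ontologyRef": "http://example.org/ontology/Fruit",
        "definition": "Edible fruit of the apple tree.",
    },
    "apple#org": {
        "label": "Apple",
        "ontologyRef": "http://example.org/ontology/Organization",
        "definition": "Consumer electronics company.",
    },
}

def resolve(label: str, expected_type: str) -> dict:
    """Pick the sense whose ontology type matches what the use case expects."""
    for term in GLOSSARY.values():
        if term["label"] == label and term["ontologyRef"].endswith(expected_type):
            return term
    raise LookupError(f"No sense of {label!r} with type {expected_type!r}")

# A revenue question implies the Organization sense, never the Fruit one.
print(resolve("Apple", expected_type="Organization"))
```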
[00:32:23] Tobias Macey:
Another interesting aspect of the metadata system being a contributor to the context is that it's also likely not to be the only source of context that the agent needs to perform a given task. And I'm wondering what are some of the responsibilities of the metadata platform to ease that transition from discovery to retrieval in pursuit of performing a given task. Maybe I have an agent, and every time a new table is registered in the metadata system, I want it to go and do some profiling of that table, then populate some additional details in the metadata catalog, and then send an alert to somebody to review those changes. What are some of the ways that the metadata platform, and OpenMetadata in particular, should facilitate that workflow to remove the necessity of a completely separate tool for discovering, okay, well, what are the connection details to actually retrieve that data, being able to interact with that system, and simplifying that round tripping?
[00:33:29] Sriharsha Chintalapani:
Yeah. I think I can take that. So we start with collecting the metadata itself. Taking this specific example of table creation: when a table gets created, OpenMetadata is notified of that change, and the table appears. From an agentic perspective, we have an MCP server, and one of the unique things that we have done in the MCP server is to provide semantic search on top of the knowledge graph that we have. And not only that, we made it access-control based: whether a user is asking the question, a bot is asking the question, or your agents are performing autonomous tasks, the permissions can differ on what access they have and what outcomes they can perform. So with that, when the table appears, the agent can actually get notified and start issuing the commands, in terms of profiling and other things, just for this example. That's one of the unique things that we are bringing in with the unified metadata platform: you don't have to go into another tool to build this. It's all in there, specifically in terms of profiling and sending alerts and everything else. All of these tools are exposed through the MCP server, and your agents can actually perform these things. Now, what gets more interesting is how you connect even wider tooling. What happens if you want to create a data model from an existing table? You probably need dbt, and you want to schedule that dbt model. So now these agents can rely upon the knowledge graph that we have and say: hey, there's a customers table and an orders table, and I want to create a customer ARR metric, which requires data from both these tables. And I want to do it only one time; I want to create a model so that my analysts can actually use it and build dashboards or whatnot. So you can go and use the OpenMetadata MCP server, understand the context of the customers and orders tables and the schemas where that data is flowing, create the dbt model, put that dbt model in GitHub, get it reviewed by your colleagues, and get it productionized, all through the agent interface that we have.
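To make that loop concrete, here is a minimal sketch of an event-driven agent flow; the tool names mimic an MCP-style interface but are hypothetical stand-ins, not OpenMetadata's actual MCP tool set.

```python
# Sketch: a new-table event arrives, the agent profiles the table through
# exposed tool calls, then alerts a human reviewer.
def search_metadata(fqn: str) -> dict:
    return {"fqn": fqn, "columns": ["customer_id", "order_total"]}

def run_profiler(fqn: str) -> dict:
    return {"fqn": fqn, "row_count": 10_432}

def send_alert(channel: str, message: str) -> None:
    print(f"[{channel}] {message}")

TOOLS = {
    "search_metadata": search_metadata,
    "run_profiler": run_profiler,
    "send_alert": send_alert,
}

def on_table_created(event: dict) -> None:
    """Agent reacts to a change event using only the exposed tools."""
    table = TOOLS["search_metadata"](event["fqn"])
    profile = TOOLS["run_profiler"](table["fqn"])
    TOOLS["send_alert"](
        "data-stewards",
        f"New table {table['fqn']} profiled: {profile['row_count']} rows. Please review.",
    )

on_table_created({"type": "entityCreated", "fqn": "snowflake.sales.dim_customers"})
```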
[00:35:32] Suresh Srinivas:
Yeah. And there are two challenges that I see. The first challenge is AI working on behalf of a user. That's what MCP looks like: you can pass the right credentials and make sure that the user has access to the data, and hence they are getting answers from data they should have access to. And if they don't have the access, those answers should not be provided to the user. So these are AI apps or conversational interfaces, agents that are working on behalf of a user. But what is going to be challenging is that there are going to be these enterprise-wide AI agents. Let's take, for example, an HR AI agent that is provided access to all the HR related information: salary, this and that, and all of that. What I feel is going to be very challenging, and a challenge that we are going to see in the enterprises, is this: if you look at ChatGPT, it has consumed all the data from the internet. The only guardrails it has are that it won't answer certain questions related to some sensitivity,
or some nations' policies putting guardrails on certain answers. But ChatGPT in general doesn't care who you are. It has taken all the data on the other side, and it gives you answers to whichever question you are asking. Enterprise agents are going to be different. If an HR person is asking this HR AI app, maybe they should get the top 10 salaries and that kind of information. But if somebody else who doesn't have access to this asks the question, you cannot leak the answer.
And so some of these enterprise-wide AI agents will need to understand who is asking the question, what question is being asked, and what the policy is, what the use case is of the user who's asking. This is going to be pretty challenging. And I think we can provide the context of who the user is, and we can provide the governance policies and security policies as context to the AI agent. But it's going to be a very interesting challenge to solve in the next couple of years.
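A minimal sketch of identity-aware enforcement for such an enterprise agent: access is checked against the asker's roles before any data reaches the model. Roles, asset names, and policy shapes are hypothetical.

```python
# Sketch: the agent refuses before sensitive data enters the model's
# context window, rather than relying on the model to self-censor.
ASSET_POLICIES = {
    "hr.salaries": {"allowed_roles": {"hr_admin"}},
    "hr.org_chart": {"allowed_roles": {"hr_admin", "employee"}},
}

def answer_with(assets: list[str], user_roles: set[str]) -> str:
    for asset in assets:
        allowed = ASSET_POLICIES[asset]["allowed_roles"]
        if not (user_roles & allowed):
            # Deny up front: the data never reaches the LLM at all.
            return f"Denied: your role cannot access {asset}."
    return f"OK: answering from {', '.join(assets)}."

print(answer_with(["hr.salaries"], user_roles={"hr_admin"}))  # permitted
print(answer_with(["hr.salaries"], user_roles={"employee"}))  # refused
```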
[00:37:53] Tobias Macey:
Another aspect too is that we've largely been talking about there being two different worlds. You've got your data, which is not static, but we'll call it as such, particularly in comparison to the very dynamic nature of these AI systems. And so you're using your metadata platform for keeping track of what all of my data assets are. But another piece is that, as AI becomes a larger factor in the overall data estate for an organization, you also need to be able to keep track of and monitor: what agents do I have, what models are they using, what are the prompt versions. And I'm wondering how OpenMetadata has been extended to keep track of some of those other types of data assets and interaction information?
And what are some of the ways that you're thinking about which pieces of information need to be kept in this platform, and which pieces should live in some other system like a LangSmith?
[00:38:50] Sriharsha Chintalapani:
Yeah. I think this is where we just merged in a new feature around AI governance itself; a lot of our community has been asking for it. Pretty much every organization right now is building an AI agent, and the ground truth of AI agents relies on the data that they have. And OpenMetadata is the governance platform, the data platform, for all of that data. Now they want to know: hey, which data is getting exposed to which agent? What are the MCP servers that we are enabling across the platform that are helping our employees or whatnot? And finally, for the AI agents that they are building, how do you understand the quality of those AI agents, given the data that is getting exposed to them? So again, the problem comes back to the metadata platform. What data assets are getting exposed? What types of data are getting exposed to the LLMs?
Are we training on any data that we are storing through these LLMs? And are we giving more capabilities to the AI agent than it should have? Maybe a support agent should only talk about support questions, nothing more. So how do you ensure that? A lot of that becomes a governance challenge in itself, and it comes back to the same puzzle that we are trying to solve. And that's exactly what we are trying to enable: understanding what AI agents, MCP servers, and LLM models are being used, and then the quality and observability of those AI agents and the data that is getting into these systems.
[00:40:08] Suresh Srinivas:
In OpenMetadata, when we started, we were thinking mainly about data assets. And then from there, it has evolved to: I want to document my microservices here, because microservices are the source of a lot of events that go into, let's say, Kafka, and from there into the data lake, and I want end to end lineage. Similarly, the API gateway started becoming a kind of asset that also gets tracked in OpenMetadata, because there is data and then there are data-producing systems, all of that. Everything is now captured in OpenMetadata. Now the logical step is to document what models you have. How are they trained? How are they tested? That becomes a part of OpenMetadata.
So that's something that we are adding support for. Now, AI agents: what AI agents do you have? How do you classify them? How do you certify them, especially with the EU AI Act coming in? The metadata platform becomes a natural place to connect these AI agents: these AI agents consume this data, these AI agents do these kinds of things, this is how they are classified from a risk perspective, this is how they are tested, this testing requires this data, and all of this is done by these workflows. So it becomes a natural place for documenting AI and putting it in the context of how it works and what data it consumes within the organization.
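A sketch of what registering an AI agent as a governed asset might look like, with lineage to the data it consumes and a risk classification of the sort the EU AI Act contemplates; the entity shape and field names are hypothetical.

```python
# Sketch: an AI agent tracked like any other cataloged asset, with lineage
# to its upstream data and a documented risk class and eval suite.
from dataclasses import dataclass, field

@dataclass
class AIAgentAsset:
    name: str
    model: str
    risk_class: str  # e.g. "minimal", "limited", "high"
    consumes: list[str] = field(default_factory=list)  # upstream data assets
    eval_suite: str | None = None  # how the agent is tested

registry: dict[str, AIAgentAsset] = {}

def register(agent: AIAgentAsset) -> None:
    registry[agent.name] = agent

register(AIAgentAsset(
    name="support-assistant",
    model="gpt-4o",
    risk_class="limited",
    consumes=["zendesk.tickets", "docs.kb_articles"],
    eval_suite="support_evals_v3",
))
print(registry["support-assistant"])
```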
[00:41:38] Tobias Macey:
For people who are building AI agents, you mentioned the challenges of governance and how you can feed some of those policy details into an agentic context window. But with agents being consumers of the data and of the metadata, and taking action on that data, what are some of the ways that that strains the existing set of technical and procedural elements of governance frameworks? What are some of the ways that we extend them to account for this new style of access and the types of capabilities being unlocked where there is not a human involved? And how does that change the ways that you think about what constitutes sensitive data versus what doesn't?
[00:42:24] Sriharsha Chintalapani:
Yeah. So one of the things that we are investing in, in OpenMetadata, is the semantics of it. Anyone using a metadata graph needs to understand the relations within the data itself, and the relations between the data and the teams and people that are using it. So what we say here is not just the context, but the awareness of the data. If I'm asking from a sales domain, again going back to customer lifetime value, what exactly does that mean? If you expose all of the data that you ever have (we are talking about context here: give everything in my Snowflake, in my data warehouses or whatnot), what exactly will the LLM ground its truth on? How does it know what the customer lifetime value is? That's why we're bringing in the semantic nature of it, so it can understand that customer lifetime value is defined this way. Maybe it's a query, maybe it's a Python script that's defined somewhere, and these are the tables it depends on for calculating that metric. So that is not just a simple knowledge graph question; it's actually a semantic question. Exposing that level of semantics to the LLMs is what we think is going to solve the limitations that we are seeing in terms of hallucinations and in terms of calculating the right things from an LLM point of view. So that's where we are heading in terms of what we are building in our community.
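A sketch of a metric definition carrying that kind of precise semantics, so an LLM grounds "customer lifetime value" in an agreed formula and its upstream tables rather than guessing; all field names and the SQL are illustrative.

```python
# Sketch: a semantically defined metric with a canonical expression and
# explicit table dependencies, serialized verbatim into the LLM's context.
CLV_METRIC = {
    "name": "customer_lifetime_value",
    "domain": "Sales",
    "definition": "Projected gross margin from a customer over the relationship.",
    "expression": (
        "SELECT customer_id, SUM(order_total * margin_pct) AS clv "
        "FROM fct_orders JOIN dim_products USING (product_id) "
        "GROUP BY customer_id"
    ),
    "depends_on": ["fct_orders", "dim_products"],
    "owner": "sales-analytics",
}

def ground_prompt(metric: dict) -> str:
    """Serialize the agreed definition the LLM must use, not infer."""
    return (
        f"Metric {metric['name']} is defined as: {metric['definition']}\n"
        f"Canonical SQL: {metric['expression']}\n"
        f"Source tables: {', '.join(metric['depends_on'])}"
    )

print(ground_prompt(CLV_METRIC))
```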
[00:43:39] Suresh Srinivas:
Yeah. So, Tobias, as you were saying, enabling AI agents without a human involved: I think we are a bit far off from there. We need to actually have humans involved, develop confidence in different kinds of agents, understand what data they need to access and what data they are really accessing through audits and all of that, and ensure that the governance is enforced. That is going to be, in my opinion, initially a manual challenge. And I think, as you've seen with Claude, people are vibe coding and generating so much code; the pressure of all this generated code moves to the people who are reviewing it and making sure that it is really right and there are no security holes and things like that. I believe organizations will start small. Organizations will have human beings involved, and this human component will get strained as the number of use cases increases. And over that period, people will develop confidence about where they can automate things and where they need to pay closer attention. I feel that is an area where human scalability will be challenged.
[00:44:47] Tobias Macey:
And as you have been building the open metadata and collate platforms, what are some of the most interesting or innovative or unexpected ways that you've seen them used, particularly as we have been moving into this more agentic world?
[00:45:03] Suresh Srinivas:
Yeah. So for me, there are two hats I wear. One is a technologist; the other one is a startup founder. The startup founder is really excited about the possibilities of technology. As a technologist, I'm always worried about the hype versus reality. At an OpenMetadata meetup, we were doing a demonstration of the MCP server that we had built, and someone was demoing with Claude the capabilities of OpenMetadata and how we expose tools that can be used by Claude. I was just amazed. Today, we have built all this (and this is not just true for OpenMetadata, it's true for all the other tools as well): we have built these beautiful, simplified user interfaces where you click on glossary, you go add a term, you fill a form, and you define a business term, and then you do it 10 times for 10 other business terms. With Claude, you could actually combine the world of data, or the internet, along with LLM capabilities and OpenMetadata, where you can say: give me all the banking terms. It will give you banking terms, and then you can say: hey, I want this one, I don't want this term. With natural language, you curate your list of terms, and then you can say: add it to OpenMetadata. And automatically all these terms get created and documented, things like that. What excited me about that is that the way we are doing UI today, where somebody has a goal in their mind and then they have to understand the tool, go to the tool, click here, click there, add, and do things like that, will be short-circuited in the AI-enabled world. You have a goal in mind, you express it through language, and your work gets done. So that was an eye opener for me: the way we are developing UIs and the way we are trying to train people, a lot of that will be simplified. People can express their goal in natural language and get things done. That was a very exciting part of seeing AI in action.
[00:47:10] Sriharsha Chintalapani:
Yeah, I think from my side, it's not so much a surprise as actually getting to realize our vision, to an extent. When we started, we said unified metadata platform: we're going to build all of these things together, and all the personas in an organization should land on a single platform like this. There was healthy skepticism, but over the last four years, there's been the adoption of the platform and the use cases that everyone is running: not just, hey, I'm going to govern the data and look at the lineage, but I'm going to test it, I'm going to observe it, I'm going to enforce the policies. And we have installations with 15,000 users coming in, tons of tests created, hundreds of thousands of assets being monitored. And not only that, there are a lot of companies who come and say: we had 16,000 dashboards and nobody knew what they were; with the help of OpenMetadata, we were able to clean all of those up. Now that's mental clarity. You can't easily put a dollar price on that kind of productivity; maybe you can. But it's one thing to have the vision, and actually seeing it in practice, companies taking advantage of a platform like this and getting better at data, is really, really satisfying.
[00:48:26] Suresh Srinivas:
On that note, Tobias, last time we spoke, we were building this unified platform on top of this metadata, and like Harsha was saying, people were skeptical. People thought these were different product categories. But our argument was that this is a single workflow. You discover the data. You understand the data. You see if the data quality is good enough for you to use the data. Are there any data observability or pipeline failures happening? And then you want to understand the policies associated with access and the governance requirements and all of that. All of this is required in one workflow, and that's the reason why we made it into a unified workflow, depending on the persona that we are building it for. Fast forward four years, and now every tool that used to call itself a data catalog is a unified platform for discovery, observability, and governance. The observability platforms are saying we are also a data catalog, and the quality tools are saying we are a data catalog too. This consolidation, based on how users' workflows unfold and how to support them, has happened naturally for us. There was no category creation or anything like that. Many different tools have now come around to this: unified is the right way to do it. Most importantly, our users have begun to expect it: I don't want three different tools; this is the right way to do it; this is a seamless way of getting my workflows done. So that's a big realization that we have had in the last four years. We feel vindicated.
[00:49:48] Tobias Macey:
And in your experience of working in this space, building this platform, and growing the community around it, what are some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:50:30] Sriharsha Chintalapani:
Yeah. I think I can go first. The great learning is from OpenMetadata itself: the community that has grown and the support they have shown. You wouldn't expect that just by starting a project in open source. Our vision and our mission statement really resonated with the community, and the data problems that we have seen are not unique to our experience; they are shared problems that the entire community feels. That resonated, because a lot of people are coming in, using it, giving feedback, and keeping up the momentum. You can always have the idea in your mind that "hey, we're going to build this, the community will come, and all things will be great," but in reality, making that happen is extremely satisfying. The challenge is what it is: you're developing this product in open source, every organization is unique, and you want to build a platform that can fit into organizations of different sizes. To an extent that has resonated well with the OpenMetadata community too, because we are based on schemas and standards, which helped us, and we are based on a simplified deployment model, roughly three to four components that you can deploy across clouds. That combination actually helped get OpenMetadata where it is. All the choices that we made, the design choices, the architecture choices, everything else, are actually playing really well in the world of AI and LLMs.
[00:51:53] Suresh Srinivas:
Building from the ground up is actually something that has helped us. Tobias, I missed that question. What was the question you asked?
[00:52:00] Tobias Macey:
Just what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building this metadata platform and growing the community around it and just working in this space?
[00:52:11] Suresh Srinivas:
Yeah. To me, ideas are not bogged down by physics, but execution is where you meet reality. We thought we were building an extensible platform, but we are building it for real data practitioners, and the kinds of ideas and refinement, and the encouragement, that we have gotten from the community have been great. That said, data is a very complex landscape. Building 100-plus connectors and improving them with the help of the community has been challenging and a huge learning experience.
[00:52:49] Tobias Macey:
And for people who are building these agentic systems or evaluating something like a metadata platform as a component of that, what are the cases where you would say that OpenMetadata or Collate are the wrong choice?
[00:53:05] Suresh Srinivas:
Tobias, that's the wrong question. OpenMetadata is always the right choice. But in terms of organizational expectations, we have seen two kinds of organizations. In the first, the data landscape is complex: there are many systems, many different people, many different use cases. Their journey always starts with discovery, and from discovery, data quality and data observability become the way to transition into a more mature data organization. The second kind doesn't actually need to catalog in the classic sense of the catalog definition, because they're a small organization; they understand what data they have, and you could probably put it in a spreadsheet. For those organizations, starting with discovery is not the right choice. They should start with adding semantic meaning to their data and adding data quality and observability; a data catalog is not a great use case for them. The third thing I see as challenging is that a lot of people think that once you have a tool like this, suddenly your AI agents are enabled, your governance is happening, and your discovery is going smoothly. Beyond the tool, there has to be organizational buy-in, and there has to be organizational leadership that is invested in this, because this does require investment. It is as much about the culture and the processes of the organization incorporating this as it is about the tool and its capabilities. That is where we find challenges.
[00:54:47] Tobias Macey:
And as the future continues to move faster than we want it to and continues to be unexpected, what are some of the things you have planned for the near to medium term for OpenMetadata and Collate to continue to adapt and support these new and ever-shifting use cases?
[00:55:06] Suresh Srinivas:
We build so many features into OpenMetadata every release. I'm worried Harsha will take a long time rattling off all the things we are doing near term and long term. But Harsha, go ahead.
[00:55:18] Sriharsha Chintalapani:
Yeah, I think we are in a unique space. We believe that for any automation you're trying to build, AI agents or any automation around data, you need a metadata graph; you need a metadata platform like OpenMetadata. One of the things that we introduced recently is semantics on top of OpenMetadata: what is the meaning of the metadata you're collecting, and how does it relate to your business? We are continuing to invest in that going into 2026. Another thing that we are excited about is AI governance itself. I've spoken to many organizations, and everyone is building AI agents. How do you govern all of that? Organizations are betting on their engineers to build those AI agents, so there needs to be some level of governance, some level of audit that happens. We will be focusing on those aspects and on enabling OpenMetadata to govern not just the data but the LLMs as well. So many interesting things are coming, and we are excited. I don't want to sit here and list out all of these things, but we welcome all of the listeners to come join our Slack. It's an ever-bubbling, ever-busy Slack,
[00:56:18] Suresh Srinivas:
and we have community meetups. We talk at length, and we love to have data users come and say, "hey, this is the problem that we have; can OpenMetadata help solve that?" Most often the answer is yes, but we love to hear from the audience. Yeah, so from my perspective, Tobias, there have been a couple of waves that helped metadata platforms. The first one was big data itself: the big data tools and everything around them, where you needed an end-to-end understanding of what your data stack looks like, what you have, and what is going on in there. The second wave was GDPR: catalogs were the only place that had a list of the data you have, so they naturally became the place for governance.
What I believe is that AI is the third wave, where metadata platforms are going to be super important. You are seeing it with a lot of metadata tools being consolidated into larger platforms, a lot of acquisitions; all of that clearly indicates that metadata platforms are going to play a significant role in this wave of AI. Today, a lot of that is because they enable context for AI. For us, context has always been important, not just for AI but also for people. For AI, we believe semantics is going to be very important, and you are going to hear "semantics" a lot, like we talked about before. But how are you going to do the semantics? In my opinion, there is a lot of work that happened in the semantic web to bring meaning to the web, and a lot of work that has been done on ontologies and knowledge graphs. What I've seen is that big data has remained in its own silo and the knowledge graph people have remained in theirs. The knowledge graphs and ontologies will be married to the big data, the structured data, through metadata, and that evolution is what we are excited by. When you say semantics, people are talking about everything from some JSON blob all the way to RDF ontologies.
We believe RDF ontologies are going to be the underpinning of the semantics. With that, you are going to marry the business concepts, the meaning, to the whole web of concepts that already exists: they exist in DCAT and DPROD for metadata, and in terms of business concepts there is schema.org and so many other ontologies that exist today. So we are doing the work of marrying them. You have seen glossaries and business terms as a way of trying to bring conceptual meaning into the data; now they will actually have an ontological underpinning. That's what we are doing when we say semantics is important: the semantics has to be something that AI can reason about, find relationships in, and build understanding from, and for that, we are building RDF ontological support.
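[Editor's note: a toy illustration, using rdflib, of what that "ontological underpinning" could look like: a business glossary term expressed as RDF and linked to vocabularies the web already understands (SKOS, DCAT, schema.org). The openmetadata-style namespace and the schema.org mapping are made up for illustration.]

```python
# Sketch: a glossary term as a SKOS concept, mapped to shared vocabularies
# and attached to a dataset via DCAT, so an agent can reason over meaning.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS, DCAT

SCHEMA = Namespace("https://schema.org/")
OM = Namespace("https://example.com/openmetadata/")  # hypothetical namespace

g = Graph()
term = OM["glossary/Banking/AnnualPercentageRate"]

# The business term as a SKOS concept with a human-readable definition...
g.add((term, RDF.type, SKOS.Concept))
g.add((term, SKOS.prefLabel, Literal("Annual Percentage Rate", lang="en")))
g.add((term, SKOS.definition, Literal("Yearly cost of borrowing, with fees.")))

# ...mapped to a concept in a shared web vocabulary (illustrative mapping)...
g.add((term, SKOS.closeMatch, SCHEMA.interestRate))

# ...and attached to a concrete dataset via DCAT, so an agent can reason from
# business meaning down to the physical data that carries it.
dataset = OM["dataset/loans"]
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCAT.theme, term))

print(g.serialize(format="turtle"))
```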
[00:59:23] Tobias Macey:
Are there any other aspects of the work that you're doing on OpenMetadata, or this overall category of metadata platforms as a key context source for LLMs and agentic use cases, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:39] Sriharsha Chintalapani:
Piggybacking on what Suresh was saying, I think the most important thing is the semantics: how do we make the LLM aware of the data itself? That is the key aspect. I think we did cover a good amount of it, so I don't know, Suresh, if there's anything else. Yeah, we have comprehensive
[00:59:55] Suresh Srinivas:
metadata, right? In OpenMetadata we cover every aspect of your data landscape, along with other things that play a part in it, like microservices, API gateways, and AI agents. It's a constant evolution. Today, enabling AI with data, with the right context and meaning, is where we are focused. But this world is changing so fast. Four years ago when we first talked, even two years ago, Gen AI was not even in our discussion. The innovation is accelerating faster and faster; what was done in four years probably needs to be done in one year, maybe in half a year in the future. We move at a high velocity as a project, and we are excited to see how things evolve and plan for that evolution.
[01:00:46] Sriharsha Chintalapani:
And a couple of sentences to add there: so far we talked about metadata and how we are bringing semantics into the data and the agents. The agents that teams are going to build will actually act on the data itself: creating data models in dbt, creating dashboards and dashboard services, everything through a natural language interface. Suresh touched on this earlier; where we are getting to with the power of AI is "I have an idea in mind, how do I accomplish it?" Some of the users coming to OpenMetadata today say, "hey, our team doesn't know what a table means. They know what a metric means, a metric they care about, like a sales forecast, and that is what they will ask about." How do you map that and give the answer? Not just an answer saying "hey, for sales forecast you need to use these tables," but actually going and figuring it out and giving the results back to the users, all the way there. Now at that point, and Suresh doesn't love this concept from me, I term it a data operating system. It's enabling all of these things. An operating system knows where the files are, what the commands are, and everything else. That's what we are becoming, to an extent, through the power of the semantics and the metadata that we have, to enable those end-to-end use cases. It's no longer just about finding the tables, cataloging the tables, or the dashboards or whatever.
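[Editor's note: a self-contained toy of that "metric to answer" flow: the user asks about a business metric, the platform resolves it to its backing tables through the metadata graph, and the agent answers directly instead of listing tables. The in-memory mappings and data below are stand-ins for real metadata, lineage, and warehouse queries.]

```python
# Sketch: resolve a business metric to tables, then answer with results.
METRIC_TO_TABLES = {  # glossary term -> physical tables (via lineage)
    "sales forecast": ["analytics.fct_sales_forecast", "analytics.dim_region"],
}
SAMPLE_ROWS = {  # stand-in for querying the warehouse as the asking user
    "analytics.fct_sales_forecast": [("EMEA", 1_200_000), ("AMER", 2_400_000)],
}

def answer_metric_question(metric: str) -> str:
    tables = METRIC_TO_TABLES.get(metric.lower())
    if not tables:
        return f"No tables are mapped to the metric '{metric}' yet."
    rows = SAMPLE_ROWS.get(tables[0], [])
    total = sum(value for _, value in rows)
    return f"'{metric}' is backed by {tables}; current total: {total:,}."

print(answer_metric_question("Sales Forecast"))
```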
[01:02:02] Tobias Macey:
Alright. Well, for anybody who wants to follow along with the both of you and the work that you're doing, I'll have you add your preferred contact information to the show notes. And as usual, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:19] Sriharsha Chintalapani:
The biggest gap, I believe, is not having OpenMetadata installed in your infrastructure. I'll start with that. But really, I think as data teams we are more focused on our isolated problems, as we touched on earlier. My role as a data engineer is about pipelines; maybe Suresh as a data analyst cares about dashboards; then there are business users and everything else. I don't think organizations are there yet, collectively thinking about how to manage all of these things together with an outcome-based approach. That's what we are enabling with OpenMetadata: understanding the landscape and enabling users to actually get the value.
[01:02:58] Suresh Srinivas:
Yeah, to me, the answer is twofold. Over the last ten or fifteen years, there has been an obsession with tools, and there are so many tools that people use in their data ecosystem that the bigger picture is lost. People worry about query optimization, this and that, but there are other optimizations possible, human efficiency optimizations, that will probably give you a lot more reward. So the big picture is lost. There has been a lot of tool obsession, and of course a lot of funding has gone into tools, which has created all these siloed, narrow tools that take you away from the big picture of why you are doing this. The bigger picture, the business outcomes, the business value creation, is lost; we have gotten bogged down in the tools, and that needs to change.
The natural next step is that there is going to be a lot of tool consolidation. There is a cycle: some tool that does something really well comes out, you end up using it, but over a period of time people understand it, and then it gets consolidated into more end-to-end workflows for data personas. So I feel there is going to be a lot of consolidation, and that consolidation will also be accelerated by AI agents. AI agents can integrate with multiple tools, so maybe you can replace a tool without impacting the outcome you are creating. These are the ways tools are going to be consolidated, and the data landscape, the data ecosystem, becomes a lot simpler. Hopefully, with agents enabling data practitioners to get, let's say, 80% of their work done, the higher-level thinking about business outcomes becomes important. Instead of the lower-level obsession of "I want to delete the data, I want to clean the data, I want to document the data," we elevate ourselves to really thinking business outcomes first.
[01:05:03] Tobias Macey:
Alright, well thank you both for taking the time today to join me and share the work that you've been doing for the past several years, all of the great progress that you've made on the OpenMetadata and Collate systems, and the work that you're doing to help empower this next generation of use cases. It's definitely a very interesting and important problem area, so I appreciate all of the focus that you're bringing to that. And I hope you enjoy the rest of your day. Thanks, Tobias. Thanks, Tobias. Thanks for having us. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Intro and episode setup
Guests return: Suresh Srinivas and Sriharsha Chintalapani
Suresh's background and OpenMetadata origins
Sriharsha's background and Collate role
Why metadata context matters for modern data and AI
OpenMetadata's growth, community, and releases
LLMs meet metadata: context for enterprise AI
From human-first catalogs to AI-enabled workflows
Scaling agentic workloads and minimizing data warehouse load
Using AI agents to document, tier, and test data automatically
Bridging governance and engineering with agents
Context engineering: parsimony, semantics, and MCP
From discovery to action: agents, MCP server, and toolchain orchestration
Enterprise AI guardrails: identity, policy, and leakage risks
AI governance in OpenMetadata: tracking agents, models, prompts
Evolving asset scope: services, APIs, models, and AI agents
Semantics over context: defining business meaning for LLMs
Human-in-the-loop reality and confidence building
Agentic UX: natural language over clicks, and unified workflows
Market consolidation towards unified metadata platforms
Community lessons, architectural choices, and scaling connectors
When not to start with a catalog and org readiness
Roadmap: semantics, AI governance, and RDF ontologies
Wrapping up: gaps in data tooling, tool obsession, and future
Closing and calls to action