Summary
In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Shinji Kim to discuss the evolving role of semantic layers in the era of AI. As they explore the challenges of managing vast data ecosystems and providing context to data users, they delve into the significance of semantic layers for AI applications. They dive into the nuances of semantic modeling, the impact of AI on data accessibility, and the importance of business logic in semantic models. Shinji shares her insights on how SelectStar is helping teams navigate these complexities, and together they cover the future of semantic modeling as a native construct in data systems. Join them for an in-depth conversation on the evolving landscape of data engineering and its intersection with AI.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Shinji Kim about the role of semantic layers in the era of AI
- Introduction
- How did you get involved in the area of data management?
- Semantic modeling gained a lot of attention ~4-5 years ago in the context of the "modern data stack". What is your motivation for revisiting that topic today?
- There are several overlapping concepts – "semantic layer," "metrics layer," "headless BI." How do you define these terms, and what are the key distinctions and overlaps?
- Do you see these concepts converging, or do they serve distinct long-term purposes?
- Data warehousing and business intelligence have been around for decades now. What new value does semantic modeling provide beyond practices like star schemas, OLAP cubes, etc.?
- What benefits does a semantic model provide when integrating your data platform into AI use cases?
- How does using AI as an interface to your analytical use cases differ from powering customer-facing AI applications with your data?
- The effort to create and maintain a set of semantic models is non-zero. What role can LLMs play in helping to propose and construct those models?
- For teams who have already invested in building this capability, what additional context and metadata is necessary to provide guidance to LLMs when working with their models?
- What's the most effective way to create a semantic layer without turning it into a massive project?
- There are several technologies available for building and serving these models. What are the selection criteria that you recommend for teams who are starting down this path?
- What are the most interesting, innovative, or unexpected ways that you have seen semantic models used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with semantic modeling?
- When is semantic modeling the wrong choice?
- What do you predict for the future of semantic modeling?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- SelectStar
- Sun Microsystems
- Markov Chain Monte Carlo
- Semantic Modeling
- Semantic Layer
- Metrics Layer
- Headless BI
- Cube
- AtScale
- Star Schema
- Data Vault
- OLAP Cube
- RAG == Retrieval Augmented Generation
- KNN == K-Nearest Neighbors
- HNSW == Hierarchical Navigable Small World
- dbt Metrics Layer
- Soda Data
- LookML
- Hex
- PowerBI
- Tableau
- Semantic View (Snowflake)
- Databricks Genie
- Snowflake Cortex Analyst
- Malloy
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey, and today I'd like to welcome back Shinji Kim to talk about the role that semantic layers are playing in the era of AI. So, Shinji, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:00:59] Shinji Kim:
Sure. Well, thanks for having me back here, Tobias. Really excited to chat with you again. So hi, everyone. I'm Shinji Kim. I'm the founder and CEO of SelectStar. SelectStar is an automated data discovery and governance platform for cloud data warehouses, data lakes, and pretty much all of your data ecosystem. That's what we do primarily, and through the Data Engineering Podcast we've chatted in the past about the overall need for data discovery and how it can be applied for both data democratization and data governance. I'm excited to dive into more of the other use cases that we are starting to find in this world of metadata and metadata management.
[00:01:46] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:50] Shinji Kim:
Yes. It was a long time ago. Back in 2007, I was a data scientist, or, at the time, the title was statistical analyst, at Sun Microsystems. I built Markov chain Monte Carlo models for sales forecasting, and also almost like interactive dashboards that sales and operations teams could use for projecting their quarter-by-quarter sales models. That's how I got into data. And I guess the rest is history. I've worked with a number of startups as a product manager as well as a software engineer. Prior to Select Star, I started a company called Concord Systems focused on distributed stream processing, which was acquired by Akamai back in 2016, and I started Select Star five years ago. I would say the biggest part that got me to where I am, especially related to data discovery, data governance, and data democratization, is that I've noticed a lot of companies, especially enterprises, spending a lot of effort and resources on collecting, storing, and running compute on data, building up all the data lakes and systems and buying all the tools. But the end users, whether that's a data analyst or product managers, folks that want to gain answers from their data or build a new product on top of the data, usually end up taking weeks to find the right data that they could use for those purposes. That's why I started SelectStar five years ago, and data discovery, providing the context of your data and your data ecosystem, has been the main capability driving the core of our product.
[00:03:44] Tobias Macey:
And the last time we talked, data discovery and metadata management were very active topics. There were a number of different companies and open source projects entering the ecosystem around that time, particularly because of the rise of the modern data stack and the number of different tools that were being brought in to work with data, the variety of data sources that were being brought into the context of data warehousing and business intelligence. And, obviously, the term modern data stack has faded from use, and a number of the companies that started around that time have either changed focus or been acquired or ceased to be in operation.
And now there's the age of AI that is adding new stressors to the different data stacks that teams are running as well as the requirements around contextual information, data discovery. And I'm curious what you've seen as some of the main shifts in the ecosystem and in your business over the period since we last spoke.
[00:04:47] Shinji Kim:
Yeah. That's a great question and also a big question. A lot of things have happened in the last two years in the data ecosystem and the specific area that we play in. First and foremost, we do see a lot of data teams that have become more efficient in their operations, from an operational and tooling perspective. Some of the customers that we have worked with over the last two years have really gone from a very centralized data team to a more decentralized one, and we've also seen the shift the other way as well, from more decentralization to more of a hybrid. So overall, we are seeing a shift in the interface between the data engineering teams, analytics engineering teams, and data analyst slash BI teams, and how they work together.
We are starting to see that. This is also driven by a lot of advancements and new features that are being added and consolidated on the platform side. So Snowflake, Databricks, dbt, a lot of these platforms are starting to provide a lot more capabilities than just the data warehousing or transformation capabilities, such as documentation, data catalog, and lineage. A lot of these are starting to get merged into those platforms as features. So there is definitely a market shift, from independent vendors and companies having best-of-breed tools toward getting more of a platform's support, and that has definitely come up a number of times. To bubble it back up to where we stand at Select Star: from the beginning, we believed that providing this type of single source of truth for your metadata, how your data is being created, transformed, and utilized within your organization, is not something where you should only get the information from one platform. Most companies use multiple platforms and require cross-platform visibility, a way to manage and gain insights across platforms as your whole data ecosystem. So we are starting to provide a lot more capabilities where you can truly manage and govern that information across platforms and also share certain metadata from one platform to another. So we can be the glue or the bridge, and you as an end user just have to work with one platform from the metadata management perspective. That's one part of it. The other big part that I left out here was AI.
There are a ton more services and products that I see in the market, from AI analysts or AI data engineers to a lot of copilot-type features, that we are definitely seeing a lot of. From the SelectStar perspective, we also had a lot of updates around AI, from automatically documenting all of your data assets without you having to lift a finger, while still letting you merge and easily approve what the right documentation is, to providing you with an AI assistant that can do semantic search, answer any questions that you might have about your data, create SQL queries, or edit the SQL queries that you're trying to execute but may be blocked on. And especially with AI really getting better at understanding natural language and being able to do more direct translation to the code and SQL that we use day to day, I am starting to see this expansion in how data is starting to get leveraged even beyond the data teams. So this is definitely one area where we are starting to see a lot of traction, and we have a lot of features that we've built towards supporting this true data democratization and self-service analytics, to enable more people to use data.
[00:09:15] Tobias Macey:
Semantic modeling as a concept and as a term gained a lot of attention, particularly around four or five years ago with the growth of the modern data stack and the idea that you wanted to have one canonical source of truth for the key business metrics that previously lived in the BI system, and now you wanted to be able to use them across different data clients or data consumption use cases. And there were also a number of overlapping terms around that, where there was the semantic layer, the metrics layer, headless BI. And I'm wondering what you see as the overlap across those terms and whether there's any notable distinction between them as far as the actual application of those ideas.
[00:10:04] Shinji Kim:
I guess to start with the motivation side of the semantic layer, like you mentioned, there is the part around, we can virtualize all the data sources and you can use one thing to query data. I think that is definitely still there, and there are a lot of companies that do this, not always through a semantic layer, but through ways to just translate SQL, because most of the time those queries are primarily designed for physical data querying. I would say with the modern data stack, the part that has really bubbled up is defining that single source of truth metric. When you say revenue and when I say revenue, are we actually talking about the same number? Do we actually get the same result? And I think that's definitely one of the reasons why a lot of companies wanted to implement or have a semantic layer. Now today, and what I've been seeing in the last three to six months or so, is teams starting to really leverage a semantic layer for AI analysts or AI agents to be able to provide better analysis or text-to-SQL from the business perspective. Pre semantic layer, with the LLMs, and we've seen this a lot in Select Star, customers ask a very semantic type of question, and we give them a SQL query to run, and that's all great. But it only gets you so far, maybe 75 to 80%.
And that's mainly because, and we've thought about this a lot: why is that, what is missing? We have all the metadata of the customer, and we also know, out of, let's say, 10 different orders tables, which table is the right one to use and which column is the right one to use, because we can see the previous query history to determine which ones are being utilized the most, by whom, their query patterns, so on and so forth. The part that I've noticed is that as you get to the business-level questions, there's a lot more nuance underneath the question that's not always defined in the SQL layer or the physical data layer. These live in the logical layer: what do we mean when we say active in active users? When somebody asks a question like, what is the total number of contracts that are pending this quarter, an LLM has to understand and decide whether it should get pending contracts from a defined column called contracts pending, or whether it needs to look at a status column and do an aggregation based on the pending value. Those are very direct and simple reasons why having some of this definition and formula underneath a semantic layer can really be helpful, because those are not necessarily defined. And most of the time, when you build your data warehouse and your physical data models, you're thinking about reusability of data and the ways that data can be joined and queried together, not necessarily satisfying every single business question that could be asked and driven on that side. So this is a finding that I had recently.
That's why a semantic model is an important layer to have if you want to invest in an AI analyst that can generate and execute queries on behalf of business users. Sorry, that got a little bit long, but you also asked about the difference between semantic layer, metrics layer, and headless BI, and what that even means. I see them all as different but similar; it's all around that area of semantic models. The way that I think about it is that semantic models are the logical model of data. Most of the time, these are entity-based models, and entity-based modeling is different from physical entity modeling or Kimball modeling. In a semantic model, it is a lot more important that things are named in a unique manner, because for each field you will have to define not just a join condition or primary key and foreign key; sometimes you might want to define whether a relationship is one-to-many or many-to-many, that type of relationship. But that's just the model side. The semantic layer would be the implementation of those models. This can be done with the dbt semantic layer or Cube or AtScale; those are all semantic layer companies and tooling anyone can use to implement semantic models. And then I think you also mentioned the metrics layer. I think the metrics layer is almost the same as the semantic layer, but maybe focused on just calculated metrics, so you see primarily the core KPIs and the ones that always have some aggregation. I recently found out that in Snowflake's definition of metrics in their semantic view or semantic model YAML files, it is mandatory for any metric to always have an aggregation; if not, then it should be a measure with an expression, for example. So some people may get into that difference as well. And then last but not least, headless BI. I feel like I heard a lot more about headless BI during the modern data stack era, but headless BI is really just BI capabilities, like querying or exploration of data analysis, exposed as an API instead of having a visual UI that's tied to it. The biggest part about headless BI that I think about is that it can still be a BI, but there is a very clear separation of concerns between the visualization layer and the data layer with the semantic models.
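To make the metric-versus-measure distinction and the entity structure concrete, here is a minimal sketch, not taken from the episode, of how a semantic model might declare an entity with relationships, dimensions, a measure that carries only an expression, and a metric that must carry an aggregation. All table and field names are hypothetical, and the layout is illustrative rather than any vendor's exact schema.

```python
# A minimal, hypothetical semantic model: an entity with relationships,
# dimensions, a measure (expression only), and a metric (must aggregate).
# Table and column names are illustrative, not from any real schema.
import yaml  # pip install pyyaml

semantic_model = {
    "entities": [
        {
            "name": "contracts",
            "table": "analytics.fct_contracts",  # hypothetical physical table
            "primary_key": "contract_id",
            "relationships": [
                {"to": "customers", "type": "many_to_one",
                 "join_on": "contracts.customer_id = customers.customer_id"},
            ],
            "dimensions": [
                {"name": "status", "expr": "status",
                 "sample_values": ["pending", "active", "closed"]},
                {"name": "signed_quarter", "type": "time",
                 "expr": "DATE_TRUNC('quarter', signed_at)"},
            ],
            "measures": [
                # A measure carries an expression but no aggregation of its own.
                {"name": "contract_value", "expr": "amount_usd"},
            ],
            "metrics": [
                # A metric always carries an aggregation, so an AI analyst does
                # not have to guess whether "pending contracts" is a column
                # or a filter plus a count.
                {"name": "pending_contracts",
                 "aggregation": "count",
                 "expr": "contract_id",
                 "filter": "status = 'pending'",
                 "synonyms": ["open contracts", "contracts in flight"]},
            ],
        }
    ]
}

print(yaml.safe_dump(semantic_model, sort_keys=False))
```

Rendering the definition as YAML mirrors the file-based workflow described for tools like the dbt semantic layer, Cube, and Snowflake's semantic model files.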
[00:16:24] Tobias Macey:
And the other interesting aspect of all of these concepts of semantic modeling, semantic layer, headless BI, etcetera, is that data warehousing and business intelligence as a use case and an industry goal have been around for decades at this point. There are many established patterns for doing the data modeling, such as star schema, data vault, etcetera, as well as methods for being able to gain better performance in the form of things like OLAP cubes. All of those are intended to be built around the business entities, business objects, business concerns. And I'm curious what are the differences that are added on or the new capabilities that are introduced by virtue of using these semantic models or semantic layers that sit on top of the data warehouse and one level above the core star schema, data vault schema, etcetera?
[00:17:29] Shinji Kim:
Yeah. I would say the core difference that the semantic layer and semantic modeling really bring is the business logic perspective of how metrics should be calculated and what specific dimensions and time dimensions they can be applied to. It is technically doable from building a star schema or others too, but what we see is that it still happens on the set of physical data models, and the naming of things or the way that things get joined usually requires some level of aggregation or filtering, because the end dimension or the end measure is a representation of some type of business KPI or metric. So that's really the main difference as I would define it. And then there is the part around entities. If we think about data modeling, it really comes from understanding, first, the sources that we define, and then we may pull some of that source data to build the original core data models so that they can be joined and queried together to get the answers. At the same time, entity models, from the semantic layer perspective, usually come from how the business looks at data. So it starts from, what is revenue, just as an example. And within revenue, how do we define it? Are we doing it based on region or product line? And what are some of the exceptions that we might have to put in? It's driven by the business process and the reporting that you may need to do. So it's approaching from the other end of the spectrum, and that's why the entities usually end up looking a little bit different than the physical models they get designed from. I've definitely seen companies that do this without necessarily a quote unquote separate semantic layer like YAML files.
They've implemented their own views and tables and say, this is kind of our semantic model, and these are the semantic model tables that we should use for BI purposes, and so on and so forth.
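As a rough illustration of that hand-rolled pattern, the sketch below shows what such a business-facing semantic view might look like, with entirely hypothetical table and column names; the business logic (entity grain, join, and the definition of a pending contract) is baked into a warehouse view that BI tools and AI analysts can query directly, rather than living in a separate YAML file.

```python
# Hypothetical example of a "semantic model as a view": business logic
# (grain, join, and the definition of a pending contract) lives in SQL
# that BI tools and AI analysts can query directly.
SEMANTIC_CONTRACTS_VIEW = """
CREATE OR REPLACE VIEW semantic.contracts AS
SELECT
    c.contract_id,
    cust.region                                      AS customer_region,
    DATE_TRUNC('quarter', c.signed_at)               AS signed_quarter,
    c.amount_usd                                     AS contract_value,
    CASE WHEN c.status = 'pending' THEN 1 ELSE 0 END AS is_pending
FROM analytics.fct_contracts c
JOIN analytics.dim_customers cust
  ON c.customer_id = cust.customer_id
"""

if __name__ == "__main__":
    # In practice this would be executed against the warehouse through its
    # Python connector; printing keeps the sketch self-contained.
    print(SEMANTIC_CONTRACTS_VIEW)
```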
[00:19:53] Tobias Macey:
And now, bringing us back around to the role of AI in all of this modeling, the additional layers, the ways that we need to think about the presentation and access of data: what are some of the ways that building these semantic models helps when brought into the context of LLM use cases? Whether that is the English-to-SQL transformations, where everybody wants to be able to just talk to their BI, or being able to use your existing data assets to power things like RAG use cases, where you want to be able to feed the appropriate contextual information to the LLM at request time. And what are some of the ways that you need to think about the additional attributes that you want to feed into your semantic model when you are building it with AI in mind?
[00:20:42] Shinji Kim:
I think an easy way to think about the semantic layer for AI is to think of it as almost your configuration and guardrails that you're providing the AI so it knows how certain things should be calculated, rather than letting it infer from all the raw metadata and queries that you might have had in the past. I think that's the main difference. If you have those metrics, dimensions, and expressions of those relationships defined within the semantic model and the semantic layer, it's a lot easier and more straightforward for AI to just refer to that. And I'm not saying that it's impossible with RAG systems, having just the AI run on top of your metadata. Without a semantic layer, if you already have a really clear data model, you may not actually need one. But most of the time you're working on top of very messy data, where you get to a point where you're not sure which tables should be the certified tables for AI to query, because you're not sure if you've included all of them, and you might be feeling like, oh, I'm missing something, or I am including something that might introduce noise. So the semantic layer just gives you that clearer direction and guardrails. Almost like when you're using ChatGPT: if I'm asking it to write an email or paragraph or blog post, if I give it more structure, or the purpose, or the audience who I'm writing it for, it will give me a much better result. So I see the semantic layer playing that type of role for AI analysts.
At Select Star, as we were testing the semantic layer specifically for AI analysts, we've done a ton of iterations with Cortex Analyst on Snowflake in particular, and we've seen step-change differences in quality when you provide the semantic model, especially when you can provide a more complete semantic model that has not just the metadata, but also the relationships, which are the primary keys and foreign keys, what the join conditions actually look like, as well as which are the dimensions, time dimensions, and metrics, how the aggregations happen, and sample values. Things like this really make the end result a lot more accurate and closer to what you really intended than leaving the AI to do it as is. I think there is definitely some benefit of having just the RAG system as well, but when you are trying to get from 80% to 95-98%, having these definitions will play a big role.
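One way to picture that configuration-and-guardrails role is a small prompt-assembly step: instead of letting the model infer everything from raw metadata, the semantic model is serialized into the context handed to the LLM. The sketch below is a generic illustration with hypothetical field names, not Select Star's or Snowflake's actual implementation.

```python
# Hypothetical sketch: turn a semantic-model dict (tables, joins, dimensions,
# metrics, sample values) into a context block for a text-to-SQL prompt.
from typing import Any

def render_semantic_context(model: dict[str, Any]) -> str:
    lines: list[str] = []
    for entity in model.get("entities", []):
        lines.append(f"Entity {entity['name']} (table {entity['table']}, "
                     f"primary key {entity['primary_key']}):")
        for rel in entity.get("relationships", []):
            lines.append(f"  join {rel['type']}: {rel['join_on']}")
        for dim in entity.get("dimensions", []):
            samples = ", ".join(dim.get("sample_values", [])) or "n/a"
            lines.append(f"  dimension {dim['name']} = {dim['expr']} "
                         f"(sample values: {samples})")
        for metric in entity.get("metrics", []):
            lines.append(f"  metric {metric['name']} = "
                         f"{metric['aggregation'].upper()}({metric['expr']}) "
                         f"WHERE {metric.get('filter', 'TRUE')}")
    return "\n".join(lines)

def build_prompt(question: str, model: dict[str, Any]) -> str:
    # The LLM is told to answer only in terms of the declared entities,
    # dimensions, and metrics -- the "guardrails" described above.
    return (
        "Answer with a single SQL query.\n"
        "Use only the entities, joins, dimensions, and metrics below.\n\n"
        f"{render_semantic_context(model)}\n\nQuestion: {question}\n"
    )

# Example: build_prompt("How many contracts are pending this quarter?", semantic_model)
```

The point is simply that every join path, aggregation, and sample value the model declares is one fewer thing the LLM has to guess.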
[00:23:43] Tobias Macey:
And then another angle to the question of RAG and semantic models is the fact that virtually every database at this point has added some sort of vector storage capability and vector indexing, obviously at varying levels of capability and varying feature sets. But I'm curious how you've seen that additional data type incorporated into the semantic definitions, and ways that you're using it as an additional avenue for LLMs and AI systems to access the different semantic models, or as a starting point for accessing them and then understanding: okay, either, yes, this is exactly the model I want, and the k-nearest neighbor search or HNSW gave me the thing that was most applicable, or, hey, I landed on this model, but now that I see the additional contextual information, it's actually not what I want, and I need to start over.
[00:24:44] Shinji Kim:
Yeah. I think that's a very interesting point, and it is possible to, I guess, generate the semantic layer from the vector model that your database is providing. At the same time, the part that we found the most accurate and interesting is when you can bring in any historical usage data, whether that is coming primarily from analysts' SELECT queries or from the usage on the BI side. So I think that is definitely one way to get there. But again, most of the time, what defines the semantic layer, which is primarily the business logic side, is not always really parsable from just looking at the metadata that databases have.
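For the retrieval angle Tobias raises, one plausible routing step is a k-nearest-neighbor lookup over embedded descriptions of the available semantic models before any SQL is generated. This is a generic sketch under that assumption; the `embed` stub is a placeholder for whatever embedding call your vector-capable database or provider exposes, not a real API.

```python
# Hypothetical routing step: pick the semantic model whose description is
# closest to the user's question, then let the LLM work within that model.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: replace with your database's or provider's
    # embedding call. A bag-of-characters vector keeps the sketch runnable.
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def top_k_models(question: str, model_descriptions: dict[str, str], k: int = 3):
    q = embed(question)
    scored = [
        (name, float(np.dot(q, embed(desc))))  # cosine similarity (unit vectors)
        for name, desc in model_descriptions.items()
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

if __name__ == "__main__":
    descriptions = {
        "contracts": "Signed and pending customer contracts, value, status, quarter",
        "web_traffic": "Page views, sessions, and conversion events by channel",
    }
    print(top_k_models("How many contracts are pending this quarter?", descriptions))
```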
[00:25:32] Tobias Macey:
And then the other aspect of building semantic models is that it is a non zero effort required to actually create and maintain them. It requires a certain amount of contextual knowledge about the business, the specifics about what the data means, how it's being applied, and some of the cases where it can be misapplied. And I'm curious how you're seeing teams incorporate LLMs into that discovery and development piece of the puzzle for being able to actually accelerate the creation and reduce some of the manual toil involved with the maintenance of those definitions.
[00:26:11] Shinji Kim:
Yeah, for sure. And I think the tooling is really starting to improve in this respect as well. But even just by using Claude or ChatGPT, we've seen teams building their semantic layer by feeding in how they define their metrics, along with the list of tables and columns that exist, to output the first version of the YAML files that they wanted to create. So constructing the models and also maintaining those models, it is possible. The part that I would say is really hard today is the continued maintenance of the models, as well as having those models stay true as they get used, which, just like any AI agents or applications, requires monitoring and evaluation along with the versions of the semantic models that you're implementing. But the part that we see as a really good place to start, and this is the part that we've been spending a lot of time on in helping our customers implement semantic models, is starting from the semantic models that you might have already implemented within your BI tool. Within your BI tools, the ones that are your certified dashboards, the dashboards that your end users and business stakeholders are using, a lot of them are already connected to the critical data elements of your data warehouses and data lakes. That will give you the subset of the tables that should be part of the semantic layer. I think that's a really great place to start, because you will also see that those tables are well maintained and have the quality checks built in. So if you can feed the LLM with that list of metadata, I think that is a good place to start. And then you can build upon that as you look at other dashboards or update it as those dashboards also evolve on the business side.
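A rough sketch of that bootstrapping idea, assuming you can already export which upstream tables each certified dashboard depends on (all names below are hypothetical): the output is a skeleton semantic model that a human or an LLM then fills in with dimensions, metrics, and relationships.

```python
# Hypothetical bootstrap: start the semantic model from the tables behind
# certified dashboards, since those are the vetted, quality-checked subset.
import yaml  # pip install pyyaml

certified_dashboards = {
    "Executive Revenue": ["analytics.fct_contracts", "analytics.dim_customers"],
    "Retention Overview": ["analytics.fct_contracts", "analytics.dim_products"],
}

def skeleton_semantic_model(dashboards: dict[str, list[str]]) -> dict:
    tables = sorted({t for tbls in dashboards.values() for t in tbls})
    return {
        "entities": [
            {
                "name": table.split(".")[-1],
                "table": table,
                "dimensions": [],  # to be filled in by a human or an LLM
                "metrics": [],     # seeded later from the dashboards' measures
            }
            for table in tables
        ]
    }

print(yaml.safe_dump(skeleton_semantic_model(certified_dashboards), sort_keys=False))
```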
[00:28:36] Tobias Macey:
The other piece of the equation when dealing with semantic layers is which underlying technology you should use for actually being able to maintain and expose the models that you define, where most of them are using some sort of YAML definition: this is the SQL query that translates to this particular metric, maybe these are the parameters that can be fed in to aggregate along different axes, etcetera. And there are different engines to be able to actually execute those. You already mentioned Cube as one of the leading ones, which is also being marketed as a headless BI system. The folks at Soda Data have a semantic layer capability.
Obviously, DBT, has acquired a, a metrics layer, and they've incorporated that as part of their product suite. I'm curious what you see as some of the useful heuristics and selection criteria for teams who are starting to evaluate how am I actually going to build and expose these semantic models.
[00:29:40] Shinji Kim:
Yeah. That's a great question. First of all, the BI side of the house most of the time does have some type of semantic model, their own semantic layer that is proprietary, so it's just hard to get that logic out. But I think there are some BI tools that allow you to export it, or at least let you maintain and see it. LookML is that case, and Hex has a way to implement or accept other semantic models. And then for Power BI and Tableau, these are things that you will have to stitch together, but through the API you should be able to get the business logic that's built underneath. And that's primarily how at SelectStar we are reverse engineering decent semantic models and business logic from the BI tool as well. Now, if we're talking about third-party semantic layers, yes, the ones that you've mentioned are the primary ones. There is also a vendor called AtScale.
I found the approach that Snowflake and Databricks are taking also very interesting. If you think about the semantic layer from the perspective of dbt and Cube, for example, these are a layer that still runs its queries and execution on top of the data lake and data warehouse. You have decoupled it out of BI, but now you put that onto more of the transformation layer, or some other place where you still have to define the YAML file, on a third party on top of the data warehouse or lakehouse that you have. What Snowflake is doing is, first, they started with the YAML file as a semantic model to be fed into Cortex Analyst. And now they are moving towards allowing users to have a semantic view that contains all of these components of a semantic layer. It's actually a view that you can create out of the previous YAML file for semantic models, but it's a view that can be queried and acts and lives like other tables and views in your schema. I thought that was really interesting because it's a lot more native in that case, and it also follows the same security measures you have, the roles and permissions and dynamic masking and tagging if you have them, and that all gets applied natively. So I thought that was a really interesting approach. Now, the semantic view right now is under public preview, so it's very early, but I think there is a really optimistic promise there if you can have that natively on the data warehouse side, and then you can also just point your BI dashboards to run on top of that. And for Databricks, I don't think there is public documentation on this yet because it's so early, but they have metrics in Unity Catalog that look very similar to something like the semantic view. I think they've also recognized the need for a semantic layer on top of Unity Catalog in order to power Databricks Genie for AI/BI.
So it will be really interesting to see what that will look like and how it really impacts AI analyst performance. One part that I think is another worthwhile mention is companies like Cube. One of the added benefits that I've seen from the semantic layer, and I think one of the reasons why dbt brought on Transform and now has a metrics layer as a proprietary feature, is that once you have that definition, almost like an OLAP cube, you can potentially cache the calculations or aggregations that are happening, because it's defined and you know it's going to get queried. And you can make those queries a lot more efficient because it's already defined that way. I think that is right now an added benefit when you have a semantic layer like Cube today, but it should be more of the table stakes, and it's one of the reasons why more companies should consider having a semantic layer in their toolchain, I would say.
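To illustrate why declared metrics make that kind of caching possible, here is a toy sketch, not how Cube or the dbt semantic layer actually implement it: because a metric's aggregation, grouping dimensions, and filters are known up front, repeated requests can be served from a cache keyed on exactly those declared pieces.

```python
# Toy illustration: because a metric's aggregation, dimensions, and filters
# are declared up front, the serving layer can precompute and cache them.
from typing import Callable

class MetricCache:
    def __init__(self) -> None:
        self._cache: dict[tuple, object] = {}

    def get_or_compute(self, metric: str, dimensions: tuple[str, ...],
                       filters: tuple[str, ...], compute: Callable[[], object]):
        key = (metric, dimensions, filters)
        if key not in self._cache:
            # First request runs the (expensive) warehouse query...
            self._cache[key] = compute()
        # ...subsequent identical requests are served from the cache.
        return self._cache[key]

cache = MetricCache()
result = cache.get_or_compute(
    "pending_contracts", ("customer_region",), ("signed_quarter = '2025-Q1'",),
    compute=lambda: {"EMEA": 42, "AMER": 57},  # stand-in for a real query result
)
print(result)
```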
[00:34:11] Tobias Macey:
One of the things that you mentioned earlier as far as the work that you're doing when working with systems like Tableau or Power BI is that you said that because you're able to ingest the metadata from those business intelligence systems, you can reverse engineer some of the business logic that goes into the underlying dashboards, etcetera. And I'm curious how you're thinking about that as a potential on ramp for data teams who want to more explicitly define these semantic models in a separate technology layer that is divorced from the data visualization system so that they can expose it to more use cases and expose it to more of these LLMs, etcetera?
[00:34:51] Shinji Kim:
Yeah. That is definitely the main use case that we're trying to support. So today, we are actually providing the YAML file to the end customers so that they can modify it as they want or put it in their GitHub. The role that we initially play today is to give you that quick start and bootstrap the semantic layer without having to do the hard work of looking at all the physical models to figure out what the metrics and dimensions should be, and also filling in the details to make it work. Where we see this going, as we start putting more support into providing this semantic layer YAML file in many different formats, whether that's for the dbt semantic layer, Cube, or Databricks, and so on, is continuing to also provide a system that can maintain and update the semantic layer portion.
This part, I think, is important. On the BI side, one of the things you end up running into is multiple different teams defining their own KPIs and metrics definitions that differ from one another. So there's a discovery aspect: people may not be aware of a metric that's already been defined, or they might not have used it in the past. That is one of the roles that we see we can also help with for our customers. So today, when we are generating metrics, we will indicate to the user: hey, there is another metric that looks exactly the same. It might even be within your same dashboard; they are named differently, but the underlying measure and dimension are exactly the same. So would you like to combine these, or have them under this name, and here is the documentation? That's something we saw really adding value, because it allows customers to understand how the metrics and the data used on the business side have proliferated on their own, and it gives them the opportunity to really organize them and have a good single source of truth for how they want to represent the business side as well.
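As a simplified picture of that de-duplication step, and not Select Star's actual algorithm, metrics can be normalized into a canonical signature of aggregation, expression, and dimensions, so that two differently named metrics with identical logic surface as candidates to merge.

```python
# Hypothetical duplicate-metric check: two metrics with the same normalized
# aggregation, expression, and dimensions are flagged as likely duplicates.
from collections import defaultdict

def _norm(value) -> str:
    return " ".join(str(value).lower().split())

def signature(metric: dict) -> tuple:
    return (
        _norm(metric.get("aggregation", "")),
        _norm(metric.get("expr", "")),
        tuple(sorted(_norm(d) for d in metric.get("dimensions", []))),
    )

def find_duplicates(metrics: list[dict]) -> list[list[str]]:
    groups = defaultdict(list)
    for metric in metrics:
        groups[signature(metric)].append(metric["name"])
    return [names for names in groups.values() if len(names) > 1]

metrics = [
    {"name": "pending_contracts", "aggregation": "count",
     "expr": "contract_id", "dimensions": ["signed_quarter"]},
    {"name": "open_contract_count", "aggregation": "COUNT",
     "expr": "Contract_ID", "dimensions": ["Signed_Quarter"]},
]
print(find_duplicates(metrics))  # -> [['pending_contracts', 'open_contract_count']]
```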
[00:37:06] Tobias Macey:
And in your work of building Select Star, working with some of your customers, helping them to bootstrap these semantic models, what are some of the most interesting or innovative or unexpected ways that you've seen that new single source of truth, the semantic representation applied either with or without AI use cases?
[00:37:26] Shinji Kim:
Yeah. In the beginning, for any customers that were using semantic models or had implemented the semantic layer, we saw the production use cases primarily for embedded dashboards, but not for their core analytics. It's really with the AI analyst that more companies are considering semantic layers for their core data marts and analytics use cases. What we are starting to envision with some of our customers is how these semantic models can be utilized not just for the AI analyst that they've defined, but for AI agents that they are building, which could be related to, say, an MCP client. How this can be embedded in their application layer is another direction that we are starting to see customers ideating towards. But the core use case has been answering business questions. And just on that, we've seen customers that want to, and are, deploying this so that their revenue ops teams or marketing teams can start using Cortex Analyst instead of always having to go directly to their analytics teams, as well as companies running retail chains, where branch managers can check how their store is doing and ask more strategic questions about what they could do differently or better for the specific location they're managing, for example.
[00:39:09] Tobias Macey:
And in your work of building Select Star, working with end users and data teams who are tackling the semantic modeling challenge, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:39:23] Shinji Kim:
I think two things. One is starting with the model and getting the first POC, whether that's through the embedded dashboard case or just testing whether your first set of business questions gets answered. That part is easy. Now the scaling side, and the actual governance and maintenance: as your business models change, as you start to add more new datasets or create more models underneath, how do you actually govern and maintain those? I think that is a real challenge that is upcoming. Related to that, the second part I was going to mention is that, just like any data modeling, with semantic models you can fall into the trap of over-modeling the data, and that can be a big time suck. So being able to actually put it in practice and continue to iterate is an important part, because we've also seen some of our customers that implemented a semantic layer in the past spend a lot of effort to model everything.
But how much of their model is actually being leveraged and used today by AI or by their querying? You know, not 100%. And then their data team got very tired of having to keep updating it, or some of it just wasn't fully updated. And once it goes out of date, it's not relevant anymore. So that's one main area where we believe in having a more systematic way to continuously update the semantic model. If it's just a file, we feel like it's a little bit harder; if it can be a system or a view, or if there are ways through an API to update that model, I think that's a much better way. As an industry, we have some ways to go there.
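One lightweight way to catch that kind of drift, sketched here against a hypothetical schema snapshot rather than a live information-schema query, is to diff the columns a semantic model references against what currently exists in the warehouse and flag anything stale.

```python
# Hypothetical staleness check: flag semantic-model fields whose underlying
# columns no longer exist in a snapshot of the warehouse schema.
def stale_references(model: dict, warehouse_columns: dict[str, set[str]]) -> list[str]:
    problems = []
    for entity in model.get("entities", []):
        existing = warehouse_columns.get(entity["table"], set())
        for field in entity.get("dimensions", []) + entity.get("measures", []):
            expr = field.get("expr", "")
            # Only simple column expressions are checked in this sketch.
            if expr.isidentifier() and expr not in existing:
                problems.append(f"{entity['table']}.{expr} ({field['name']})")
    return problems

model = {"entities": [{"table": "analytics.fct_contracts",
                       "dimensions": [{"name": "status", "expr": "status"}],
                       "measures": [{"name": "contract_value", "expr": "amount_usd"}]}]}
snapshot = {"analytics.fct_contracts": {"contract_id", "status"}}  # amount_usd dropped
print(stale_references(model, snapshot))
# -> ['analytics.fct_contracts.amount_usd (contract_value)']
```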
[00:41:19] Tobias Macey:
And for teams who are starting to think about tackling this body of work, what are the situations where you would say that building a semantic model or investing in a semantic layer is the wrong choice and the reward is not going to be worth the cost?
[00:41:33] Shinji Kim:
I think it always comes down to: is there a use case that the physical model currently cannot solve? Because if it doesn't matter how well you document your data, you're still not getting the right SQL query back from your AI, then that's definitely one reason to implement a semantic model. If not, if the physical model is clean and simple enough, then you don't always have to do this. Otherwise, it's really about where the end usage would come from to make it worthwhile, which goes back to: can you test it out with the end users that are going to leverage this? And from a use case perspective, AI should definitely be in mind. I just haven't seen other use cases that truly give you the ROI of doing the semantic model implementation beyond that, just because data teams are always very busy getting their work done as well.
[00:42:32] Tobias Macey:
You mentioned some of the work that Snowflake and Databricks are doing to incorporate semantic modeling more closely into the core experience of the underlying warehouse engine. I'm wondering what you see as the potential future for this concept of semantic modeling becoming more of a native construct of the underlying data layer or compute engines and maybe any other areas of expansion or ecosystem investment that you would like to see?
[00:43:05] Shinji Kim:
Yeah. I think that's a good way to put it. Having that native construct where the data exists is a very interesting movement for the industry. We'll have to see how well it improves as more customers adopt it. The other big part that I would love to see, and that I predict will need to happen, is more of an automated, operational way to update the semantic models, especially if I think about what semantic models will look like in the future, when the end users are primarily AI agents and AI applications, not humans. Today, semantic models are fully designed to be consumed, operated, and edited by humans, not AI agents. So a way to self-update the model, as well as maintain it so that it is more of a system instead of a YAML file, is probably where it needs to go. Now, the other piece that I think would also be important is the integration outside of the semantic model itself. One of the biggest things a lot of semantic layer companies provide is centralizing and consolidating: it doesn't matter which source and destination you're talking about, you should be able to query it using the same DSL or natural language. I think the integration between the BI tools, where the business logic all lives, and the user interaction, and how that translates into, impacts, and self-updates the semantic layer side, would be the more idealistic future that I would envision.
[00:44:46] Tobias Macey:
Yeah. I think that the fact that semantic models as a technology have so far largely been manifested as these collections of YAML files means, well, it has a fairly low barrier to entry. I think it also speaks to the lack of maturity in the space, where that is effectively the only way that you can represent them, and they are not a core concern of the underlying data systems. They're more of a bolt-on addendum, and I would like to see them become more of an integrated piece of the actual core compute capability. That's right. Yeah. Well, that's very well said. Are there any other aspects of semantic modeling, the ways that it empowers more of these AI systems and AI use cases, or the work that you're doing at SelectStar to help teams address these complexities, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:42] Shinji Kim:
I think anyone that's actually considering implementing this for AI use will really need to make sure that they are not just considering the definitions of the tables, columns, dimensions, or metrics. The other aspects that we found very helpful and important to add to the semantic layer implementation were, first, the relationships of the entities, or relationships between the columns, as well as the sample values and the synonyms. From the metadata perspective, we initially weren't sure how much these would impact the model, but from the AI perspective, these are things that really made a big difference in the quality of the results that we've gotten. So I wanted to share that as a quick tip for whoever wants to go run a semantic layer for AI use. And I also say this because some of the comments I've gotten in the past when I've talked about this on LinkedIn were, oh, I tried it but it didn't really work, the AI analyst isn't that good. And I would say that's mainly because you really have to look at what your semantic models look like. The quality of the semantic model determines a huge range of what the AI will do, based on how complete your semantic layer is.
[00:47:08] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and the ways that you're helping to address this challenge of semantic modeling, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:30] Shinji Kim:
On the data management and data governance side, having now been in the business for more than five years, we see the biggest gap, after getting a good amount of tooling in place, is always around actually implementing it and operationalizing it, putting it into processes, and having everyone actually leverage the tool. I can't say this is a segment of tooling that we need in the market; it's more in the area of AI agents and how internal teams can embed the tooling more into the day to day of the employees and everyone they work with. So that's what I would say.
[00:48:16] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences and hard-won lessons about semantic modeling, particularly in this context of AI systems and the ways that AI is being applied to the challenges of data analysis and end user use cases. I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thanks so much, Tobias. This was great. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details. Your host is Tobias Macy, and today, I'd like to welcome back Shinji Kim to talk about the role that semantic layers are playing in the era of AI. So, Shinji, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:00:59] Shinji Kim:
Sure. Well, thanks for having me back here, Tobias. Really excited to chat with you again. So hi, everyone. I'm Shinji Kim. I'm the founder, CEO of SelectStar. SelectStar is a automated data discovery and governance platform, for cloud data warehouses, data lakes, and pretty much all your data ecosystem. That's kind of what we do primarily, and, yeah, I think through data engineering podcast, we've been chatted about overall, like, needs of data discovery where it can be applied for both data democratization and data governance in the past. And, yeah, I'm excited to dive into, more of the other use cases that we are starting to find with this world of metadata and metadata management.
[00:01:46] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:50] Shinji Kim:
Yes. It was a long time ago. Back in 02/2007, I was a data scientist at or, at the time, the title was statistical analyst at Sun Microsystems. I built a mark of chain Monte Carlo models for sales forecasting, and also, kind of like almost like a interactive dashboards that sales and operations teams can use for projecting their quarter by quarter sales models. That's kind of how I got into data. And, I guess the rest is history. I've worked with, number of startups as a product manager as well as a software engineer. In the past, I started a company prior to Select Star called Concourse Systems focused on distributor stream processing, which was acquired by Akamai, back in, 2016 and, started Select Star, five years ago. I would say the the biggest part that kind of got me to where I am, especially related to data discovery and data governance and data democratization, is because I've noticed a lot of companies, especially enterprises, spending a lot of effort and resources on collecting, storing, and running computes on data, building up all the data lakes and systems and, buying all the tools. But the end users, whether that's a data analyst or product managers, folks that wants to gain answers from their data or build a new product on top of the data, usually spend end up taking weeks to find the right data that they could use for those purposes. And this is why I, started SelectStar five years ago and, has been the main capability, that's been, driving the core of our product, which is around data discovery, providing the context, of data and your data ecosystem.
[00:03:44] Tobias Macey:
And the last time we talked, data discovery, metadata management, those were very active topics. There There were a number of different companies and open source projects entering the ecosystem around that time, particularly because of the rise of the modern data stack and the number of different tools that were being brought in to work with data, the variety of data sources that were being brought into the context of data warehousing and business intelligence. And, obviously, the term modern data stack has faded from use, and a number of the companies that started around that time have either changed focus or been acquired or they've ceased to be in operation anymore.
And now there's the age of AI that is adding new stressors to the different data stacks that teams are running as well as the requirements around contextual information, data discovery. And I'm curious what you've seen as some of the main shifts in the ecosystem and in your business over the period since we last spoke.
[00:04:47] Shinji Kim:
Yeah. That's a great question and also a big question. A lot of things has happened in the last, two years in the data ecosystem and the specific area that, we play in. First and foremost, we do see a lot of data teams that have been, I guess, more efficient in their operations in terms of their, operational perspective, tooling, as well as how some of the customers that we have worked with just over the last two years, they have really gone from very centralized data team to more decentralized data team. And we've also seen the other way of the shift as well, kind of more decentralization to more hybrid. So just overall, like, we are seeing a more shift from the phase between the data engineering teams, analytics engineering teams, and then data analyst slash BI teams, how they work together.
We are starting to see that. This is also driven by lot of advancements and new, features that are being added and, being more consolidated on the platform side. So both Snowflake, Databricks, DBT, a lot of these platforms are starting to provide a lot more capabilities than the specific data warehousing or the transformation capabilities, such as documentation, data catalog, lineage. A lot of these are starting to get merged into as, those platforms also as features. So there is definitely the market shift from independent vendors, you know, and, for the, companies having the best of breed tools versus, getting a, more of a platform support, I would say, has definitely come up number of times. I would say, though, in terms of, like, just kind of, like, bubble it back up on where we stand at Select Star. From the beginning, we believe that, providing this type of single source of truth of your metadata, how your data is being created, being transformed, and being utilized within your organization is not something that you should only get the information from one platform. Most of the companies use multiple platforms and would require the cross platform visibility, a way to manage and gain insights cross platform as your whole data ecosystem. So we are starting to provide a lot more capabilities where you can truly manage and govern those information across platform and then also sharing, certain metadata from one platform to another. So we can be the glue or the bridge, and you as an end user are just have to work with one platform on the metadata management perspective. So that's one part of it. The other big part of it that I guess I missed out left out here was the AI.
So a ton more services and products that I see in the market that are from, like, AI analyst or AI, data engineer to a lot of, I guess, features around copilot, type of things that we are definitely seeing a lot. From SelectStar perspective, we also had a lot of updates along with AI, including automatically documenting all of your data assets without you having to lift up the finger. But you can also merge and easily approve what is the right documentation to providing with you, an AI assistant that can do semantic search, answer any questions that you might have your on your data, create SQL queries, or editing your SQL queries that you're trying to execute but may have, some block head in and especially with, you know, AI really getting better at understanding natural language and then being able to also do more, direct translation, to the code and SQL that, we use day to day. We are starting to I am starting to see this expansion for how data can is starting to get leveraged even beyond the data teams, with this trend. So this is definitely one area there that we are starting to see a lot of traction and, have a lot of features that, we've built towards supporting this true data democratization and self-service analytics, to enable, more people to use data.
[00:09:15] Tobias Macey:
Semantic modeling as a concept and as a term has gained a lot of attention, particularly around four or five years ago with the growth of the modern data stack and the idea that you wanted to have one canonical source of truth for the key business metrics that previously lived in the BI system, and now you wanted to be able to use across different data clients or data consumption use cases. And there were also a number of overlapping terms around that where there was the semantic layer, the metrics layer, headless BI. And I'm wondering what you see as the overlap across those terms and whether there's any notable distinction between them as far as the actual application of those ideas.
[00:10:04] Shinji Kim:
I guess to start with the motivation side of the semantic layer, like you mentioned, there is the idea that we can virtualize all the data sources and you can use one thing to query data. That is definitely still there, and there are a lot of companies that do this, although not always through a semantic layer; sometimes it's just ways to translate SQL, since most of the time those queries are designed primarily for physical data querying. I would say that with the modern data stack, the part that really bubbled up is defining that single source of truth metric. When you say revenue and when I say revenue, are we actually talking about the same number? Do we actually get the same result? I think that's one of the main reasons a lot of companies wanted to implement a semantic layer. Now, what I've been seeing in the last three to six months or so is teams starting to really leverage a semantic layer for AI analysts or AI agents, to provide better analysis or text-to-SQL from the business perspective. Pre semantic layer, with LLMs, and we've seen this a lot in SelectStar, customers ask a very semantic type of question, we give them a SQL query to run, and that's all great. But it only gets you so far, maybe 75 to 80%.
And that's mainly because, and we've thought about this a lot, why is that? What is missing? We have all the metadata of the customer, and out of, let's say, 10 different orders tables, we also know which table is the right one to use and which column is the right one to use, because we can see the previous query history to determine which ones are being utilized the most, by whom, and with what query patterns. The part that I've noticed is that as you get to business-level questions, there is a lot more nuance underneath the question that's not always defined in the SQL layer or the physical data layer. These live in the logical layer: what do we mean when we say active in "active users"? When somebody asks, give me the total number of contracts that are pending in this quarter, the LLM has to understand whether it should get that from a defined column called "contracts pending", or whether it needs to look at a status column and do an aggregation based on a pending value. Those are very direct and simple reasons why having this definition and formula underneath a semantic layer can really be helpful, because those are not necessarily defined anywhere else. And most of the time, when you build your data warehouse and your physical data models, you're thinking about reusability of data and the ways that data can be joined and queried together, not necessarily about satisfying every single business question that could be asked. So this is a finding that I had recently.
That is why the semantic model is an important layer to have if you want to invest in an AI analyst that can generate and execute queries on behalf of business users. Sorry, that got a little long, but you also asked about the difference between semantic layer, metrics layer, and headless BI, and what those even mean. I see them all as different but similar; it's all around the area of semantic models. The way I think about it is that semantic models are the logical model of the data. Most of the time these are entity-based models, and entity-based modeling is different from physical entity modeling or Kimball modeling. In a semantic model, it is much more important that things are named in a unique manner, because for each field you may have to define not just a join condition or a primary key and foreign key, but sometimes whether the relationship is one-to-many or many-to-many. But that's just the model side. The semantic layer is the implementation of those models, which can be done on the dbt semantic layer, Cube, or AtScale; those are all semantic layer companies and tooling anyone can use to implement semantic models. Then you also mentioned the metrics layer. I think the metrics layer is almost the same as the semantic layer, but focused on just calculated metrics, so primarily the core KPIs and the values that always have some aggregation. I recently found out that in Snowflake's definition of metrics in their semantic view or semantic model YAML files, it is mandatory for any metric to always have an aggregation; if not, then it should be a measure with an expression, for example. So some people may get into that distinction as well. And then last but not least, headless BI. I heard a lot more about headless BI during the modern data stack era, but headless BI is really just BI capabilities, like querying or exploration for data analysis, exposed as an API instead of having a visual UI tied to it. The biggest part about headless BI, to me, is that it can still be a BI, but there is a very clear separation of concerns between the visualization layer and the data layer with the semantic models.
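To make that metric-versus-measure distinction concrete, here is a minimal Python sketch of a semantic model expressed as plain data, with a check that mirrors the rule described above: a metric must carry an aggregation, otherwise it belongs with the measures as an expression. The field names (entities, measures, metrics, relationships) are illustrative and not tied to any particular vendor's YAML schema.

```python
# Hypothetical sketch: a minimal semantic model as plain Python data,
# plus a check that every "metric" carries an aggregation; otherwise it
# should be defined as a measure with an expression instead.
semantic_model = {
    "entities": [
        {
            "name": "orders",
            "primary_key": "order_id",
            "relationships": [
                # many orders belong to one customer (illustrative)
                {"to": "customers", "type": "many_to_one",
                 "join": "orders.customer_id = customers.customer_id"},
            ],
            "dimensions": [{"name": "order_date", "type": "time"}],
            "measures": [{"name": "order_amount", "expr": "amount_usd"}],
            "metrics": [
                {"name": "total_revenue", "aggregation": "sum",
                 "measure": "order_amount"},
            ],
        }
    ]
}

def validate(model: dict) -> list[str]:
    """Flag metrics that are missing an aggregation."""
    problems = []
    for entity in model["entities"]:
        for metric in entity.get("metrics", []):
            if not metric.get("aggregation"):
                problems.append(
                    f"{entity['name']}.{metric['name']}: metrics need an "
                    "aggregation; define it as a measure with an expression instead"
                )
    return problems

print(validate(semantic_model) or "model looks consistent")
```

The same information would normally live in a YAML file consumed by whichever semantic layer engine is in use; the dictionary form above is only meant to show which pieces of definition are involved.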
[00:16:24] Tobias Macey:
And the other interesting aspect of all of these concepts of semantic modeling, semantic layer, headless BI, etcetera, is that data warehousing and business intelligence as a use case and an industry goal have been around for decades at this point. There are many established patterns for doing the data modeling, such as star schema, data vault, etcetera, as well as methods for being able to gain better performance in the form of things like OLAP. All of those are intended to be built around the business entities, business objects, business concerns. And I'm curious what are the differences that are added on or the new capabilities that are introduced by virtue of using these semantic models or semantic layers that sit on top of the data warehouse and one level above the core star schema, data vault schema, etcetera? Yeah. I would say the core difference that semantic layer and semantic modeling really brings on
[00:17:29] Shinji Kim:
is the business logic perspective of how metrics should be calculated and which specific dimensions and time dimensions they can be applied to. It is technically doable by building star schemas or other models too, but what we see is that that still happens on the set of physical data models, in the naming of things or the way things get joined. It usually requires some level of aggregation or filtering, because the end dimension or the end measure is a representation of some type of business KPI or metric. So that's really the main difference, as I would define it. Then there is the part around entities. Data modeling really starts from understanding the sources that we define, and then we may pull some of that source data to build the initial physical data models so that they can be joined and queried together to get answers. Entity models, from the semantic layer perspective, usually come from how the business looks at data instead. It starts from, just as an example, what is revenue? And within revenue, how do we define it? Are we breaking it down by region or product line? What exceptions might we have to put in? It's driven by the business process and the reporting that you may need to do. So it approaches from the other end of the spectrum, and that's why the entity models usually end up looking a little different from the physical models they get designed from. I've definitely seen companies that, without necessarily having a quote unquote separate semantic layer like YAML files,
have implemented their own views and tables and called that their semantic model: these are the semantic model tables that we should use for BI purposes, and so on. And now bringing us back around to the role of AI in all of this
[00:19:53] Tobias Macey:
modeling, the additional layers, the ways that we need to think about the presentation and access of data. What are some of the ways that building these semantic models helps when brought into the context of LLM use cases? Whether that is the English-to-SQL transformations, where everybody wants to be able to just talk to their BI, or being able to use your existing data assets to power things like RAG use cases, where you want to feed the appropriate contextual information to the LLM at request time. And what are some of the ways that you need to think about the additional attributes that you want to feed into your semantic model when you are building it with AI in mind? I think an easy way to think about the semantic layer
[00:20:42] Shinji Kim:
for AI is as the configuration and guardrails that you're providing so the AI knows how certain things should be calculated, rather than letting it infer that from all the raw metadata and queries that you might have had in the past. I think that's the main difference. If you have those metrics, dimensions, and the expressions for those relationships defined within the semantic model, in a semantic layer, it is a lot easier and more straightforward for the AI to just refer to that. I'm not saying that it's impossible with RAG systems, having the AI run just on top of your metadata; and if you already have a really clear data model, you may not actually need a semantic layer. But most of the time you're working on top of very messy data, where you get to a point where you're not sure which tables should be the certified tables for the AI to query, because you're not sure if you've included all of them. You might be feeling like you're missing something, or that you're including something that might introduce noise. So the semantic layer just gives you that clearer direction and guardrails. It's almost like when you're using ChatGPT: if I'm asking it to write an email or a paragraph or a blog post, and I give it more structure, the purpose, or the audience I'm writing for, it will give me a much better result. I see the semantic layer playing that type of role for AI analysts.
At SelectStar, as we were testing the semantic layer specifically for AI analysts, we've done a ton of iterations with Cortex Analyst on Snowflake in particular, and we've seen step-change differences in quality when you provide the semantic model. Especially when you can provide a more complete semantic model that has not just the metadata but also the relationships, which are the primary keys and foreign keys, what the join conditions actually look like, as well as which are the dimensions, time dimensions, and metrics, which aggregations apply, and sample values. Things like this really make the end result a lot more accurate and much closer to what you really intended than leaving the AI to work with the data as is. There is definitely some benefit to having just the RAG system as well, but when you are trying to get from 80% to 95 or 98%,
having some of these definitions in place will play a big role.
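As a rough illustration of that "completeness" idea, here is a small Python sketch that checks which of the elements mentioned above (relationships with join conditions, dimensions, time dimensions, metrics, sample values) are present in a semantic model before it is handed to an AI analyst. It reuses the same illustrative dictionary shape as the earlier sketch and is not any vendor's validation API.

```python
# Hypothetical completeness check for an AI-ready semantic model.
# The keys checked here follow the illustrative structure used earlier,
# not a specific product's schema.
REQUIRED_PER_ENTITY = ["relationships", "dimensions", "metrics", "sample_values"]

def completeness_report(model: dict) -> dict[str, list[str]]:
    """Return, per entity, the enrichment fields that are missing or empty."""
    report = {}
    for entity in model.get("entities", []):
        missing = [key for key in REQUIRED_PER_ENTITY if not entity.get(key)]
        # Time dimensions get a dedicated check, since they drive "this quarter"
        # style questions that business users tend to ask.
        if not any(d.get("type") == "time" for d in entity.get("dimensions", [])):
            missing.append("time_dimension")
        report[entity["name"]] = missing
    return report

example = {
    "entities": [
        {
            "name": "contracts",
            "relationships": [{"to": "customers", "type": "many_to_one"}],
            "dimensions": [{"name": "signed_at", "type": "time"},
                           {"name": "status", "sample_values": ["pending", "active"]}],
            "metrics": [{"name": "pending_contracts", "aggregation": "count"}],
            # no entity-level sample_values -> will be flagged
        }
    ]
}

print(completeness_report(example))  # {'contracts': ['sample_values']}
```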
[00:23:43] Tobias Macey:
And then another angle to the question of RAG and semantic models is the fact that virtually every database at this point has added some sort of vector storage and vector indexing capability, obviously at varying levels of capability and with varying feature sets. I'm curious how you've seen that additional data type incorporated into the semantic definitions, and how you're using it as an additional avenue for LLMs and AI systems to access the data, or as a starting point for discovering the different semantic models: either, yes, this is exactly the model I want, and the k-nearest-neighbor search or HNSW gave me the thing that was most applicable; or, hey, I landed on this model, but now that I see the additional contextual information, it's actually not what I want, and I need to start over.
[00:24:44] Shinji Kim:
Yeah. I think that's a very interesting point, and it is possible to, I guess, generate the semantic layer from the vector model that your database is providing. At the same time, the part that we found most accurate and interesting is when you can bring in historical usage data, whether that's coming primarily from analysts' SELECT queries or from usage on the BI side. So I think that is definitely one way to get there. But again, most of the time, what defines the semantic layer, which is primarily the business logic side, is not always parsable from just the metadata that the databases have.
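To sketch what that retrieval starting point could look like, here is a toy Python example that indexes short descriptions of semantic models with a trivial hashed bag-of-words embedding and returns the nearest model for a question via cosine similarity. In practice you would swap the toy `embed` function for a real embedding model and the list for your database's vector index; the model names and descriptions are made up for illustration.

```python
import hashlib
import math

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding; stand-in for a real embedding model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Short natural-language descriptions of each semantic model (illustrative).
models = {
    "revenue": "total revenue by region and product line, monthly and quarterly",
    "contracts": "contracts by status, pending or active, counts per quarter",
    "active_users": "active users daily and monthly, by plan tier",
}
index = {name: embed(desc) for name, desc in models.items()}

def nearest_model(question: str) -> str:
    q = embed(question)
    return max(index, key=lambda name: cosine(q, index[name]))

print(nearest_model("how many contracts are pending this quarter"))  # expected: contracts
```

Retrieval like this only narrows the search to a candidate model; as noted above, the business logic inside the model still has to come from somewhere other than the database's own metadata.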
[00:25:32] Tobias Macey:
And then the other aspect of building semantic models is that it is a non zero effort required to actually create and maintain them. It requires a certain amount of contextual knowledge about the business, the specifics about what the data means, how it's being applied, and some of the cases where it can be misapplied. And I'm curious how you're seeing teams incorporate LLMs into that discovery and development piece of the puzzle for being able to actually accelerate the creation and reduce some of the manual toil involved with the maintenance of those definitions.
[00:26:11] Shinji Kim:
Yeah, for sure. And the tooling is really starting to improve in this respect as well. But even just by using Claude or ChatGPT, we've seen teams building their semantic layer by feeding in how they define their metrics, along with the list of tables and columns that exist, to output the first version of the YAML files they wanted to create. So constructing the models, and also maintaining those models, is possible. The part that I would say is really hard today is the continued maintenance of the models, as well as having those models stay true as they get used, which, just like any AI agent or application, requires monitoring and evaluation alongside the versions of the semantic models that you're implementing. But the place we see as a really good place to start, and this is where we've been spending a lot of time helping our customers implement semantic models, is the semantic models that you might have already implemented within your BI tool. Within your BI tools, the certified dashboards, the dashboards that your end users and business stakeholders are using, a lot of them are already connected to the critical data elements of your data warehouses and data lakes. That gives you the subset of tables that should be part of the semantic layer. I think that's a really great place to start, because you will also see that those tables tend to be well maintained and have quality checks built in. So if you can feed the LLM with that list of metadata, that is a good place to start, and then you can build upon it as you look at other dashboards or update it as those dashboards evolve on the business side.
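A minimal sketch of that bootstrapping flow might look like the following: gather the metric definitions and the table and column list, assemble them into a prompt, and ask an LLM for a first-draft semantic model YAML to review. The `call_llm` function is a hypothetical placeholder for whichever model provider or client you use; the prompt wording, metric definitions, and file name are assumptions for illustration, not SelectStar's actual pipeline.

```python
# Hypothetical bootstrap: draft a semantic model YAML from existing
# metric definitions plus a table/column inventory, then hand the
# draft to a human (or a catalog tool) for review.
metric_definitions = {
    "total_revenue": "sum of amount_usd on closed orders, reported monthly",
    "pending_contracts": "count of contracts whose status is 'pending'",
}
tables = {
    "analytics.orders": ["order_id", "customer_id", "amount_usd", "status", "closed_at"],
    "analytics.contracts": ["contract_id", "customer_id", "status", "signed_at"],
}

def build_prompt(metrics: dict, table_map: dict) -> str:
    lines = ["Draft a semantic model as YAML with entities, dimensions,",
             "measures, metrics (each with an aggregation), and relationships.",
             "", "Business metric definitions:"]
    lines += [f"- {name}: {definition}" for name, definition in metrics.items()]
    lines.append("")
    lines.append("Available tables and columns:")
    lines += [f"- {table}: {', '.join(cols)}" for table, cols in table_map.items()]
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    # Placeholder: plug in your model provider's client here.
    return "# draft semantic model YAML would be returned by the model\n"

draft_yaml = call_llm(build_prompt(metric_definitions, tables))
with open("semantic_model_draft.yaml", "w") as f:
    f.write(draft_yaml)  # reviewed and versioned by a human before use
```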
[00:28:36] Tobias Macey:
The other piece of the equation when dealing with semantic layers is which underlying technology you should use to actually maintain and expose the models that you define, where most of them use some sort of YAML definition: this is the SQL query that translates to this particular metric, and maybe these are the parameters that can be fed in to aggregate along different axes, etcetera. And there are different engines to actually execute those. You already mentioned Cube as one of the leading ones, which is also marketed as a headless BI system. I know that the folks at SOTA Data have a semantic layer capability.
Obviously, dbt has acquired a metrics layer and incorporated that as part of their product suite. I'm curious what you see as some of the useful heuristics and selection criteria for teams who are starting to evaluate how they are actually going to build and expose these semantic models.
[00:29:40] Shinji Kim:
Yeah. That's a great question. First of all, the BI side of the house most of the time does have some type of semantic model. They have their own semantic layer that is proprietary, so it's hard to get that logic out, but there are some BI tools that allow you to export it, or at least to maintain and see it. LookML is that case, and Hex has a way to implement or accept other semantic models. For Power BI and Tableau, these are things you have to stitch together, but through the API you should be able to get at the business logic built underneath. And that's primarily how, at SelectStar, we are reverse engineering decent semantic models and business logic from the BI tools as well. Now, if we're talking about third-party semantic layers, yes, the ones that you've mentioned are the primary ones. There is also a vendor called AtScale.
I found the approach that Snowflake and Databricks are taking also very interesting. If you think about the semantic layer from the perspective of dbt and Cube, for example, these are a layer that still runs its queries and execution on top of the data lake and data warehouse. You have decoupled it out of the BI, but now you've put it onto the transformation layer, or some other place where you still have to define the YAML file on a third party sitting on top of the data warehouse or lakehouse that you have. What Snowflake is doing is, first, they started with the YAML file as a semantic model to be fed into Cortex Analyst. Now they are moving toward allowing users to have a semantic view that contains all of these components of a semantic layer. It's actually a view that you can create out of that earlier semantic model YAML file, but it's a view that can be queried and acts and lives like other tables and views in your schema. I thought that was really interesting because it's a lot more native in that case, and it also follows the same security measures you have: the roles and permissions, dynamic masking, and tagging, if you have them, all get applied natively. So I thought that was a really interesting approach. Now, semantic views at the time of this recording are in public preview, so it's very early, but I think there is a really optimistic promise there if you can have that natively on the data warehouse side and then just point your BI dashboards to run on top of it. Databricks, I don't think there is public documentation on this yet because it's so early, but they have Unity Catalog metrics that look very similar to something like the semantic view. I think they've also recognized the need for a semantic layer on top of Unity Catalog in order to power Databricks Genie for AI/BI.
So it will be really interesting to see what that will look like and how it impacts AI analyst performance. One other part I think is worth mentioning is companies like Cube. One of the added benefits that I've seen from the semantic layer, and I think one of the reasons why dbt brought on Transform and now has a metrics layer as a proprietary feature, is that once you have that definition, almost like an OLAP cube, you can potentially cache those calculations or aggregations, because they're defined and you know they're going to get queried. You can make those queries a lot more efficient because it's already defined that way. That is right now an added benefit when you have a semantic layer like Cube today, but it should be more like table stakes, and it's one of the reasons why more companies should consider having a semantic layer in their toolchain, I would say.
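As a toy illustration of that caching benefit, the sketch below precomputes a declared metric (a sum grouped by a declared dimension) and serves repeat requests from a cache keyed by the metric and dimension names. Real semantic layer engines do this with materialized pre-aggregations against the warehouse; the in-memory rows and names here are made up for the example.

```python
from collections import defaultdict

# Made-up fact rows standing in for a warehouse table.
orders = [
    {"region": "NA", "amount_usd": 120.0},
    {"region": "NA", "amount_usd": 80.0},
    {"region": "EU", "amount_usd": 200.0},
]

# Because the metric and its allowed dimensions are declared up front,
# the engine knows exactly which aggregations are worth precomputing.
_cache: dict[tuple[str, str], dict[str, float]] = {}

def metric_by_dimension(metric: str, dimension: str) -> dict[str, float]:
    key = (metric, dimension)
    if key in _cache:          # repeat queries skip the scan entirely
        return _cache[key]
    totals: dict[str, float] = defaultdict(float)
    for row in orders:
        totals[row[dimension]] += row[metric.removeprefix("sum_")]
    _cache[key] = dict(totals)
    return _cache[key]

print(metric_by_dimension("sum_amount_usd", "region"))  # computed once
print(metric_by_dimension("sum_amount_usd", "region"))  # served from cache
```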
[00:34:11] Tobias Macey:
One of the things that you mentioned earlier as far as the work that you're doing when working with systems like Tableau or Power BI is that you said that because you're able to ingest the metadata from those business intelligence systems, you can reverse engineer some of the business logic that goes into the underlying dashboards, etcetera. And I'm curious how you're thinking about that as a potential on ramp for data teams who want to more explicitly define these semantic models in a separate technology layer that is divorced from the data visualization system so that they can expose it to more use cases and expose it to more of these LLMs, etcetera?
[00:34:51] Shinji Kim:
Yeah, that is definitely the main use case that we're trying to support. Today, we actually provide the YAML file to the end customers so that they can modify it as they want or put it in their GitHub. The role that we play initially is to give you that quick start and bootstrap the semantic layer, without the hard work of looking at all the physical models to figure out what the metrics and dimensions should be, and filling in the details to make it work. Where we see this going, as we add support for providing this semantic layer YAML file in many different formats, whether that's for the dbt semantic layer, Cube, or Databricks, and so on, is continuing to also provide a system that can maintain and update the semantic layer portion.
That part, I think, is important. On the BI side, one of the things you end up running into is multiple different teams defining their own KPIs and metric definitions that differ from one another. So there is a discovery aspect: people may not be aware that the metric is already defined, or they might simply not have used it in the past. That is one of the places where we see we can help our customers. Today, when we are generating metrics, we will indicate to the user: hey, there is another metric that looks exactly the same, maybe even within your same dashboard; they are named differently, but the underlying measure and dimension are exactly the same. Would you like to combine these, or keep them under this name, and here is the documentation? That is something we saw really adding value, because it allows customers to understand how metrics and the data used on the business side have proliferated on their own, and it gives them the opportunity to organize them and have a good single source of truth for how they want to represent the business side as well.
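A simple version of that duplicate detection can be sketched as grouping metrics by their normalized measure expression and dimension set, and flagging groups that carry more than one name. The metric list and normalization below are invented for illustration; a production catalog would also compare lineage and filters.

```python
from collections import defaultdict

# Invented metric definitions pulled from two dashboards.
metrics = [
    {"name": "revenue", "measure": "SUM(amount_usd)", "dimensions": ["region"]},
    {"name": "total_sales", "measure": "sum(amount_usd)", "dimensions": ["region"]},
    {"name": "active_users", "measure": "COUNT(DISTINCT user_id)", "dimensions": ["day"]},
]

def normalize(expr: str) -> str:
    # Crude normalization: case and whitespace only.
    return "".join(expr.lower().split())

def find_duplicates(metric_list: list[dict]) -> list[list[str]]:
    groups = defaultdict(list)
    for m in metric_list:
        key = (normalize(m["measure"]), frozenset(m["dimensions"]))
        groups[key].append(m["name"])
    # A group with more than one distinct name is a likely duplicate definition.
    return [names for names in groups.values() if len(set(names)) > 1]

print(find_duplicates(metrics))  # [['revenue', 'total_sales']]
```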
[00:37:06] Tobias Macey:
And in your work of building Select Star, working with some of your customers, helping them to bootstrap these semantic models, what are some of the most interesting or innovative or unexpected ways that you've seen that new single source of truth, the semantic representation applied either with or without AI use cases?
[00:37:26] Shinji Kim:
Yeah. In the beginning, for customers that were using semantic models or had implemented a semantic layer, we saw the production use cases primarily in embedded dashboards, not in their core analytics. It's really with the AI analyst that more companies are considering having semantic layers for their core data marts and analytics use cases. What we're starting to envision with some of our customers is how these semantic models can be utilized not just by the AI analyst they've defined, but by the AI agents that they are building, which could be related to, say, an MCP client. How this can be embedded in their application layer is another direction we're starting to see customers ideate toward. But the core use case has been answering business questions. On that front, we've seen customers deploying this so their revenue ops teams or marketing teams can start using Cortex Analyst instead of always having to go to their analytics teams directly, as well as companies running retail chains, where branch managers can check how their store is doing and ask more strategic questions about what they could do differently or better for the specific location they are managing, for example.
[00:39:09] Tobias Macey:
And in your work of building Select Star, working with end users and data teams who are tackling the semantic modeling challenge, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:39:23] Shinji Kim:
I think two things. One is that starting with the model and getting the first POC, whether that's through the embedded dashboard case or just testing whether your first set of business questions gets answered, that part is easy. The scaling side, the actual governance and maintenance, as your business models change, as you start to add new datasets or create more models underneath, how do you actually govern and maintain those? That is a real challenge that is coming. Related to that, the second part I was going to mention is that, just like any data modeling, with semantic models you can fall into the trap of over-modeling the data, and that can be a big time suck. So being able to actually put it into practice and continue to iterate is an important part, because we've also seen some of our customers that implemented a semantic layer in the past spend a lot of effort modeling everything.
But how much of their model is actually being leveraged and used today, by AI or in their querying? Not 100%. And then their data team got very tired of having to keep updating it, or some of it just wasn't fully updated, and once it goes out of date, it's no longer relevant. So that's one main area where we believe in having a more systematic way to continuously update the semantic model. If it has to fit in a file, we feel that's a bit harder; if it can be a system or a view, or if there are ways through an API to update that model, I think that's a much better way. As an industry, we still have some ways to go there.
[00:41:19] Tobias Macey:
And for teams who are starting to think about tackling this body of work, what are the situations where you would say that building a semantic model or investing in a semantic layer are the wrong choice and
[00:41:33] Shinji Kim:
the reward is not going to be worth the cost. I think it always comes down to: one, is there a use case that the physical model currently cannot solve? If, no matter how well you document your data, you're still not getting the right SQL query back from your AI, then that's definitely one of the reasons to implement a semantic model. If not, if the physical model is clean and simple enough, then you don't always have to do this. Otherwise, it's really about where the end usage would come from to make it worthwhile, which goes back to: can you test it out with the end users who are going to leverage this? From a use case perspective, AI should definitely be in mind; I just haven't seen other use cases that truly give you the ROI of doing the semantic model implementation beyond that, just because data teams are always very busy getting their work done as well.
[00:42:32] Tobias Macey:
You mentioned some of the work that Snowflake and Databricks are doing to incorporate semantic modeling more closely into the core experience of the underlying warehouse engine. I'm wondering what you see as the potential future for this concept of semantic modeling becoming more of a native construct of the underlying data layer or compute engines and maybe any other areas of expansion or ecosystem investment that you would like to see?
[00:43:05] Shinji Kim:
Yeah, I think that's a good way to put it. Having that native construct where the data exists is a very interesting movement for the industry; we'll have to see how well it improves as more customers adopt it. The other big part that I would love to see, and that I predict will need to happen, is a more automated, operational way to update the semantic models, especially if I think about what semantic models will look like in the future when the end users are primarily AI agents and AI applications, not humans. Today, semantic models are fully designed to be consumed, operated, and edited by humans, not AI agents. So a way to self-update the model, and maintain it so that it is more of a system instead of a YAML file, is probably where it needs to go. The other piece I think would be important is integration outside of the semantic model itself. One of the biggest things a lot of semantic layer companies provide is this centralizing and consolidating: no matter which source and destination you're talking about, you should be able to query it using the same DSL or natural language. The integration between the BI tools, where the business logic all lives, and the user interaction, and how that translates into, impacts, and self-updates the semantic layer side, is the more idealistic future that I would envision.
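One way to picture "a system instead of a YAML file" is a small service that owns the semantic model, exposes an update call, and keeps a version history so agents and humans can propose changes that are reviewed and evaluated before they go live. The sketch below is an in-memory stand-in for that idea; the class and method names are hypothetical, not a reference to any existing product.

```python
import copy
from datetime import datetime, timezone

class SemanticModelStore:
    """Hypothetical versioned store: updates go through an API, not file edits."""

    def __init__(self, initial_model: dict):
        self.versions = [{"model": copy.deepcopy(initial_model),
                          "author": "bootstrap",
                          "at": datetime.now(timezone.utc)}]

    @property
    def current(self) -> dict:
        return self.versions[-1]["model"]

    def propose_metric(self, name: str, definition: dict, author: str) -> int:
        """Add or replace a metric and record a new version; returns the version number."""
        model = copy.deepcopy(self.current)
        model.setdefault("metrics", {})[name] = definition
        self.versions.append({"model": model, "author": author,
                              "at": datetime.now(timezone.utc)})
        return len(self.versions) - 1

    def rollback(self, version: int) -> None:
        """Pin the model back to an earlier version if evaluation regresses."""
        self.versions.append(copy.deepcopy(self.versions[version]))

store = SemanticModelStore({"metrics": {"total_revenue": {"aggregation": "sum"}}})
v = store.propose_metric("pending_contracts",
                         {"aggregation": "count", "filter": "status = 'pending'"},
                         author="ai-agent")
print(v, sorted(store.current["metrics"]))  # 1 ['pending_contracts', 'total_revenue']
```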
[00:44:46] Tobias Macey:
Yeah. I think that the fact that semantic models as a technology have so far largely been manifested as collections of YAML files, well, it has a fairly low barrier to entry, but I think it also speaks to the lack of maturity in the space, where that is effectively the only way that you can represent them, and they are not a core concern of the underlying data systems. They're more of a bolt-on addendum, and I would like to see them become more of an integrated piece of the actual core compute capability. That's right. Yeah. Well, that's very well said. Are there any other aspects of semantic modeling, the ways that it empowers more of these AI systems and AI use cases, or the work that you're doing at SelectStar to help teams address these complexities, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:42] Shinji Kim:
I think anyone who is actually considering implementing this for AI use really needs to make sure they're not just considering the definitions of the tables, columns, dimensions, or metrics. The other aspects that we found very helpful and important to add into the semantic layer implementation were, first, the relationships of the entities or relationships between the columns, as well as sample values and synonyms. From a metadata perspective, we initially weren't sure how much these would impact the model, but from the AI perspective, these are the things that really made a big difference in the quality of the results we've gotten. So I wanted to share that as a quick tip for whoever wants to run a semantic layer for AI use. I also say this because some of the comments I've gotten in the past when I've talked about this on LinkedIn were, oh, I tried it but it didn't really work, AI analysts aren't that good. And I would say that's mainly because you really have to look at what your semantic models look like. The quality of the semantic model determines a huge range of what the AI will do, based on how complete your semantic layer is.
[00:47:08] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and the ways that you're helping to address this challenge of semantic modeling, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:30] Shinji Kim:
On the data management and data governance side, from what we have seen, now being in the business for more than five years, the biggest gap after getting a good amount of tooling is always around actually implementing it and operationalizing it, putting it into processes, and having everyone actually leverage the tool. I can't say this is a specific segment of tooling that the market needs; it's more about where AI agents fit in and how internal teams can really embed the tooling into the day to day of the employees and everyone they work with. So that's what I would say.
[00:48:16] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences and hard-won lessons about semantic modeling, particularly in this context of AI systems and the ways that AI is being applied to the challenges of data analysis and end-user use cases. I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thanks so much, Tobias. This was great. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Welcome
Shinji Kim's Journey in Data
Evolving Data Ecosystem
Semantic Modeling and Its Importance
AI and Semantic Models
Choosing the Right Semantic Layer Technology
Applications and Challenges of Semantic Models
When Not to Use Semantic Models
Future of Semantic Modeling
Conclusion and Final Thoughts