Summary
In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Shinji Kim to discuss the evolving role of semantic layers in the era of AI. As they explore the challenges of managing vast data ecosystems and providing context to data users, they delve into the significance of semantic layers for AI applications. They dive into the nuances of semantic modeling, the impact of AI on data accessibility, and the importance of business logic in semantic models. Shinji shares her insights on how SelectStar is helping teams navigate these complexities, and together they cover the future of semantic modeling as a native construct in data systems. Join them for an in-depth conversation on the evolving landscape of data engineering and its intersection with AI.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Shinji Kim about the role of semantic layers in the era of AI
- Introduction
- How did you get involved in the area of data management?
- Semantic modeling gained a lot of attention ~4-5 years ago in the context of the "modern data stack". What is your motivation for revisiting that topic today?
- There are several overlapping concepts – "semantic layer," "metrics layer," "headless BI." How do you define these terms, and what are the key distinctions and overlaps?
- Do you see these concepts converging, or do they serve distinct long-term purposes?
- Data warehousing and business intelligence have been around for decades now. What new value does semantic modeling provide beyond practices like star schemas, OLAP cubes, etc.?
- What benefits does a semantic model provide when integrating your data platform into AI use cases?
- How does using AI as an interface to your analytical use cases differ from powering customer-facing AI applications with your data?
- The effort to create and maintain a set of semantic models is non-zero. What role can LLMs play in helping to propose and construct those models?
- For teams who have already invested in building this capability, what additional context and metadata is necessary to provide guidance to LLMs when working with their models?
- What's the most effective way to create a semantic layer without turning it into a massive project?
- There are several technologies available for building and serving these models. What are the selection criteria that you recommend for teams who are starting down this path?
- What are the most interesting, innovative, or unexpected ways that you have seen semantic models used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with semantic modeling?
- When is semantic modeling the wrong choice?
- What do you predict for the future of semantic modeling?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- SelectStar
- Sun Microsystems
- Markov Chain Monte Carlo
- Semantic Modeling
- Semantic Layer
- Metrics Layer
- Headless BI
- Cube
- AtScale
- Star Schema
- Data Vault
- OLAP Cube
- RAG == Retrieval Augmented Generation
- KNN == K-Nearest Neighbors
- HNSW == Hierarchical Navigable Small World
- dbt Metrics Layer
- Soda Data
- LookML
- Hex
- PowerBI
- Tableau
- Semantic View (Snowflake)
- Databricks Genie
- Snowflake Cortex Analyst
- Malloy
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey, and today I'd like to welcome back Shinji Kim to talk about the role that semantic layers are playing in the era of AI. So, Shinji, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:00:59] Shinji Kim:
Sure. Well, thanks for having me back here, Tobias. Really excited to chat with you again. So hi, everyone. I'm Shinji Kim. I'm the founder and CEO of SelectStar. SelectStar is an automated data discovery and governance platform for cloud data warehouses, data lakes, and pretty much all of your data ecosystem. That's what we do primarily, and through the Data Engineering Podcast we've chatted in the past about the overall need for data discovery and how it can be applied for both data democratization and data governance. I'm excited to dive into more of the other use cases that we are starting to find in this world of metadata and metadata management.
[00:01:46] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:50] Shinji Kim:
Yes. It was a long time ago. Back in 2007, I was a data scientist, or, at the time, the title was statistical analyst, at Sun Microsystems. I built Markov chain Monte Carlo models for sales forecasting, and also almost like interactive dashboards that sales and operations teams could use for projecting their quarter-by-quarter sales models. That's how I got into data. And I guess the rest is history. I've worked with a number of startups as a product manager as well as a software engineer. Prior to Select Star, I started a company called Concord Systems focused on distributed stream processing, which was acquired by Akamai back in 2016, and I started Select Star five years ago. I would say the biggest part that got me to where I am, especially related to data discovery, data governance, and data democratization, is that I've noticed a lot of companies, especially enterprises, spending a lot of effort and resources on collecting, storing, and running compute on data, building up all the data lakes and systems and buying all the tools. But the end users, whether that's a data analyst or product managers, folks that want to gain answers from their data or build a new product on top of the data, usually end up taking weeks to find the right data that they could use for those purposes. That's why I started SelectStar five years ago, and data discovery, providing the context of your data and your data ecosystem, has been the main capability driving the core of our product.
[00:03:44] Tobias Macey:
And the last time we talked, data discovery and metadata management were very active topics. There were a number of different companies and open source projects entering the ecosystem around that time, particularly because of the rise of the modern data stack and the number of different tools that were being brought in to work with data, the variety of data sources that were being brought into the context of data warehousing and business intelligence. And, obviously, the term modern data stack has faded from use, and a number of the companies that started around that time have either changed focus or been acquired or ceased to be in operation.
And now there's the age of AI that is adding new stressors to the different data stacks that teams are running as well as the requirements around contextual information, data discovery. And I'm curious what you've seen as some of the main shifts in the ecosystem and in your business over the period since we last spoke.
[00:04:47] Shinji Kim:
Yeah. That's a great question and also a big question. A lot of things have happened in the last two years in the data ecosystem and the specific area that we play in. First and foremost, we do see a lot of data teams that have become more efficient in their operations, from an operational and tooling perspective. Some of the customers that we have worked with over the last two years have really gone from a very centralized data team to a more decentralized one, and we've also seen the shift the other way as well, from more decentralization to more of a hybrid. So overall, we are seeing a shift in the interface between the data engineering teams, analytics engineering teams, and data analyst slash BI teams, and how they work together.
We are starting to see that. This is also driven by a lot of advancements and new features that are being added and consolidated on the platform side. So Snowflake, Databricks, dbt, a lot of these platforms are starting to provide a lot more capabilities than just the data warehousing or transformation capabilities, such as documentation, data catalog, and lineage. A lot of these are starting to get merged into those platforms as features. So there is definitely a market shift, from independent vendors and companies having best-of-breed tools toward getting more of a platform's support, and that has definitely come up a number of times. To bubble it back up to where we stand at Select Star: from the beginning, we believed that providing this type of single source of truth for your metadata, how your data is being created, transformed, and utilized within your organization, is not something where you should only get the information from one platform. Most companies use multiple platforms and require cross-platform visibility, a way to manage and gain insights across platforms as your whole data ecosystem. So we are starting to provide a lot more capabilities where you can truly manage and govern that information across platforms and also share certain metadata from one platform to another. So we can be the glue or the bridge, and you as an end user just have to work with one platform from the metadata management perspective. That's one part of it. The other big part that I left out here was AI.
There are a ton more services and products that I see in the market, from AI analysts or AI data engineers to a lot of copilot-type features, that we are definitely seeing a lot of. From the SelectStar perspective, we also had a lot of updates around AI, from automatically documenting all of your data assets without you having to lift a finger, while still letting you merge and easily approve what the right documentation is, to providing you with an AI assistant that can do semantic search, answer any questions that you might have about your data, create SQL queries, or edit the SQL queries that you're trying to execute but may be blocked on. And especially with AI really getting better at understanding natural language and being able to do more direct translation to the code and SQL that we use day to day, I am starting to see this expansion in how data is starting to get leveraged even beyond the data teams. So this is definitely one area where we are starting to see a lot of traction, and we have a lot of features that we've built towards supporting this true data democratization and self-service analytics, to enable more people to use data.
[00:09:15] Tobias Macey:
Semantic modeling as a concept and as a term gained a lot of attention, particularly around four or five years ago with the growth of the modern data stack and the idea that you wanted to have one canonical source of truth for the key business metrics that previously lived in the BI system, and now you wanted to be able to use them across different data clients or data consumption use cases. And there were also a number of overlapping terms around that, where there was the semantic layer, the metrics layer, headless BI. And I'm wondering what you see as the overlap across those terms and whether there's any notable distinction between them as far as the actual application of those ideas.
[00:10:04] Shinji Kim:
I guess to start with the motivation side of the semantic layer, like you mentioned, there is the part around, we can virtualize all the data sources and you can use one thing to query data. I think that is definitely still there, and there are a lot of companies that do this, not always through a semantic layer, but through ways to just translate SQL, because most of the time those queries are primarily designed for physical data querying. I would say with the modern data stack, the part that has really bubbled up is defining that single source of truth metric. When you say revenue and when I say revenue, are we actually talking about the same number? Do we actually get the same result? And I think that's definitely one of the reasons why a lot of companies wanted to implement or have a semantic layer. Now today, and what I've been seeing in the last three to six months or so, is teams starting to really leverage a semantic layer for AI analysts or AI agents to be able to provide better analysis or text-to-SQL from the business perspective. Pre semantic layer, with the LLMs, and we've seen this a lot in Select Star, customers ask a very semantic type of question, and we give them a SQL query to run, and that's all great. But it only gets you so far, maybe 75 to 80%.
And that's mainly because, and we've thought about this a lot: why is that, what is missing? We have all the metadata of the customer, and we also know, out of, let's say, 10 different orders tables, which table is the right one to use and which column is the right one to use, because we can see the previous query history to determine which ones are being utilized the most, by whom, their query patterns, so on and so forth. The part that I've noticed is that as you get to the business-level questions, there's a lot more nuance underneath the question that's not always defined in the SQL layer or the physical data layer. These live in the logical layer: what do we mean when we say active in active users? When somebody asks a question like, what is the total number of contracts that are pending this quarter, an LLM has to understand and decide whether it should get pending contracts from a defined column called contracts pending, or whether it needs to look at a status column and do an aggregation based on the pending value. Those are very direct and simple reasons why having some of this definition and formula underneath a semantic layer can really be helpful, because those are not necessarily defined. And most of the time, when you build your data warehouse and your physical data models, you're thinking about reusability of data and the ways that data can be joined and queried together, not necessarily satisfying every single business question that could be asked and driven on that side. So this is a finding that I had recently.
That's why a semantic model is an important layer to have if you want to invest in an AI analyst that can generate and execute queries on behalf of business users. Sorry, that got a little bit long, but you also asked about the difference between semantic layer, metrics layer, and headless BI, and what that even means. I see them all as different but similar; it's all around that area of semantic models. The way that I think about it is that semantic models are the logical model of data. Most of the time, these are entity-based models, and entity-based modeling is different from physical entity modeling or Kimball modeling. In a semantic model, it is a lot more important that things are named in a unique manner, because for each field you will have to define not just a join condition or primary key and foreign key; sometimes you might want to define whether a relationship is one-to-many or many-to-many, that type of relationship. But that's just the model side. The semantic layer would be the implementation of those models. This can be done with the dbt semantic layer or Cube or AtScale; those are all semantic layer companies and tooling anyone can use to implement semantic models. And then I think you also mentioned the metrics layer. I think the metrics layer is almost the same as the semantic layer, but maybe focused on just calculated metrics, so you see primarily the core KPIs and the ones that always have some aggregation. I recently found out that in Snowflake's definition of metrics in their semantic view or semantic model YAML files, it is mandatory for any metric to always have an aggregation; if not, then it should be a measure with an expression, for example. So some people may get into that difference as well. And then last but not least, headless BI. I feel like I heard a lot more about headless BI during the modern data stack era, but headless BI is really just BI capabilities, like querying or exploration of data analysis, exposed as an API instead of having a visual UI that's tied to it. The biggest part about headless BI that I think about is that it can still be a BI, but there is a very clear separation of concerns between the visualization layer and the data layer with the semantic models.
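To make the metric-versus-measure distinction and the entity structure concrete, here is a minimal sketch, not taken from the episode, of how a semantic model might declare an entity with relationships, dimensions, a measure that carries only an expression, and a metric that must carry an aggregation. All table and field names are hypothetical, and the layout is illustrative rather than any vendor's exact schema.

```python
# A minimal, hypothetical semantic model: an entity with relationships,
# dimensions, a measure (expression only), and a metric (must aggregate).
# Table and column names are illustrative, not from any real schema.
import yaml  # pip install pyyaml

semantic_model = {
    "entities": [
        {
            "name": "contracts",
            "table": "analytics.fct_contracts",  # hypothetical physical table
            "primary_key": "contract_id",
            "relationships": [
                {"to": "customers", "type": "many_to_one",
                 "join_on": "contracts.customer_id = customers.customer_id"},
            ],
            "dimensions": [
                {"name": "status", "expr": "status",
                 "sample_values": ["pending", "active", "closed"]},
                {"name": "signed_quarter", "type": "time",
                 "expr": "DATE_TRUNC('quarter', signed_at)"},
            ],
            "measures": [
                # A measure carries an expression but no aggregation of its own.
                {"name": "contract_value", "expr": "amount_usd"},
            ],
            "metrics": [
                # A metric always carries an aggregation, so an AI analyst does
                # not have to guess whether "pending contracts" is a column
                # or a filter plus a count.
                {"name": "pending_contracts",
                 "aggregation": "count",
                 "expr": "contract_id",
                 "filter": "status = 'pending'",
                 "synonyms": ["open contracts", "contracts in flight"]},
            ],
        }
    ]
}

print(yaml.safe_dump(semantic_model, sort_keys=False))
```

Rendering the definition as YAML mirrors the file-based workflow described for tools like the dbt semantic layer, Cube, and Snowflake's semantic model files.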
[00:16:24] Tobias Macey:
And the other interesting aspect of all of these concepts of semantic modeling, semantic layer, headless BI, etcetera, is that data warehousing and business intelligence as a use case and an industry goal have been around for decades at this point. There are many established patterns for doing the data modeling, such as star schema, data vault, etcetera, as well as methods for being able to gain better performance in the form of things like OLAP cubes. All of those are intended to be built around the business entities, business objects, business concerns. And I'm curious what are the differences that are added on or the new capabilities that are introduced by virtue of using these semantic models or semantic layers that sit on top of the data warehouse and one level above the core star schema, data vault schema, etcetera?
[00:17:29] Shinji Kim:
Yeah. I would say the core difference that the semantic layer and semantic modeling really bring is the business logic perspective of how metrics should be calculated and what specific dimensions and time dimensions they can be applied to. It is technically doable from building a star schema or others too, but what we see is that it still happens on the set of physical data models, and the naming of things or the way that things get joined usually requires some level of aggregation or filtering, because the end dimension or the end measure is a representation of some type of business KPI or metric. So that's really the main difference as I would define it. And then there is the part around entities. If we think about data modeling, it really comes from understanding, first, the sources that we define, and then we may pull some of that source data to build the original core data models so that they can be joined and queried together to get the answers. At the same time, entity models, from the semantic layer perspective, usually come from how the business looks at data. So it starts from, what is revenue, just as an example. And within revenue, how do we define it? Are we doing it based on region or product line? And what are some of the exceptions that we might have to put in? It's driven by the business process and the reporting that you may need to do. So it's approaching from the other end of the spectrum, and that's why the entities usually end up looking a little bit different than the physical models they get designed from. I've definitely seen companies that do this without necessarily a quote unquote separate semantic layer like YAML files.
They've implemented their own views and tables and say, this is kind of our semantic model, and these are the semantic model tables that we should use for BI purposes, and so on and so forth.
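As a rough illustration of that hand-rolled pattern, the sketch below shows what such a business-facing semantic view might look like, with entirely hypothetical table and column names; the business logic (entity grain, join, and the definition of a pending contract) is baked into a warehouse view that BI tools and AI analysts can query directly, rather than living in a separate YAML file.

```python
# Hypothetical example of a "semantic model as a view": business logic
# (grain, join, and the definition of a pending contract) lives in SQL
# that BI tools and AI analysts can query directly.
SEMANTIC_CONTRACTS_VIEW = """
CREATE OR REPLACE VIEW semantic.contracts AS
SELECT
    c.contract_id,
    cust.region                                      AS customer_region,
    DATE_TRUNC('quarter', c.signed_at)               AS signed_quarter,
    c.amount_usd                                     AS contract_value,
    CASE WHEN c.status = 'pending' THEN 1 ELSE 0 END AS is_pending
FROM analytics.fct_contracts c
JOIN analytics.dim_customers cust
  ON c.customer_id = cust.customer_id
"""

if __name__ == "__main__":
    # In practice this would be executed against the warehouse through its
    # Python connector; printing keeps the sketch self-contained.
    print(SEMANTIC_CONTRACTS_VIEW)
```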
[00:19:53] Tobias Macey:
And now, bringing us back around to the role of AI in all of this modeling, the additional layers, the ways that we need to think about the presentation and access of data: what are some of the ways that building these semantic models helps when brought into the context of LLM use cases? Whether that is the English-to-SQL transformations, where everybody wants to be able to just talk to their BI, or being able to use your existing data assets to power things like RAG use cases, where you want to be able to feed the appropriate contextual information to the LLM at request time. And what are some of the ways that you need to think about the additional attributes that you want to feed into your semantic model when you are building it with AI in mind?
[00:20:42] Shinji Kim:
I think an easy way to think about the semantic layer for AI is to think of it as almost your configuration and guardrails that you're providing the AI so it knows how certain things should be calculated, rather than letting it infer from all the raw metadata and queries that you might have had in the past. I think that's the main difference. If you have those metrics, dimensions, and expressions of those relationships defined within the semantic model and the semantic layer, it's a lot easier and more straightforward for AI to just refer to that. And I'm not saying that it's impossible with RAG systems, having just the AI run on top of your metadata. Without a semantic layer, if you already have a really clear data model, you may not actually need one. But most of the time you're working on top of very messy data, where you get to a point where you're not sure which tables should be the certified tables for AI to query, because you're not sure if you've included all of them, and you might be feeling like, oh, I'm missing something, or I am including something that might introduce noise. So the semantic layer just gives you that clearer direction and guardrails. Almost like when you're using ChatGPT: if I'm asking it to write an email or paragraph or blog post, if I give it more structure, or the purpose, or the audience who I'm writing it for, it will give me a much better result. So I see the semantic layer playing that type of role for AI analysts.
At Select Star, as we were testing the semantic layer specifically for AI analysts, we've done a ton of iterations with Cortex Analyst on Snowflake in particular, and we've seen step-change differences in quality when you provide the semantic model, especially when you can provide a more complete semantic model that has not just the metadata, but also the relationships, which are the primary keys and foreign keys, what the join conditions actually look like, as well as which are the dimensions, time dimensions, and metrics, how the aggregations happen, and sample values. Things like this really make the end result a lot more accurate and closer to what you really intended than leaving the AI to do it as is. I think there is definitely some benefit of having just the RAG system as well, but when you are trying to get from 80% to 95-98%, having these definitions will play a big role.
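One way to picture that configuration-and-guardrails role is a small prompt-assembly step: instead of letting the model infer everything from raw metadata, the semantic model is serialized into the context handed to the LLM. The sketch below is a generic illustration with hypothetical field names, not Select Star's or Snowflake's actual implementation.

```python
# Hypothetical sketch: turn a semantic-model dict (tables, joins, dimensions,
# metrics, sample values) into a context block for a text-to-SQL prompt.
from typing import Any

def render_semantic_context(model: dict[str, Any]) -> str:
    lines: list[str] = []
    for entity in model.get("entities", []):
        lines.append(f"Entity {entity['name']} (table {entity['table']}, "
                     f"primary key {entity['primary_key']}):")
        for rel in entity.get("relationships", []):
            lines.append(f"  join {rel['type']}: {rel['join_on']}")
        for dim in entity.get("dimensions", []):
            samples = ", ".join(dim.get("sample_values", [])) or "n/a"
            lines.append(f"  dimension {dim['name']} = {dim['expr']} "
                         f"(sample values: {samples})")
        for metric in entity.get("metrics", []):
            lines.append(f"  metric {metric['name']} = "
                         f"{metric['aggregation'].upper()}({metric['expr']}) "
                         f"WHERE {metric.get('filter', 'TRUE')}")
    return "\n".join(lines)

def build_prompt(question: str, model: dict[str, Any]) -> str:
    # The LLM is told to answer only in terms of the declared entities,
    # dimensions, and metrics -- the "guardrails" described above.
    return (
        "Answer with a single SQL query.\n"
        "Use only the entities, joins, dimensions, and metrics below.\n\n"
        f"{render_semantic_context(model)}\n\nQuestion: {question}\n"
    )

# Example: build_prompt("How many contracts are pending this quarter?", semantic_model)
```

The point is simply that every join path, aggregation, and sample value the model declares is one fewer thing the LLM has to guess.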
[00:23:43] Tobias Macey:
And then another angle to the question of RAG and semantic models is the fact that virtually every database at this point has added some sort of vector storage capability and vector indexing, obviously at varying levels of capability and varying feature sets. But I'm curious how you've seen that additional data type incorporated into the semantic definitions, and ways that you're using it as an additional avenue for LLMs and AI systems to access the different semantic models, or as a starting point for accessing them and then understanding: okay, either, yes, this is exactly the model I want, and the k-nearest neighbor search or HNSW gave me the thing that was most applicable, or, hey, I landed on this model, but now that I see the additional contextual information, it's actually not what I want, and I need to start over.
[00:24:44] Shinji Kim:
Yeah. I think that's a very interesting point, and it is possible to, I guess, generate the semantic layer from the vector model that your database is providing. At the same time, the part that we found the most accurate and interesting is when you can bring in any historical usage data, whether that is coming primarily from analysts' SELECT queries or from the usage on the BI side. So I think that is definitely one way to get there. But again, most of the time, what defines the semantic layer, which is primarily the business logic side, is not always really parsable from just looking at the metadata that databases have.
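For the retrieval angle Tobias raises, one plausible routing step is a k-nearest-neighbor lookup over embedded descriptions of the available semantic models before any SQL is generated. This is a generic sketch under that assumption; the `embed` stub is a placeholder for whatever embedding call your vector-capable database or provider exposes, not a real API.

```python
# Hypothetical routing step: pick the semantic model whose description is
# closest to the user's question, then let the LLM work within that model.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: replace with your database's or provider's
    # embedding call. A bag-of-characters vector keeps the sketch runnable.
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def top_k_models(question: str, model_descriptions: dict[str, str], k: int = 3):
    q = embed(question)
    scored = [
        (name, float(np.dot(q, embed(desc))))  # cosine similarity (unit vectors)
        for name, desc in model_descriptions.items()
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

if __name__ == "__main__":
    descriptions = {
        "contracts": "Signed and pending customer contracts, value, status, quarter",
        "web_traffic": "Page views, sessions, and conversion events by channel",
    }
    print(top_k_models("How many contracts are pending this quarter?", descriptions))
```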
[00:25:32] Tobias Macey:
And then the other aspect of building semantic models is that it is a non zero effort required to actually create and maintain them. It requires a certain amount of contextual knowledge about the business, the specifics about what the data means, how it's being applied, and some of the cases where it can be misapplied. And I'm curious how you're seeing teams incorporate LLMs into that discovery and development piece of the puzzle for being able to actually accelerate the creation and reduce some of the manual toil involved with the maintenance of those definitions.
[00:26:11] Shinji Kim:
Yeah, for sure. And I think the tooling is really starting to improve in this respect as well. But even just by using Claude or ChatGPT, we've seen teams building their semantic layer by feeding in how they define their metrics, along with the list of tables and columns that exist, to output the first version of the YAML files that they wanted to create. So constructing the models and also maintaining those models, it is possible. The part that I would say is really hard today is the continued maintenance of the models, as well as having those models stay true as they get used, which, just like any AI agents or applications, requires monitoring and evaluation along with the versions of the semantic models that you're implementing. But the part that we see as a really good place to start, and this is the part that we've been spending a lot of time on in helping our customers implement semantic models, is starting from the semantic models that you might have already implemented within your BI tool. Within your BI tools, the ones that are your certified dashboards, the dashboards that your end users and business stakeholders are using, a lot of them are already connected to the critical data elements of your data warehouses and data lakes. That will give you the subset of the tables that should be part of the semantic layer. I think that's a really great place to start, because you will also see that those tables are well maintained and have the quality checks built in. So if you can feed the LLM with that list of metadata, I think that is a good place to start. And then you can build upon that as you look at other dashboards or update it as those dashboards also evolve on the business side.
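A rough sketch of that bootstrapping idea, assuming you can already export which upstream tables each certified dashboard depends on (all names below are hypothetical): the output is a skeleton semantic model that a human or an LLM then fills in with dimensions, metrics, and relationships.

```python
# Hypothetical bootstrap: start the semantic model from the tables behind
# certified dashboards, since those are the vetted, quality-checked subset.
import yaml  # pip install pyyaml

certified_dashboards = {
    "Executive Revenue": ["analytics.fct_contracts", "analytics.dim_customers"],
    "Retention Overview": ["analytics.fct_contracts", "analytics.dim_products"],
}

def skeleton_semantic_model(dashboards: dict[str, list[str]]) -> dict:
    tables = sorted({t for tbls in dashboards.values() for t in tbls})
    return {
        "entities": [
            {
                "name": table.split(".")[-1],
                "table": table,
                "dimensions": [],  # to be filled in by a human or an LLM
                "metrics": [],     # seeded later from the dashboards' measures
            }
            for table in tables
        ]
    }

print(yaml.safe_dump(skeleton_semantic_model(certified_dashboards), sort_keys=False))
```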
[00:28:36] Tobias Macey:
The other piece of the equation when dealing with semantic layers is which underlying technology you should use for actually being able to maintain and expose the models that you define, where most of them are using some sort of YAML definition: this is the SQL query that translates to this particular metric, maybe these are the parameters that can be fed in to aggregate along different axes, etcetera. And there are different engines to be able to actually execute those. You already mentioned Cube as one of the leading ones, which is also being marketed as a headless BI system. The folks at Soda Data have a semantic layer capability.
Obviously, DBT, has acquired a, a metrics layer, and they've incorporated that as part of their product suite. I'm curious what you see as some of the useful heuristics and selection criteria for teams who are starting to evaluate how am I actually going to build and expose these semantic models.
[00:29:40] Shinji Kim:
Yeah. That's a great question. First of all, the BI side of the house most of the time does have some type of semantic model, their own semantic layer that is proprietary, so it's just hard to get that logic out. But I think there are some BI tools that allow you to export it, or at least let you maintain and see it. LookML is that case, and Hex has a way to implement or accept other semantic models. And then for Power BI and Tableau, these are things that you will have to stitch together, but through the API you should be able to get the business logic that's built underneath. And that's primarily how at SelectStar we are reverse engineering decent semantic models and business logic from the BI tool as well. Now, if we're talking about third-party semantic layers, yes, the ones that you've mentioned are the primary ones. There is also a vendor called AtScale.
I found the approach that Snowflake and Databricks are taking also very interesting. If you think about the semantic layer from the perspective of dbt and Cube, for example, these are a layer that still runs its queries and execution on top of the data lake and data warehouse. You have decoupled it out of BI, but now you put that onto more of the transformation layer, or some other place where you still have to define the YAML file, on a third party on top of the data warehouse or lakehouse that you have. What Snowflake is doing is, first, they started with the YAML file as a semantic model to be fed into Cortex Analyst. And now they are moving towards allowing users to have a semantic view that contains all of these components of a semantic layer. It's actually a view that you can create out of the previous YAML file for semantic models, but it's a view that can be queried and acts and lives like other tables and views in your schema. I thought that was really interesting because it's a lot more native in that case, and it also follows the same security measures you have, the roles and permissions and dynamic masking and tagging if you have them, and that all gets applied natively. So I thought that was a really interesting approach. Now, the semantic view right now is under public preview, so it's very early, but I think there is a really optimistic promise there if you can have that natively on the data warehouse side, and then you can also just point your BI dashboards to run on top of that. And for Databricks, I don't think there is public documentation on this yet because it's so early, but they have metrics in Unity Catalog that look very similar to something like the semantic view. I think they've also recognized the need for a semantic layer on top of Unity Catalog in order to power Databricks Genie for AI/BI.
So it will be really interesting to see what that will look like and how it really impacts AI analyst performance. One part that I think is another worthwhile mention is companies like Cube. One of the added benefits that I've seen from the semantic layer, and I think one of the reasons why dbt brought on Transform and now has a metrics layer as a proprietary feature, is that once you have that definition, almost like an OLAP cube, you can potentially cache the calculations or aggregations that are happening, because it's defined and you know it's going to get queried. And you can make those queries a lot more efficient because it's already defined that way. I think that is right now an added benefit when you have a semantic layer like Cube today, but it should be more of the table stakes, and it's one of the reasons why more companies should consider having a semantic layer in their toolchain, I would say.
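To illustrate why declared metrics make that kind of caching possible, here is a toy sketch, not how Cube or the dbt semantic layer actually implement it: because a metric's aggregation, grouping dimensions, and filters are known up front, repeated requests can be served from a cache keyed on exactly those declared pieces.

```python
# Toy illustration: because a metric's aggregation, dimensions, and filters
# are declared up front, the serving layer can precompute and cache them.
from typing import Callable

class MetricCache:
    def __init__(self) -> None:
        self._cache: dict[tuple, object] = {}

    def get_or_compute(self, metric: str, dimensions: tuple[str, ...],
                       filters: tuple[str, ...], compute: Callable[[], object]):
        key = (metric, dimensions, filters)
        if key not in self._cache:
            # First request runs the (expensive) warehouse query...
            self._cache[key] = compute()
        # ...subsequent identical requests are served from the cache.
        return self._cache[key]

cache = MetricCache()
result = cache.get_or_compute(
    "pending_contracts", ("customer_region",), ("signed_quarter = '2025-Q1'",),
    compute=lambda: {"EMEA": 42, "AMER": 57},  # stand-in for a real query result
)
print(result)
```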
[00:34:11] Tobias Macey:
One of the things that you mentioned earlier as far as the work that you're doing when working with systems like Tableau or Power BI is that you said that because you're able to ingest the metadata from those business intelligence systems, you can reverse engineer some of the business logic that goes into the underlying dashboards, etcetera. And I'm curious how you're thinking about that as a potential on ramp for data teams who want to more explicitly define these semantic models in a separate technology layer that is divorced from the data visualization system so that they can expose it to more use cases and expose it to more of these LLMs, etcetera?
[00:34:51] Shinji Kim:
Yeah. That is definitely the main use case that we're trying to support. So today, we are actually providing the YAML file to the end customers so that they can modify it as they want or put it in their GitHub. The role that we initially play today is to give you that quick start and bootstrap the semantic layer without having to do the hard work of looking at all the physical models to figure out what the metrics and dimensions should be, and also filling in the details to make it work. Where we see this going, as we start putting more support into providing this semantic layer YAML file in many different formats, whether that's for the dbt semantic layer, Cube, or Databricks, and so on, is continuing to also provide a system that can maintain and update the semantic layer portion.
This part, I think, is important. On the BI side, one of the things you end up running into is multiple different teams defining their own KPIs and metrics definitions that differ from one another. So there's a discovery aspect: people may not be aware of a metric that's already been defined, or they might not have used it in the past. That is one of the roles that we see we can also help with for our customers. So today, when we are generating metrics, we will indicate to the user: hey, there is another metric that looks exactly the same. It might even be within your same dashboard; they are named differently, but the underlying measure and dimension are exactly the same. So would you like to combine these, or have them under this name, and here is the documentation? That's something we saw really adding value, because it allows customers to understand how the metrics and the data used on the business side have proliferated on their own, and it gives them the opportunity to really organize them and have a good single source of truth for how they want to represent the business side as well.
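As a simplified picture of that de-duplication step, and not Select Star's actual algorithm, metrics can be normalized into a canonical signature of aggregation, expression, and dimensions, so that two differently named metrics with identical logic surface as candidates to merge.

```python
# Hypothetical duplicate-metric check: two metrics with the same normalized
# aggregation, expression, and dimensions are flagged as likely duplicates.
from collections import defaultdict

def _norm(value) -> str:
    return " ".join(str(value).lower().split())

def signature(metric: dict) -> tuple:
    return (
        _norm(metric.get("aggregation", "")),
        _norm(metric.get("expr", "")),
        tuple(sorted(_norm(d) for d in metric.get("dimensions", []))),
    )

def find_duplicates(metrics: list[dict]) -> list[list[str]]:
    groups = defaultdict(list)
    for metric in metrics:
        groups[signature(metric)].append(metric["name"])
    return [names for names in groups.values() if len(names) > 1]

metrics = [
    {"name": "pending_contracts", "aggregation": "count",
     "expr": "contract_id", "dimensions": ["signed_quarter"]},
    {"name": "open_contract_count", "aggregation": "COUNT",
     "expr": "Contract_ID", "dimensions": ["Signed_Quarter"]},
]
print(find_duplicates(metrics))  # -> [['pending_contracts', 'open_contract_count']]
```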
[00:37:06] Tobias Macey:
And in your work of building Select Star, working with some of your customers, helping them to bootstrap these semantic models, what are some of the most interesting or innovative or unexpected ways that you've seen that new single source of truth, the semantic representation applied either with or without AI use cases?
[00:37:26] Shinji Kim:
Yeah. In the beginning, for any customers that were using semantic models or had implemented the semantic layer, we saw the production use cases primarily for embedded dashboards, but not for their core analytics. It's really with the AI analyst that more companies are considering semantic layers for their core data marts and analytics use cases. What we are starting to envision with some of our customers is how these semantic models can be utilized not just for the AI analyst that they've defined, but for AI agents that they are building, which could be related to, say, an MCP client. How this can be embedded in their application layer is another direction that we are starting to see customers ideating towards. But the core use case has been answering business questions. And just on that, we've seen customers that want to, and are, deploying this so that their revenue ops teams or marketing teams can start using Cortex Analyst instead of always having to go directly to their analytics teams, as well as companies running retail chains, where branch managers can check how their store is doing and ask more strategic questions about what they could do differently or better for the specific location they're managing, for example.
[00:39:09] Tobias Macey:
And in your work of building Select Star, working with end users and data teams who are tackling the semantic modeling challenge, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:39:23] Shinji Kim:
I think two things. One is starting with the model and getting the first POC, whether that's through the embedded dashboard case or just testing whether your first set of business questions gets answered. That part is easy. Now the scaling side, and the actual governance and maintenance: as your business models change, as you start to add more new datasets or create more models underneath, how do you actually govern and maintain those? I think that is a real challenge that is upcoming. Related to that, the second part I was going to mention is that, just like any data modeling, with semantic models you can fall into the trap of over-modeling the data, and that can be a big time suck. So being able to actually put it in practice and continue to iterate is an important part, because we've also seen some of our customers that implemented a semantic layer in the past spend a lot of effort to model everything.
But how much of their model is actually being leveraged and used today by AI or by their querying? You know, not 100%. And then their data team got very tired of having to keep updating it, or some of it just wasn't fully updated. And once it goes out of date, it's not relevant anymore. So that's one main area where we believe in having a more systematic way to continuously update the semantic model. If it's just a file, we feel like it's a little bit harder; if it can be a system or a view, or if there are ways through an API to update that model, I think that's a much better way. As an industry, we have some ways to go there.
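One lightweight way to catch that kind of drift, sketched here against a hypothetical schema snapshot rather than a live information-schema query, is to diff the columns a semantic model references against what currently exists in the warehouse and flag anything stale.

```python
# Hypothetical staleness check: flag semantic-model fields whose underlying
# columns no longer exist in a snapshot of the warehouse schema.
def stale_references(model: dict, warehouse_columns: dict[str, set[str]]) -> list[str]:
    problems = []
    for entity in model.get("entities", []):
        existing = warehouse_columns.get(entity["table"], set())
        for field in entity.get("dimensions", []) + entity.get("measures", []):
            expr = field.get("expr", "")
            # Only simple column expressions are checked in this sketch.
            if expr.isidentifier() and expr not in existing:
                problems.append(f"{entity['table']}.{expr} ({field['name']})")
    return problems

model = {"entities": [{"table": "analytics.fct_contracts",
                       "dimensions": [{"name": "status", "expr": "status"}],
                       "measures": [{"name": "contract_value", "expr": "amount_usd"}]}]}
snapshot = {"analytics.fct_contracts": {"contract_id", "status"}}  # amount_usd dropped
print(stale_references(model, snapshot))
# -> ['analytics.fct_contracts.amount_usd (contract_value)']
```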
[00:41:19] Tobias Macey:
And for teams who are starting to think about tackling this body of work, what are the situations where you would say that building a semantic model or investing in a semantic layer is the wrong choice and the reward is not going to be worth the cost?
[00:41:33] Shinji Kim:
I think it always comes down to: is there a use case that the physical model currently cannot solve? Because if it doesn't matter how well you document your data, you're still not getting the right SQL query back from your AI, then that's definitely one reason to implement a semantic model. If not, if the physical model is clean and simple enough, then you don't always have to do this. Otherwise, it's really about where the end usage would come from to make it worthwhile, which goes back to: can you test it out with the end users that are going to leverage this? And from a use case perspective, AI should definitely be in mind. I just haven't seen other use cases that truly give you the ROI of doing the semantic model implementation beyond that, just because data teams are always very busy getting their work done as well.
[00:42:32] Tobias Macey:
You mentioned some of the work that Snowflake and Databricks are doing to incorporate semantic modeling more closely into the core experience of the underlying warehouse engine. I'm wondering what you see as the potential future for this concept of semantic modeling becoming more of a native construct of the underlying data layer or compute engines and maybe any other areas of expansion or ecosystem investment that you would like to see?
[00:43:05] Shinji Kim:
Yeah. I think that's a good way to put it. Having that native construct where the data exists is a very interesting movement for the industry. We'll have to see how well it improves as more customers adopt it. The other big part that I would love to see, and that I predict will need to happen, is more of an automated, operational way to update the semantic models, especially if I think about what semantic models will look like in the future, when the end users are primarily AI agents and AI applications, not humans. Today, semantic models are fully designed to be consumed, operated, and edited by humans, not AI agents. So a way to self-update the model, as well as maintain it so that it is more of a system instead of a YAML file, is probably where it needs to go. Now, the other piece that I think would also be important is the integration outside of the semantic model itself. One of the biggest things a lot of semantic layer companies provide is centralizing and consolidating: it doesn't matter which source and destination you're talking about, you should be able to query it using the same DSL or natural language. I think the integration between the BI tools, where the business logic all lives, and the user interaction, and how that translates into, impacts, and self-updates the semantic layer side, would be the more idealistic future that I would envision.
[00:44:46] Tobias Macey:
Yeah. I think that the fact that semantic models as a technology have so far largely been manifested as these collections of YAML files means, well, it has a fairly low barrier to entry. I think it also speaks to the lack of maturity in the space, where that is effectively the only way that you can represent them, and they are not a core concern of the underlying data systems. They're more of a bolt-on addendum, and I would like to see them become more of an integrated piece of the actual core compute capability. That's right. Yeah. Well, that's very well said. Are there any other aspects of semantic modeling, the ways that it empowers more of these AI systems and AI use cases, or the work that you're doing at SelectStar to help teams address these complexities, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:42] Shinji Kim:
I think anyone that's actually considering implementing this for AI use will really need to make sure that they are not just considering the definitions of the tables, columns, dimensions, or metrics. The other aspects that we found very helpful and important to add to the semantic layer implementation were, first, the relationships of the entities, or relationships between the columns, as well as the sample values and the synonyms. From the metadata perspective, we initially weren't sure how much these would impact the model, but from the AI perspective, these are things that really made a big difference in the quality of the results that we've gotten. So I wanted to share that as a quick tip for whoever wants to go run a semantic layer for AI use. And I also say this because some of the comments I've gotten in the past when I've talked about this on LinkedIn were, oh, I tried it but it didn't really work, the AI analyst isn't that good. And I would say that's mainly because you really have to look at what your semantic models look like. The quality of the semantic model determines a huge range of what the AI will do, based on how complete your semantic layer is.
[00:47:08] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and the ways that you're helping to address this challenge of semantic modeling, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:30] Shinji Kim:
On the data management and data governance side, having now been in the business for more than five years, we see the biggest gap, after getting a good amount of tooling in place, is always around actually implementing it and operationalizing it, putting it into processes, and having everyone actually leverage the tool. I can't say this is a segment of tooling that we need in the market; it's more in the area of AI agents and how internal teams can embed the tooling more into the day to day of the employees and everyone they work with. So that's what I would say.
[00:48:16] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences and hard-won lessons about semantic modeling, particularly in this context of AI systems and the ways that AI is being applied to the challenges of data analysis and end user use cases. I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thanks so much, Tobias. This was great. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details. Your host is Tobias Macy, and today, I'd like to welcome back Shinji Kim to talk about the role that semantic layers are playing in the era of AI. So, Shinji, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:00:59] Shinji Kim:
Sure. Well, thanks for having me back here, Tobias. Really excited to chat with you again. So hi, everyone. I'm Shinji Kim. I'm the founder, CEO of SelectStar. SelectStar is a automated data discovery and governance platform, for cloud data warehouses, data lakes, and pretty much all your data ecosystem. That's kind of what we do primarily, and, yeah, I think through data engineering podcast, we've been chatted about overall, like, needs of data discovery where it can be applied for both data democratization and data governance in the past. And, yeah, I'm excited to dive into, more of the other use cases that we are starting to find with this world of metadata and metadata management.
[00:01:46] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:50] Shinji Kim:
Yes. It was a long time ago. Back in 02/2007, I was a data scientist at or, at the time, the title was statistical analyst at Sun Microsystems. I built a mark of chain Monte Carlo models for sales forecasting, and also, kind of like almost like a interactive dashboards that sales and operations teams can use for projecting their quarter by quarter sales models. That's kind of how I got into data. And, I guess the rest is history. I've worked with, number of startups as a product manager as well as a software engineer. In the past, I started a company prior to Select Star called Concourse Systems focused on distributor stream processing, which was acquired by Akamai, back in, 2016 and, started Select Star, five years ago. I would say the the biggest part that kind of got me to where I am, especially related to data discovery and data governance and data democratization, is because I've noticed a lot of companies, especially enterprises, spending a lot of effort and resources on collecting, storing, and running computes on data, building up all the data lakes and systems and, buying all the tools. But the end users, whether that's a data analyst or product managers, folks that wants to gain answers from their data or build a new product on top of the data, usually spend end up taking weeks to find the right data that they could use for those purposes. And this is why I, started SelectStar five years ago and, has been the main capability, that's been, driving the core of our product, which is around data discovery, providing the context, of data and your data ecosystem.
[00:03:44] Tobias Macey:
And the last time we talked, data discovery, metadata management, those were very active topics. There There were a number of different companies and open source projects entering the ecosystem around that time, particularly because of the rise of the modern data stack and the number of different tools that were being brought in to work with data, the variety of data sources that were being brought into the context of data warehousing and business intelligence. And, obviously, the term modern data stack has faded from use, and a number of the companies that started around that time have either changed focus or been acquired or they've ceased to be in operation anymore.
And now there's the age of AI that is adding new stressors to the different data stacks that teams are running as well as the requirements around contextual information, data discovery. And I'm curious what you've seen as some of the main shifts in the ecosystem and in your business over the period since we last spoke.
[00:04:47] Shinji Kim:
Yeah. That's a great question and also a big question. A lot of things has happened in the last, two years in the data ecosystem and the specific area that, we play in. First and foremost, we do see a lot of data teams that have been, I guess, more efficient in their operations in terms of their, operational perspective, tooling, as well as how some of the customers that we have worked with just over the last two years, they have really gone from very centralized data team to more decentralized data team. And we've also seen the other way of the shift as well, kind of more decentralization to more hybrid. So just overall, like, we are seeing a more shift from the phase between the data engineering teams, analytics engineering teams, and then data analyst slash BI teams, how they work together.
We are starting to see that. This is also driven by lot of advancements and new, features that are being added and, being more consolidated on the platform side. So both Snowflake, Databricks, DBT, a lot of these platforms are starting to provide a lot more capabilities than the specific data warehousing or the transformation capabilities, such as documentation, data catalog, lineage. A lot of these are starting to get merged into as, those platforms also as features. So there is definitely the market shift from independent vendors, you know, and, for the, companies having the best of breed tools versus, getting a, more of a platform support, I would say, has definitely come up number of times. I would say, though, in terms of, like, just kind of, like, bubble it back up on where we stand at Select Star. From the beginning, we believe that, providing this type of single source of truth of your metadata, how your data is being created, being transformed, and being utilized within your organization is not something that you should only get the information from one platform. Most of the companies use multiple platforms and would require the cross platform visibility, a way to manage and gain insights cross platform as your whole data ecosystem. So we are starting to provide a lot more capabilities where you can truly manage and govern those information across platform and then also sharing, certain metadata from one platform to another. So we can be the glue or the bridge, and you as an end user are just have to work with one platform on the metadata management perspective. So that's one part of it. The other big part of it that I guess I missed out left out here was the AI.
So a ton more services and products that I see in the market that are from, like, AI analyst or AI, data engineer to a lot of, I guess, features around copilot, type of things that we are definitely seeing a lot. From SelectStar perspective, we also had a lot of updates along with AI, including automatically documenting all of your data assets without you having to lift up the finger. But you can also merge and easily approve what is the right documentation to providing with you, an AI assistant that can do semantic search, answer any questions that you might have your on your data, create SQL queries, or editing your SQL queries that you're trying to execute but may have, some block head in and especially with, you know, AI really getting better at understanding natural language and then being able to also do more, direct translation, to the code and SQL that, we use day to day. We are starting to I am starting to see this expansion for how data can is starting to get leveraged even beyond the data teams, with this trend. So this is definitely one area there that we are starting to see a lot of traction and, have a lot of features that, we've built towards supporting this true data democratization and self-service analytics, to enable, more people to use data.
[00:09:15] Tobias Macey:
Semantic modeling as a concept and as a term has gained a lot of attention, particularly around four or five years ago with the growth of the modern data stack and the idea that you wanted to have one canonical source of truth for the key business metrics that previously lived in the BI system, and now you wanted to be able to use across different data clients or data consumption use cases. And there were also a number of overlapping terms around that where there was the semantic layer, the metrics layer, headless BI. And I'm wondering what you see as the overlap across those terms and whether there's any notable distinction between them as far as the actual application of those ideas.
[00:10:04] Shinji Kim:
I guess to start with the motivation side of the semantic layer, like you mentioned, there is the idea that we can virtualize all the data sources and you can use one thing to query data. That is definitely still there, and there are a lot of companies that do this, although not always through a semantic layer; sometimes it's just ways to translate SQL, since most of the time those queries are designed primarily for physical data querying. I would say that with the modern data stack, the part that really bubbled up is defining that single source of truth metric. When you say revenue and when I say revenue, are we actually talking about the same number? Do we actually get the same result? I think that's one of the main reasons a lot of companies wanted to implement a semantic layer. Now, what I've been seeing in the last three to six months or so is teams starting to really leverage a semantic layer for AI analysts or AI agents, to provide better analysis or text-to-SQL from the business perspective. Pre semantic layer, with LLMs, and we've seen this a lot in SelectStar, customers ask a very semantic type of question, we give them a SQL query to run, and that's all great. But it only gets you so far, maybe 75 to 80%.
And that's mainly because, and we've thought about this a lot, why is that? What is missing? We have all the metadata of the customer, and out of, let's say, 10 different orders tables, we also know which table is the right one to use and which column is the right one to use, because we can see the previous query history to determine which ones are being utilized the most, by whom, and with what query patterns. The part that I've noticed is that as you get to business-level questions, there is a lot more nuance underneath the question that's not always defined in the SQL layer or the physical data layer. These live in the logical layer: what do we mean when we say active in "active users"? When somebody asks, give me the total number of contracts that are pending in this quarter, the LLM has to understand whether it should get that from a defined column called "contracts pending", or whether it needs to look at a status column and do an aggregation based on a pending value. Those are very direct and simple reasons why having this definition and formula underneath a semantic layer can really be helpful, because those are not necessarily defined anywhere else. And most of the time, when you build your data warehouse and your physical data models, you're thinking about reusability of data and the ways that data can be joined and queried together, not necessarily about satisfying every single business question that could be asked. So this is a finding that I had recently.
That is why the semantic model is an important layer to have if you want to invest in an AI analyst that can generate and execute queries on behalf of business users. Sorry, that got a little long, but you also asked about the difference between semantic layer, metrics layer, and headless BI, and what those even mean. I see them all as different but similar; it's all around the area of semantic models. The way I think about it is that semantic models are the logical model of the data. Most of the time these are entity-based models, and entity-based modeling is different from physical entity modeling or Kimball modeling. In a semantic model, it is much more important that things are named in a unique manner, because for each field you may have to define not just a join condition or a primary key and foreign key, but sometimes whether the relationship is one-to-many or many-to-many. But that's just the model side. The semantic layer is the implementation of those models, which can be done on the dbt semantic layer, Cube, or AtScale; those are all semantic layer companies and tooling anyone can use to implement semantic models. Then you also mentioned the metrics layer. I think the metrics layer is almost the same as the semantic layer, but focused on just calculated metrics, so primarily the core KPIs and the values that always have some aggregation. I recently found out that in Snowflake's definition of metrics in their semantic view or semantic model YAML files, it is mandatory for any metric to always have an aggregation; if not, then it should be a measure with an expression, for example. So some people may get into that distinction as well. And then last but not least, headless BI. I heard a lot more about headless BI during the modern data stack era, but headless BI is really just BI capabilities, like querying or exploration for data analysis, exposed as an API instead of having a visual UI tied to it. The biggest part about headless BI, to me, is that it can still be a BI, but there is a very clear separation of concerns between the visualization layer and the data layer with the semantic models.
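To make that metric-versus-measure distinction concrete, here is a minimal Python sketch of a semantic model expressed as plain data, with a check that mirrors the rule described above: a metric must carry an aggregation, otherwise it belongs with the measures as an expression. The field names (entities, measures, metrics, relationships) are illustrative and not tied to any particular vendor's YAML schema.

```python
# Hypothetical sketch: a minimal semantic model as plain Python data,
# plus a check that every "metric" carries an aggregation; otherwise it
# should be defined as a measure with an expression instead.
semantic_model = {
    "entities": [
        {
            "name": "orders",
            "primary_key": "order_id",
            "relationships": [
                # many orders belong to one customer (illustrative)
                {"to": "customers", "type": "many_to_one",
                 "join": "orders.customer_id = customers.customer_id"},
            ],
            "dimensions": [{"name": "order_date", "type": "time"}],
            "measures": [{"name": "order_amount", "expr": "amount_usd"}],
            "metrics": [
                {"name": "total_revenue", "aggregation": "sum",
                 "measure": "order_amount"},
            ],
        }
    ]
}

def validate(model: dict) -> list[str]:
    """Flag metrics that are missing an aggregation."""
    problems = []
    for entity in model["entities"]:
        for metric in entity.get("metrics", []):
            if not metric.get("aggregation"):
                problems.append(
                    f"{entity['name']}.{metric['name']}: metrics need an "
                    "aggregation; define it as a measure with an expression instead"
                )
    return problems

print(validate(semantic_model) or "model looks consistent")
```

The same information would normally live in a YAML file consumed by whichever semantic layer engine is in use; the dictionary form above is only meant to show which pieces of definition are involved.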
[00:16:24] Tobias Macey:
And the other interesting aspect of all of these concepts of semantic modeling, semantic layer, headless BI, etcetera, is that data warehousing and business intelligence as a use case and an industry goal have been around for decades at this point. There are many established patterns for doing the data modeling, such as star schema, data vault, etcetera, as well as methods for being able to gain better performance in the form of things like OLAP. All of those are intended to be built around the business entities, business objects, business concerns. And I'm curious what are the differences that are added on or the new capabilities that are introduced by virtue of using these semantic models or semantic layers that sit on top of the data warehouse and one level above the core star schema, data vault schema, etcetera? Yeah. I would say the core difference that semantic layer and semantic modeling really brings on
[00:17:29] Shinji Kim:
is the business logic perspective of how metrics should be calculated and which specific dimensions and time dimensions they can be applied to. It is technically doable by building star schemas or other models too, but what we see is that that still happens on the set of physical data models, in the naming of things or the way things get joined. It usually requires some level of aggregation or filtering, because the end dimension or the end measure is a representation of some type of business KPI or metric. So that's really the main difference, as I would define it. Then there is the part around entities. Data modeling really starts from understanding the sources that we define, and then we may pull some of that source data to build the initial physical data models so that they can be joined and queried together to get answers. Entity models, from the semantic layer perspective, usually come from how the business looks at data instead. It starts from, just as an example, what is revenue? And within revenue, how do we define it? Are we breaking it down by region or product line? What exceptions might we have to put in? It's driven by the business process and the reporting that you may need to do. So it approaches from the other end of the spectrum, and that's why the entity models usually end up looking a little different from the physical models they get designed from. I've definitely seen companies that, without necessarily having a quote unquote separate semantic layer like YAML files,
have implemented their own views and tables and called that their semantic model: these are the semantic model tables that we should use for BI purposes, and so on. And now bringing us back around to the role of AI in all of this
[00:19:53] Tobias Macey:
modeling, the additional layers, the ways that we need to think about the presentation and access of data. What are some of the ways that building these semantic models helps when brought into the context of LLM use cases? Whether that is the English-to-SQL transformations, where everybody wants to be able to just talk to their BI, or being able to use your existing data assets to power things like RAG use cases, where you want to feed the appropriate contextual information to the LLM at request time. And what are some of the ways that you need to think about the additional attributes that you want to feed into your semantic model when you are building it with AI in mind? I think an easy way to think about the semantic layer
[00:20:42] Shinji Kim:
for AI is as the configuration and guardrails that you're providing so the AI knows how certain things should be calculated, rather than letting it infer that from all the raw metadata and queries that you might have had in the past. I think that's the main difference. If you have those metrics, dimensions, and the expressions for those relationships defined within the semantic model, in a semantic layer, it is a lot easier and more straightforward for the AI to just refer to that. I'm not saying that it's impossible with RAG systems, having the AI run just on top of your metadata; and if you already have a really clear data model, you may not actually need a semantic layer. But most of the time you're working on top of very messy data, where you get to a point where you're not sure which tables should be the certified tables for the AI to query, because you're not sure if you've included all of them. You might be feeling like you're missing something, or that you're including something that might introduce noise. So the semantic layer just gives you that clearer direction and guardrails. It's almost like when you're using ChatGPT: if I'm asking it to write an email or a paragraph or a blog post, and I give it more structure, the purpose, or the audience I'm writing for, it will give me a much better result. I see the semantic layer playing that type of role for AI analysts.
At SelectStar, as we were testing the semantic layer specifically for AI analysts, we've done a ton of iterations with Cortex Analyst on Snowflake in particular, and we've seen step-change differences in quality when you provide the semantic model. Especially when you can provide a more complete semantic model that has not just the metadata but also the relationships, which are the primary keys and foreign keys, what the join conditions actually look like, as well as which are the dimensions, time dimensions, and metrics, which aggregations apply, and sample values. Things like this really make the end result a lot more accurate and much closer to what you really intended than leaving the AI to work with the data as is. There is definitely some benefit to having just the RAG system as well, but when you are trying to get from 80% to 95 or 98%,
having some of these definitions in place will play a big role.
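As a rough illustration of that "completeness" idea, here is a small Python sketch that checks which of the elements mentioned above (relationships with join conditions, dimensions, time dimensions, metrics, sample values) are present in a semantic model before it is handed to an AI analyst. It reuses the same illustrative dictionary shape as the earlier sketch and is not any vendor's validation API.

```python
# Hypothetical completeness check for an AI-ready semantic model.
# The keys checked here follow the illustrative structure used earlier,
# not a specific product's schema.
REQUIRED_PER_ENTITY = ["relationships", "dimensions", "metrics", "sample_values"]

def completeness_report(model: dict) -> dict[str, list[str]]:
    """Return, per entity, the enrichment fields that are missing or empty."""
    report = {}
    for entity in model.get("entities", []):
        missing = [key for key in REQUIRED_PER_ENTITY if not entity.get(key)]
        # Time dimensions get a dedicated check, since they drive "this quarter"
        # style questions that business users tend to ask.
        if not any(d.get("type") == "time" for d in entity.get("dimensions", [])):
            missing.append("time_dimension")
        report[entity["name"]] = missing
    return report

example = {
    "entities": [
        {
            "name": "contracts",
            "relationships": [{"to": "customers", "type": "many_to_one"}],
            "dimensions": [{"name": "signed_at", "type": "time"},
                           {"name": "status", "sample_values": ["pending", "active"]}],
            "metrics": [{"name": "pending_contracts", "aggregation": "count"}],
            # no entity-level sample_values -> will be flagged
        }
    ]
}

print(completeness_report(example))  # {'contracts': ['sample_values']}
```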
[00:23:43] Tobias Macey:
And then another angle to the question of RAG and semantic models is the fact that virtually every database at this point has added some sort of vector storage and vector indexing capability, obviously at varying levels of capability and with varying feature sets. I'm curious how you've seen that additional data type incorporated into the semantic definitions, and how you're using it as an additional avenue for LLMs and AI systems to access the data, or as a starting point for discovering the different semantic models: either, yes, this is exactly the model I want, and the k-nearest-neighbor search or HNSW gave me the thing that was most applicable; or, hey, I landed on this model, but now that I see the additional contextual information, it's actually not what I want, and I need to start over.
[00:24:44] Shinji Kim:
Yeah. I think that's a very interesting point, and it is possible to, I guess, generate the semantic layer from the vector model that your database is providing. At the same time, the part that we found most accurate and interesting is when you can bring in historical usage data, whether that's coming primarily from analysts' SELECT queries or from usage on the BI side. So I think that is definitely one way to get there. But again, most of the time, what defines the semantic layer, which is primarily the business logic side, is not always parsable from just the metadata that the databases have.
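To sketch what that retrieval starting point could look like, here is a toy Python example that indexes short descriptions of semantic models with a trivial hashed bag-of-words embedding and returns the nearest model for a question via cosine similarity. In practice you would swap the toy `embed` function for a real embedding model and the list for your database's vector index; the model names and descriptions are made up for illustration.

```python
import hashlib
import math

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding; stand-in for a real embedding model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Short natural-language descriptions of each semantic model (illustrative).
models = {
    "revenue": "total revenue by region and product line, monthly and quarterly",
    "contracts": "contracts by status, pending or active, counts per quarter",
    "active_users": "active users daily and monthly, by plan tier",
}
index = {name: embed(desc) for name, desc in models.items()}

def nearest_model(question: str) -> str:
    q = embed(question)
    return max(index, key=lambda name: cosine(q, index[name]))

print(nearest_model("how many contracts are pending this quarter"))  # expected: contracts
```

Retrieval like this only narrows the search to a candidate model; as noted above, the business logic inside the model still has to come from somewhere other than the database's own metadata.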
[00:25:32] Tobias Macey:
And then the other aspect of building semantic models is that it is a non zero effort required to actually create and maintain them. It requires a certain amount of contextual knowledge about the business, the specifics about what the data means, how it's being applied, and some of the cases where it can be misapplied. And I'm curious how you're seeing teams incorporate LLMs into that discovery and development piece of the puzzle for being able to actually accelerate the creation and reduce some of the manual toil involved with the maintenance of those definitions.
[00:26:11] Shinji Kim:
Yeah, for sure. And the tooling is really starting to improve in this respect as well. But even just by using Claude or ChatGPT, we've seen teams building their semantic layer by feeding in how they define their metrics, along with the list of tables and columns that exist, to output the first version of the YAML files they wanted to create. So constructing the models, and also maintaining those models, is possible. The part that I would say is really hard today is the continued maintenance of the models, as well as having those models stay true as they get used, which, just like any AI agent or application, requires monitoring and evaluation alongside the versions of the semantic models that you're implementing. But the place we see as a really good place to start, and this is where we've been spending a lot of time helping our customers implement semantic models, is the semantic models that you might have already implemented within your BI tool. Within your BI tools, the certified dashboards, the dashboards that your end users and business stakeholders are using, a lot of them are already connected to the critical data elements of your data warehouses and data lakes. That gives you the subset of tables that should be part of the semantic layer. I think that's a really great place to start, because you will also see that those tables tend to be well maintained and have quality checks built in. So if you can feed the LLM with that list of metadata, that is a good place to start, and then you can build upon it as you look at other dashboards or update it as those dashboards evolve on the business side.
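A minimal sketch of that bootstrapping flow might look like the following: gather the metric definitions and the table and column list, assemble them into a prompt, and ask an LLM for a first-draft semantic model YAML to review. The `call_llm` function is a hypothetical placeholder for whichever model provider or client you use; the prompt wording, metric definitions, and file name are assumptions for illustration, not SelectStar's actual pipeline.

```python
# Hypothetical bootstrap: draft a semantic model YAML from existing
# metric definitions plus a table/column inventory, then hand the
# draft to a human (or a catalog tool) for review.
metric_definitions = {
    "total_revenue": "sum of amount_usd on closed orders, reported monthly",
    "pending_contracts": "count of contracts whose status is 'pending'",
}
tables = {
    "analytics.orders": ["order_id", "customer_id", "amount_usd", "status", "closed_at"],
    "analytics.contracts": ["contract_id", "customer_id", "status", "signed_at"],
}

def build_prompt(metrics: dict, table_map: dict) -> str:
    lines = ["Draft a semantic model as YAML with entities, dimensions,",
             "measures, metrics (each with an aggregation), and relationships.",
             "", "Business metric definitions:"]
    lines += [f"- {name}: {definition}" for name, definition in metrics.items()]
    lines.append("")
    lines.append("Available tables and columns:")
    lines += [f"- {table}: {', '.join(cols)}" for table, cols in table_map.items()]
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    # Placeholder: plug in your model provider's client here.
    return "# draft semantic model YAML would be returned by the model\n"

draft_yaml = call_llm(build_prompt(metric_definitions, tables))
with open("semantic_model_draft.yaml", "w") as f:
    f.write(draft_yaml)  # reviewed and versioned by a human before use
```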
[00:28:36] Tobias Macey:
The other piece of the equation when dealing with semantic layers is which underlying technology you should use to actually maintain and expose the models that you define, where most of them use some sort of YAML definition: this is the SQL query that translates to this particular metric, and maybe these are the parameters that can be fed in to aggregate along different axes, etcetera. And there are different engines to actually execute those. You already mentioned Cube as one of the leading ones, which is also marketed as a headless BI system. I know that the folks at SOTA Data have a semantic layer capability.
Obviously, dbt has acquired a metrics layer and incorporated that as part of their product suite. I'm curious what you see as some of the useful heuristics and selection criteria for teams who are starting to evaluate how they are actually going to build and expose these semantic models.
[00:29:40] Shinji Kim:
Yeah. That's a great question. First of all, the BI side of the house most of the time does have some type of semantic model. They have their own semantic layer that is proprietary, so it's hard to get that logic out, but there are some BI tools that allow you to export it, or at least to maintain and see it. LookML is that case, and Hex has a way to implement or accept other semantic models. For Power BI and Tableau, these are things you have to stitch together, but through the API you should be able to get at the business logic built underneath. And that's primarily how, at SelectStar, we are reverse engineering decent semantic models and business logic from the BI tools as well. Now, if we're talking about third-party semantic layers, yes, the ones that you've mentioned are the primary ones. There is also a vendor called AtScale.
I found the approach that Snowflake and Databricks are taking also very interesting. If you think about the semantic layer from the perspective of dbt and Cube, for example, these are a layer that still runs its queries and execution on top of the data lake and data warehouse. You have decoupled it out of the BI, but now you've put it onto the transformation layer, or some other place where you still have to define the YAML file on a third party sitting on top of the data warehouse or lakehouse that you have. What Snowflake is doing is, first, they started with the YAML file as a semantic model to be fed into Cortex Analyst. Now they are moving toward allowing users to have a semantic view that contains all of these components of a semantic layer. It's actually a view that you can create out of that earlier semantic model YAML file, but it's a view that can be queried and acts and lives like other tables and views in your schema. I thought that was really interesting because it's a lot more native in that case, and it also follows the same security measures you have: the roles and permissions, dynamic masking, and tagging, if you have them, all get applied natively. So I thought that was a really interesting approach. Now, semantic views at the time of this recording are in public preview, so it's very early, but I think there is a really optimistic promise there if you can have that natively on the data warehouse side and then just point your BI dashboards to run on top of it. Databricks, I don't think there is public documentation on this yet because it's so early, but they have Unity Catalog metrics that look very similar to something like the semantic view. I think they've also recognized the need for a semantic layer on top of Unity Catalog in order to power Databricks Genie for AI/BI.
So it will be really interesting to see what that will look like and how it impacts AI analyst performance. One other part I think is worth mentioning is companies like Cube. One of the added benefits that I've seen from the semantic layer, and I think one of the reasons why dbt brought on Transform and now has a metrics layer as a proprietary feature, is that once you have that definition, almost like an OLAP cube, you can potentially cache those calculations or aggregations, because they're defined and you know they're going to get queried. You can make those queries a lot more efficient because it's already defined that way. That is right now an added benefit when you have a semantic layer like Cube today, but it should be more like table stakes, and it's one of the reasons why more companies should consider having a semantic layer in their toolchain, I would say.
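As a toy illustration of that caching benefit, the sketch below precomputes a declared metric (a sum grouped by a declared dimension) and serves repeat requests from a cache keyed by the metric and dimension names. Real semantic layer engines do this with materialized pre-aggregations against the warehouse; the in-memory rows and names here are made up for the example.

```python
from collections import defaultdict

# Made-up fact rows standing in for a warehouse table.
orders = [
    {"region": "NA", "amount_usd": 120.0},
    {"region": "NA", "amount_usd": 80.0},
    {"region": "EU", "amount_usd": 200.0},
]

# Because the metric and its allowed dimensions are declared up front,
# the engine knows exactly which aggregations are worth precomputing.
_cache: dict[tuple[str, str], dict[str, float]] = {}

def metric_by_dimension(metric: str, dimension: str) -> dict[str, float]:
    key = (metric, dimension)
    if key in _cache:          # repeat queries skip the scan entirely
        return _cache[key]
    totals: dict[str, float] = defaultdict(float)
    for row in orders:
        totals[row[dimension]] += row[metric.removeprefix("sum_")]
    _cache[key] = dict(totals)
    return _cache[key]

print(metric_by_dimension("sum_amount_usd", "region"))  # computed once
print(metric_by_dimension("sum_amount_usd", "region"))  # served from cache
```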
[00:34:11] Tobias Macey:
One of the things that you mentioned earlier as far as the work that you're doing when working with systems like Tableau or Power BI is that you said that because you're able to ingest the metadata from those business intelligence systems, you can reverse engineer some of the business logic that goes into the underlying dashboards, etcetera. And I'm curious how you're thinking about that as a potential on ramp for data teams who want to more explicitly define these semantic models in a separate technology layer that is divorced from the data visualization system so that they can expose it to more use cases and expose it to more of these LLMs, etcetera?
[00:34:51] Shinji Kim:
Yeah, that is definitely the main use case that we're trying to support. Today, we actually provide the YAML file to the end customers so that they can modify it as they want or put it in their GitHub. The role that we play initially is to give you that quick start and bootstrap the semantic layer, without the hard work of looking at all the physical models to figure out what the metrics and dimensions should be, and filling in the details to make it work. Where we see this going, as we add support for providing this semantic layer YAML file in many different formats, whether that's for the dbt semantic layer, Cube, or Databricks, and so on, is continuing to also provide a system that can maintain and update the semantic layer portion.
That part, I think, is important. On the BI side, one of the things you end up running into is multiple different teams defining their own KPIs and metric definitions that differ from one another. So there is a discovery aspect: people may not be aware that the metric is already defined, or they might simply not have used it in the past. That is one of the places where we see we can help our customers. Today, when we are generating metrics, we will indicate to the user: hey, there is another metric that looks exactly the same, maybe even within your same dashboard; they are named differently, but the underlying measure and dimension are exactly the same. Would you like to combine these, or keep them under this name, and here is the documentation? That is something we saw really adding value, because it allows customers to understand how metrics and the data used on the business side have proliferated on their own, and it gives them the opportunity to organize them and have a good single source of truth for how they want to represent the business side as well.
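A simple version of that duplicate detection can be sketched as grouping metrics by their normalized measure expression and dimension set, and flagging groups that carry more than one name. The metric list and normalization below are invented for illustration; a production catalog would also compare lineage and filters.

```python
from collections import defaultdict

# Invented metric definitions pulled from two dashboards.
metrics = [
    {"name": "revenue", "measure": "SUM(amount_usd)", "dimensions": ["region"]},
    {"name": "total_sales", "measure": "sum(amount_usd)", "dimensions": ["region"]},
    {"name": "active_users", "measure": "COUNT(DISTINCT user_id)", "dimensions": ["day"]},
]

def normalize(expr: str) -> str:
    # Crude normalization: case and whitespace only.
    return "".join(expr.lower().split())

def find_duplicates(metric_list: list[dict]) -> list[list[str]]:
    groups = defaultdict(list)
    for m in metric_list:
        key = (normalize(m["measure"]), frozenset(m["dimensions"]))
        groups[key].append(m["name"])
    # A group with more than one distinct name is a likely duplicate definition.
    return [names for names in groups.values() if len(set(names)) > 1]

print(find_duplicates(metrics))  # [['revenue', 'total_sales']]
```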
[00:37:06] Tobias Macey:
And in your work of building Select Star, working with some of your customers, helping them to bootstrap these semantic models, what are some of the most interesting or innovative or unexpected ways that you've seen that new single source of truth, the semantic representation applied either with or without AI use cases?
[00:37:26] Shinji Kim:
Yeah. In the beginning, for customers that were using semantic models or had implemented a semantic layer, we saw the production use cases primarily in embedded dashboards, not in their core analytics. It's really with the AI analyst that more companies are considering having semantic layers for their core data marts and analytics use cases. What we're starting to envision with some of our customers is how these semantic models can be utilized not just by the AI analyst they've defined, but by the AI agents that they are building, which could be related to, say, an MCP client. How this can be embedded in their application layer is another direction we're starting to see customers ideate toward. But the core use case has been answering business questions. On that front, we've seen customers deploying this so their revenue ops teams or marketing teams can start using Cortex Analyst instead of always having to go to their analytics teams directly, as well as companies running retail chains, where branch managers can check how their store is doing and ask more strategic questions about what they could do differently or better for the specific location they are managing, for example.
[00:39:09] Tobias Macey:
And in your work of building Select Star, working with end users and data teams who are tackling the semantic modeling challenge, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:39:23] Shinji Kim:
I think two things. One is that starting with the model and getting the first POC, whether that's through the embedded dashboard case or just testing whether your first set of business questions gets answered, that part is easy. The scaling side, the actual governance and maintenance, as your business models change, as you start to add new datasets or create more models underneath, how do you actually govern and maintain those? That is a real challenge that is coming. Related to that, the second part I was going to mention is that, just like any data modeling, with semantic models you can fall into the trap of over-modeling the data, and that can be a big time suck. So being able to actually put it into practice and continue to iterate is an important part, because we've also seen some of our customers that implemented a semantic layer in the past spend a lot of effort modeling everything.
But how much of their model is actually being leveraged and used today, by AI or in their querying? Not 100%. And then their data team got very tired of having to keep updating it, or some of it just wasn't fully updated, and once it goes out of date, it's no longer relevant. So that's one main area where we believe in having a more systematic way to continuously update the semantic model. If it has to fit in a file, we feel that's a bit harder; if it can be a system or a view, or if there are ways through an API to update that model, I think that's a much better way. As an industry, we still have some ways to go there.
[00:41:19] Tobias Macey:
And for teams who are starting to think about tackling this body of work, what are the situations where you would say that building a semantic model or investing in a semantic layer are the wrong choice and
[00:41:33] Shinji Kim:
the reward is not going to be worth the cost. I think it always comes down to: one, is there a use case that the physical model currently cannot solve? If, no matter how well you document your data, you're still not getting the right SQL query back from your AI, then that's definitely one of the reasons to implement a semantic model. If not, if the physical model is clean and simple enough, then you don't always have to do this. Otherwise, it's really about where the end usage would come from to make it worthwhile, which goes back to: can you test it out with the end users who are going to leverage this? From a use case perspective, AI should definitely be in mind; I just haven't seen other use cases that truly give you the ROI of doing the semantic model implementation beyond that, just because data teams are always very busy getting their work done as well.
[00:42:32] Tobias Macey:
You mentioned some of the work that Snowflake and Databricks are doing to incorporate semantic modeling more closely into the core experience of the underlying warehouse engine. I'm wondering what you see as the potential future for this concept of semantic modeling becoming more of a native construct of the underlying data layer or compute engines and maybe any other areas of expansion or ecosystem investment that you would like to see?
[00:43:05] Shinji Kim:
Yeah, I think that's a good way to put it. Having that native construct where the data exists is a very interesting movement for the industry; we'll have to see how well it improves as more customers adopt it. The other big part that I would love to see, and that I predict will need to happen, is a more automated, operational way to update the semantic models, especially if I think about what semantic models will look like in the future when the end users are primarily AI agents and AI applications, not humans. Today, semantic models are fully designed to be consumed, operated, and edited by humans, not AI agents. So a way to self-update the model, and maintain it so that it is more of a system instead of a YAML file, is probably where it needs to go. The other piece I think would be important is integration outside of the semantic model itself. One of the biggest things a lot of semantic layer companies provide is this centralizing and consolidating: no matter which source and destination you're talking about, you should be able to query it using the same DSL or natural language. The integration between the BI tools, where the business logic all lives, and the user interaction, and how that translates into, impacts, and self-updates the semantic layer side, is the more idealistic future that I would envision.
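One way to picture "a system instead of a YAML file" is a small service that owns the semantic model, exposes an update call, and keeps a version history so agents and humans can propose changes that are reviewed and evaluated before they go live. The sketch below is an in-memory stand-in for that idea; the class and method names are hypothetical, not a reference to any existing product.

```python
import copy
from datetime import datetime, timezone

class SemanticModelStore:
    """Hypothetical versioned store: updates go through an API, not file edits."""

    def __init__(self, initial_model: dict):
        self.versions = [{"model": copy.deepcopy(initial_model),
                          "author": "bootstrap",
                          "at": datetime.now(timezone.utc)}]

    @property
    def current(self) -> dict:
        return self.versions[-1]["model"]

    def propose_metric(self, name: str, definition: dict, author: str) -> int:
        """Add or replace a metric and record a new version; returns the version number."""
        model = copy.deepcopy(self.current)
        model.setdefault("metrics", {})[name] = definition
        self.versions.append({"model": model, "author": author,
                              "at": datetime.now(timezone.utc)})
        return len(self.versions) - 1

    def rollback(self, version: int) -> None:
        """Pin the model back to an earlier version if evaluation regresses."""
        self.versions.append(copy.deepcopy(self.versions[version]))

store = SemanticModelStore({"metrics": {"total_revenue": {"aggregation": "sum"}}})
v = store.propose_metric("pending_contracts",
                         {"aggregation": "count", "filter": "status = 'pending'"},
                         author="ai-agent")
print(v, sorted(store.current["metrics"]))  # 1 ['pending_contracts', 'total_revenue']
```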
[00:44:46] Tobias Macey:
Yeah. I think that the fact that semantic models as a technology have so far largely been manifested as collections of YAML files, well, it has a fairly low barrier to entry, but I think it also speaks to the lack of maturity in the space, where that is effectively the only way that you can represent them, and they are not a core concern of the underlying data systems. They're more of a bolt-on addendum, and I would like to see them become more of an integrated piece of the actual core compute capability. That's right. Yeah. Well, that's very well said. Are there any other aspects of semantic modeling, the ways that it empowers more of these AI systems and AI use cases, or the work that you're doing at SelectStar to help teams address these complexities, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:42] Shinji Kim:
I think anyone who is actually considering implementing this for AI use really needs to make sure they're not just considering the definitions of the tables, columns, dimensions, or metrics. The other aspects that we found very helpful and important to add into the semantic layer implementation were, first, the relationships of the entities or relationships between the columns, as well as sample values and synonyms. From a metadata perspective, we initially weren't sure how much these would impact the model, but from the AI perspective, these are the things that really made a big difference in the quality of the results we've gotten. So I wanted to share that as a quick tip for whoever wants to run a semantic layer for AI use. I also say this because some of the comments I've gotten in the past when I've talked about this on LinkedIn were, oh, I tried it but it didn't really work, AI analysts aren't that good. And I would say that's mainly because you really have to look at what your semantic models look like. The quality of the semantic model determines a huge range of what the AI will do, based on how complete your semantic layer is.
[00:47:08] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and the ways that you're helping to address this challenge of semantic modeling, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:30] Shinji Kim:
On the data management and data governance side, from what we have seen, now being in the business for more than five years, the biggest gap after getting a good amount of tooling is always around actually implementing it and operationalizing it, putting it into processes, and having everyone actually leverage the tool. I can't say this is a specific segment of tooling that the market needs; it's more about where AI agents fit in and how internal teams can really embed the tooling into the day to day of the employees and everyone they work with. So that's what I would say.
[00:48:16] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences and hard-won lessons about semantic modeling, particularly in this context of AI systems and the ways that AI is being applied to the challenges of data analysis and end-user use cases. I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thanks so much, Tobias. This was great. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Welcome
Shinji Kim's Journey in Data
Evolving Data Ecosystem
Semantic Modeling and Its Importance
AI and Semantic Models
Choosing the Right Semantic Layer Technology
Applications and Challenges of Semantic Models
When Not to Use Semantic Models
Future of Semantic Modeling
Conclusion and Final Thoughts