Summary
In this episode of the Data Engineering Podcast, Serge Gershkovich, head of product at SqlDBM, talks about the socio-technical aspects of data modeling. Serge shares his background in data modeling and highlights its importance as a collaborative process between business stakeholders and data teams. He debunks common misconceptions that data modeling is optional or secondary, emphasizing its crucial role in ensuring alignment between business requirements and data structures. The conversation covers challenges in complex environments, the impact of technical decisions on data strategy, and the evolving role of AI in data management. Serge stresses the need for business stakeholders' involvement in data initiatives and a systematic approach to data modeling, warning against relying solely on technical expertise without considering business alignment.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Enterprises today face an enormous challenge: they’re investing billions into Snowflake and Databricks, but without strong foundations, those investments risk becoming fragmented, expensive, and hard to govern. And that’s especially evident in large, complex enterprise data environments. That’s why companies like DirecTV and Pfizer rely on SqlDBM. Data modeling may be one of the most traditional practices in IT, but it remains the backbone of enterprise data strategy. In today’s cloud era, that backbone needs a modern approach built natively for the cloud, with direct connections to the very platforms driving your business forward. Without strong modeling, data management becomes chaotic, analytics lose trust, and AI initiatives fail to scale. SqlDBM ensures enterprises don’t just move to the cloud—they maximize their ROI by creating governed, scalable, and business-aligned data environments. If global enterprises are using SqlDBM to tackle the biggest challenges in data management, analytics, and AI, isn’t it worth exploring what it can do for yours? Visit dataengineeringpodcast.com/sqldbm to learn more.
- Your host is Tobias Macey and today I'm interviewing Serge Gershkovich about how and why data modeling is a sociotechnical endeavor
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the activities that you think of when someone says the term "data modeling"?
- What are the main groupings of incomplete or inaccurate definitions that you typically encounter in conversation on the topic?
- How do those conceptions of the problem lead to challenges and bottlenecks in execution?
- Data modeling is often associated with data warehouse design, but it also extends to source systems and unstructured/semi-structured assets. How does the inclusion of other data localities help in the overall success of a data/domain modeling effort?
- Another aspect of data modeling that often consumes a substantial amount of debate is which pattern to adhere to (star/snowflake, data vault, one big table, anchor modeling, etc.). What are some of the ways that you have found effective to remove that as a stumbling block when first developing an organizational domain representation?
- While the overall purpose of data modeling is to provide a digital representation of the business processes, there are inevitable technical decisions to be made. What are the most significant ways that the underlying technical systems can help or hinder the goals of building a digital twin of the business?
- What impact (positive and negative) are you seeing from the introduction of LLMs into the workflow of data modeling?
- How does tool use (e.g. MCP connection to warehouse/lakehouse) help when developing the transformation logic for achieving a given domain representation?
- What are the most interesting, innovative, or unexpected ways that you have seen organizations address the data modeling lifecycle?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with organizations implementing a data modeling effort?
- What are the overall trends in the ecosystem that you are monitoring related to data modeling practices?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- sqlDBM
- SAP
- Joe Reis
- ERD == Entity Relationship Diagram
- Master Data Management
- dbt
- Data Contracts
- Data Modeling With Snowflake book by Serge (affiliate link)
- Type 2 Dimension
- Data Vault
- Star Schema
- Anchor Modeling
- Ralph Kimball
- Bill Inmon
- Sixth Normal Form
- MCP == Model Context Protocol
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? Datafold's migration agent is the only AI-powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months-long migration nightmares into week-long success stories.
Enterprises today face an enormous challenge. They're investing billions into Snowflake and Databricks, but without strong foundations, those investments risk becoming fragmented, expensive, and hard to govern. And that's especially evident in large, complex enterprise data environments. That's why companies like DIRECTV and Pfizer rely on SqlDBM. Data modeling may be one of the most traditional practices in IT, but it remains the backbone of enterprise data strategy. In today's cloud era, that backbone needs a modern approach built natively for the cloud with direct connections to the very platforms driving your business forward. Without strong modeling, data management becomes chaotic, analytics lose trust, and AI initiatives fail to scale.
SqlDBM ensures that enterprises don't just move to the cloud, they maximize their ROI by creating governed, scalable, and business-aligned data environments. If global enterprises are using SqlDBM to tackle the biggest challenges in data management, analytics, and AI, isn't it worth exploring what it can do for yours? Visit dataengineeringpodcast.com/sqldbm, that's sqldbm, to learn more. Your host is Tobias Macey, and today I'm interviewing Serge Gershkovich about how and why data modeling is a sociotechnical endeavor. So, Serge, can you start by introducing yourself?
[00:02:16] Serge Gershkovich:
Sure. Serge Gershkovich. I've been a data modeler and data architect by trade for most of my professional career, and I'm also the head of product at SqlDBM, which is an online database modeling tool. It's a SaaS product that runs in the browser. So data modeling is very near and dear to my heart. I'm also a Snowflake data superhero, a big fan of Snowflake, as you can tell, and everything that they're doing. And, yeah, that's me in a nutshell.
[00:02:53] Tobias Macey:
And do you remember how you first got started working in data?
[00:02:56] Serge Gershkovich:
Kind of by accident, I think, the way most people end up in data. Nobody tells their parents at a tender young age, "I wanna be in data when I grow up." But I started with a technical degree, a computer science background, landed my first internship on the data side of a company, and just kind of stayed there. It was interesting. It was challenging in a way that programming, I'm sure, has different challenges, but that's where I ended up. It was in the SAP ecosystem to start, and I spent over a decade in SAP: first BW, then HANA, then the business intelligence suite alongside that. Eventually, once I saw Snowflake, I decided I never wanted to touch another SAP system again, and I've lived up to that promise.
I'm very proud of that. And, yeah, eventually I just kind of shifted gears to product management.
[00:04:01] Tobias Macey:
And so in terms of the overall space of data modeling, that is a term that can be used across a large number of contexts and use cases. And I'm wondering if you can just start by giving your definition of the activities that are involved in that overall practice of data modeling.
[00:04:23] Serge Gershkovich:
That's a great question, because I want to first start with what it isn't and dispel some common myths that persist for those who are unfamiliar with it, or not used to doing it on a regular basis, or doing it systematically as part of a project or part of the data strategy at their organization. If I could summarize it in one phrase or image, it would be a conversation between a business stakeholder or a subject matter expert and a data team: data engineer, data architect, business intelligence team, analyst.
It's understanding what the business does and encoding that semantic information so that our data models and our data structures accurately capture those business requirements and the needs of the business. We can, of course, broaden the definition from there. But at the core, if people walk away with anything, it's being able to align with your business stakeholders so that everybody's on the same page, and to have some kind of semantics, whether those semantics are visual, verbal, technical, or preferably all three, that everybody can understand: a common talking point, a common reference point.
[00:05:52] Tobias Macey:
And on the flip side, you mentioned some of the things that it's not. And I'm wondering if you can talk to some of the overall buckets of misconceptions that you've seen people come into the overall practice of data modeling with where maybe they have a very narrow view of the set of tasks and responsibilities involved, or they have a different area of focus or a different set of objectives as far as what that data modeling practice is supposed to achieve and maybe some of the ways that that can derail the overall exercise.
[00:06:29] Serge Gershkovich:
Sure. I think the biggest one is the misconception that it doesn't exist or that it's optional. Joe Reis famously said that the lack of a data model is still a data model, albeit a crappy one. This misconception probably has its roots in the fact that most of us who work with data do so on the analytics side. We're looking at it not in the transactional system but in the data warehouse, where you could make the claim that we're not, broadly speaking, formally modeling something. We might just be getting data delivered, landed in our raw layer, and then working our way up through the transformations to reporting. We're not formally creating ERD diagrams. We're not formally discussing our designs and the relationships that our dimensions have with our facts. But by definition, we have a data model. It's just that we don't own it. It comes cascaded down to us from the source system, where, of course, you can't have, in a transactional system, the sort of chaos you might find in a data warehouse, because sales wouldn't get recorded, bookings wouldn't get made, etcetera. So there's absolutely a data model involved in everything you do. Our CRM systems, our master data management systems, they all have a data model. And even if that data model lives in your head, and your data landscape is small enough for you to effectively keep it there, that doesn't mean the data model doesn't exist. So that's the biggest misconception.
The other one is that a data model is somehow secondary, something we do after the fact. We build first. And, as much as I love dbt, it carries a lot of blame for this misconception: that you can build, you can write SQL, and you can effectively CTAS your way to a data model, and if you need to document something, then describe that DDL and off you go. This is why I started with fundamentals. It's much easier to check something out on paper before you actually build the thing, before you write lines of code and commit business logic, loading, testing, deployment, all of those resources, just to find out that you couldn't align on the fundamental formula or the understanding of something. The point I'm making is that data modeling is a sanity check when it's done in the right order: talk to the business, get the requirement, encode it in something very simple, an ERD, for example, and then go off and build. Because the chance that you've made a mistake after two pairs of eyes, especially the business owners or the subject matter experts, have had a look at it is dramatically smaller. And even dbt has guardrails in place where you can absolutely enforce data contracts and things like that, which prevent you from even building something that doesn't meet your original spec of data types, number of columns, references, and so on. The analogy I like to use is the map. A data model is effectively a map of where you're trying to go in your data strategy. I'll come back to the transformational side of data modeling; let's put a pin in that one. But fundamentally, it's a map. And do you need a map? Well, the answer is sometimes, not always.
If you're standing in front of, let's say, a shopping center and it's just a strip mall, you don't need to consult the directory, because you see all five stores and walk into the one you're looking for. If you're in an enormous shopping center, then you might well use a map just to find your car in the parking lot. And you wouldn't just wander around aimlessly if you're looking for something very specific. It would be in your best interest to spend two seconds consulting the directory, walking directly to where you need to go, and getting that information.
And, unfortunately, that's another of the drivers that makes people overlook data modeling: just like in this analogy, if you walk into a shopping center and start wandering around and asking strangers, you will eventually get to where you want to go. What this looks like in the business world, to take our previous example, is a data architect dealing with a data landscape that's effectively small enough to fit inside your head. Imagine it's a startup, a small company. But the company grows, your data sources grow, your data assets grow, your business units grow, your business teams and data teams grow.
The problem is you start to feel the symptoms, you start to feel the pain, before you've identified that a problem exists, because it feels quotidian. Before, I used to know everything. Now I didn't know, so I asked my colleague next to me. Tomorrow he didn't know either, so we called a meeting with the business stakeholders. Then the business stakeholders couldn't really give us an answer, so we guessed and built something. It turned out to be wrong, so we went back to design and rebuilt it. All good. But the problem is, it's like a frog in boiling water. It was okay, and the pain points are all incremental. So it feels fine; nothing's broken. But fast forward a year, a year and a half: data quality issues, lack of trust, we're blaming and getting blamed instead of collaborating with those business stakeholders like we're meant to. We're not delivering value, because even when we get it right, the data is suspect.
And how did we get here? It wasn't overnight; it was gradual. This is another pain point where, if you know to look for the red flags and the warning signs, you can avoid this doomsday scenario. Unfortunately, I think anybody who has worked in data has seen some version of this, whether full blown or just starting, because no data strategy is perfect. But these are just tools that help people orient, help them understand what the company is doing and what it is we're capturing about what the company is doing, and then present that to the decision makers.
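The data-contract guardrail mentioned earlier, failing a build when an upstream table no longer matches the agreed spec, can be sketched in a few lines. This is an illustrative Python sketch of the idea, not dbt's actual contract mechanism; the column names and the contract format are invented for the example.

```python
# Hypothetical "data contract": the agreed spec, column name -> data type.
EXPECTED = {
    "customer_id": "int",
    "customer_name": "varchar",
    "signup_date": "date",
}

def check_contract(actual_schema: dict) -> list:
    """Compare an upstream schema against the contract.

    Returns a list of violations; an empty list means the build may proceed.
    """
    violations = []
    for col, dtype in EXPECTED.items():
        if col not in actual_schema:
            violations.append(f"missing column: {col}")
        elif actual_schema[col] != dtype:
            violations.append(
                f"type mismatch on {col}: expected {dtype}, got {actual_schema[col]}"
            )
    for col in actual_schema:
        if col not in EXPECTED:
            violations.append(f"unexpected column: {col}")
    return violations

# A source table that silently dropped a column fails the contract check
# up front, instead of silently breaking the pipeline downstream.
upstream = {"customer_id": "int", "customer_name": "varchar"}
print(check_contract(upstream))  # -> ['missing column: signup_date']
```

The point is the ordering Serge argues for: the spec exists before the build, so a mismatch surfaces as an explicit violation rather than a broken pipeline.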
[00:12:58] Tobias Macey:
Another point that you raised in terms of this overall space of data modeling is, in particular, the source systems that you're pulling data from, but also data modeling is not something that is the sole domain or dominion of the data warehouse environment. You also need to incorporate data modeling in exercises such as streaming ingest or if you're doing any data distribution to things like applications or other contexts or if you're trying to incorporate unstructured or semi structured data assets either into a warehouse environment or an application context, you still need to understand what is the actual purpose and semantics of that data.
And I'm wondering how you see that recognition of the other modalities and localities of data being brought into that overall conversation of data modeling, and what bearing does this piece of information or this collection of information have on the overall business domain model?
[00:14:03] Serge Gershkovich:
Great question, because I would say that I don't see it, or at least not as much as I would like to. And, just like myself, many people end up in the data world without a formal background. I've seen many people come even from the business side: they were the data expert on their team because they were managing the big Excel file that nobody else knew how to make sense of. They eventually get formalized into the data team, get taught the fundamentals of SQL, and then they're off and running. But many people, for example, don't know that even what we call a database today is founded in relational theory, which is mathematics.
So the technology that we're using, of course, all of the cloud compute and caching and storage, there have been massive innovations there. But fundamentally, why do we call it a relational database? Because it is grounded in relational theory, which goes back to the sixties and seventies, when E. F. Codd codified all of this, and that eventually gave rise to the database as we know it today. Concepts like rows and columns were revolutionary at some point. And when all of this was getting codified, transformational modeling (and, like I said, I'll get back to what I think that term should encompass) was nascent or almost didn't exist. The fact that you would be analyzing data across systems, and not just within one system, was not yet a thing. And today we see the contrast with a transactional system, where all of our models, largely speaking, can be expressed in that classical definition of entities, relationships, dependencies, primary and foreign keys.
That is not so much the case on the warehousing side, where the OLAP system has a different type of dependency. It's not so much a formal parent-child relationship, but it is formal in the sense that if the table I'm pulling from tomorrow is missing the column I'm trying to pull, then the pipeline is gonna break. So this is a transformational dependency. It's not that the dimension is a conformed dimension from parent to child, but you have a data pipeline that depends on all of these things coming together. So what are you trying to solve on the OLAP side and the OLTP side? On the OLTP side, you're just trying to make sure that everything is as close to normal as possible and that your data is consistent and not duplicated.
And you're able to maintain it in a way that avoids all types of errors: dependency errors, update anomalies, etcetera. On the data warehouse side, we can almost guarantee that if we built something upstream, our data is pretty much clean when it lands. Of course, that's not always the case, but let's just go with it. On the data warehouse side, as you mentioned, you have multiple systems. You have conformed dimensions by definition. You have transformational dependencies. So let's unpack that. What does that mean? When the data lands, it's quite dirty. It needs to be renamed, cleaned up, made consistent. Then there are all the different methodologies you could be using, and we'll park that for a later question, but let's just go with the general: medallion, or staging and reporting, whatever you want to use. You have a raw layer that is just data as is. You have a staging layer where it's a little more conformed and normalized, but still not overly aggregated. And then you have a reporting layer that is maybe denormalized completely, into one big table if you want to go to that extreme. Regardless, you have pipelines that transform the data across those three layers. So you have dependencies around selection, and you have quality issues in terms of whether you're calling the same field by different names along different parts of that process.
You also have the fact that multiple processes can work on that data and create, let's say, the sales view and the marketing view of our trading information. Now, by definition, the core is duplicated, and some of the logic around it is specific to a certain department. Which one is right? Can I pull yearly totals from either one, or have they been distorted by business logic that's specific to each business unit? So now you run into all sorts of problems that didn't exist on the transactional side. So how do we manage that? Take a view, for example. What is a view doing? It's applying SQL logic to a dataset and giving you a different kind of result.
Unfortunately, the word "logical" has already been co-opted. In formal modeling, generally, we start with a conceptual model, which is very basic. Then we go to the logical layer, which is similar to the physical but not really tied to any particular physical database, which makes it easy to understand and then transform into a physical database, or multiple physical instantiations. And then the physical, where you're actually describing the particulars, like clustering and partitions and column data types, things that are going to help you effectively store the data. On the logical side, you're not so much concerned about storage. You're concerned about business nuances: is this a subtype of this other type? How does a customer place an order? Can a customer, must a customer, place an order to even be a customer? Things like that that you're trying to iron out.
And what is a good word for everything that happens in a data warehouse? I'm trying to go with "transformational." In my intro, I forgot to mention that I'm also an author. I have a book called Data Modeling with Snowflake, because, given the nature of my role at SqlDBM, I've had many conversations with folks who kind of understand the need, the importance, or have just heard the term data modeling, and they want a bit of a foothold because, like we said earlier, maybe they didn't have a formal background in it. People would come to me and say, can you recommend some reading material? And, honestly, I was struggling. I wasn't going to throw C. J. Date at them when they'd just come from the business world. And I thought, let me write something that is practical and to the point: what it is, why do it, how it's going to help you, and here are some basic formulas. That's why I wrote Data Modeling with Snowflake specifically, so that readers can not just read about a type 2 dimension, but actually take a script that is optimized for Snowflake's architecture, create one, and see what it looks like. And of course, I didn't stop there, because even in a type 2 dimension there are transformations taking place. So, to tie this all back together, it's important to understand that the transformational component of modeling is real, is important to document, to understand, to codify, in a way that, again, looks more like data lineage than an entity relationship diagram.
But at the heart of it, it's still one entity being connected to another entity, not as a relationship, but as a dependency, for example. Because those are the types of questions that come up in the data warehousing context: I'm seeing this pretty dashboard, but where is this information coming from? How do I trust it? How do I trace it back to the source? Of course, there's a whole ecosystem of monitoring tools, lineage tools, catalogs. And it can be a little overwhelming, first of all, if it's done at all. Second, if it's done, is it being done on somebody's prerogative, let's just buy something so I can claim we have an installation of such a thing, or is it actually being implemented to answer practical business questions of lineage? And is it up to date, etcetera? So, a very long answer to a very complicated problem, but I hope I covered the edges.
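The type 2 dimension load Serge alludes to can be sketched as follows. This is an illustrative Python version of the pattern (on a change, close the current row and open a new one), not the Snowflake-optimized SQL from his book; the `from_date`/`to_date` column names and the 9999-12-31 "open row" sentinel are common conventions assumed for the example.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel to_date meaning "current version"

def apply_scd2(dim_rows, incoming, business_key, tracked, as_of):
    """Apply a type 2 (slowly changing dimension) load.

    dim_rows: list of dicts, each with from_date/to_date versioning columns.
    incoming: latest source attributes, keyed by business key.
    tracked:  attribute names whose changes create a new version.
    """
    for key, attrs in incoming.items():
        current = [r for r in dim_rows
                   if r[business_key] == key and r["to_date"] == OPEN_END]
        if current and all(current[0][c] == attrs[c] for c in tracked):
            continue  # no change: keep the open row as-is
        if current:
            current[0]["to_date"] = as_of  # close the old version
        dim_rows.append({business_key: key, **attrs,
                         "from_date": as_of, "to_date": OPEN_END})
    return dim_rows

# Customer 1 moves from Madrid to Palma: the Madrid row gets closed and
# a new open Palma row is appended, preserving history.
dim = [{"customer_id": 1, "city": "Madrid",
        "from_date": date(2023, 1, 1), "to_date": OPEN_END}]
apply_scd2(dim, {1: {"city": "Palma"}}, "customer_id", ["city"], date(2024, 6, 1))
```

This is exactly the kind of transformational logic Serge argues should be documented and codified rather than left implicit in a pipeline.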
[00:22:48] Tobias Macey:
No, I think that's a great summary of that problem space, digging into some of that technical, transformational aspect where you were mentioning things like type 2 dimensions, conformed dimensions, dimensional modeling. When you're in that context of building the warehouse as, effectively, the digital twin of the organization, one of the stumbling blocks or points of analysis paralysis that can often come up is deciding on the dimensional approach to take. Am I going to do the Kimball-style star schema? Am I going to do data vault? Am I going to explore anchor modeling? Do I just throw up my hands and do the one-big-table approach? And I'm wondering how you're seeing teams try to leapfrog or short-circuit some of that analysis paralysis and just start doing some building and discovery, to then build the feedback loops that help inform exactly which approach to take, or exactly how to implement whichever approach they're leaning towards.
[00:23:54] Serge Gershkovich:
That's a great question, and I think I have a unique perspective on it that could be very helpful, because especially in my role now as head of product at SqlDBM, not only do I work on the product side, but I naturally also meet with our customers. And not just customers, but people who are early in the funnel, maybe just trialing the tool. And I get to see not just one use case, as I would as an architect working with a company long term, or even as a consultant working on a project over the course of six months to a year. I get to speak with dozens of people a week, a month, and I get to touch base with them over those increments: first week, first month, first six months, a year. And I think the failures, the use cases that didn't pan out, are just as telling and informative as the ones that did. So one of the counterintuitive things I saw from this vantage point is that technical prowess is not a guarantor of success for data initiatives.
Very early on, and this caught me off guard, I would be speaking to people who, some of them, would literally brag about having met Kimball and Inmon and having been taught by them. So initially, naively, I would think, surely they know what they're doing. Then fast forward six months, and I see that they're not really using the tool. Fast forward a year, and they're not renewing, because their plan basically fell apart. And what went wrong is they effectively said, well, I'm the architect, I'm the one that's gonna be doing the heavy lifting, so I'll just need a license for myself, and I'll get 'er done. Problem: they are not involving the business.
They're not asking for viewer licenses. They're not asking for an embed of this into their wiki or their Confluence so that, as they're building, the business can look on and sense-check what they're doing. So that's number one. The number one biggest mistake I see, the biggest red flag for a project that is likely to flounder, is lack of understanding that the data team does not own the concept of what it is they're building. They own its implementation. They don't necessarily know, or by design they don't know as well as their business colleagues, the intricacies of the things they're supposed to be capturing.
Number two, and I think you hinted at this, is that a lot of times people want the awards. They want the accolades on their resume, the stripes on their epaulettes: "I implemented data vault," because they heard about it and saw it succeed in other contexts, or "I did anchor modeling." There are people who understand anchor modeling, who are comfortable in sixth normal form, and God bless them. And there are people who just kind of read the textbook and are ready to go all in and risk their company's data strategy on it. That's the second biggest reason I see implementations fail: wanting to do something for the sake of doing it, when the need just isn't there to justify that amount of complexity.
And sometimes it's an ego problem. You might have somebody who is a really talented data architect, a seasoned veteran, certified, what have you, but their team isn't; their team might be relatively junior to them. So the team can't maintain it, or they make mistakes the architect wouldn't have made, and it ends up holding them back more than it accelerates what they're doing. Of course, at a certain point, you can't survive without Data Vault, or something like it, to handle 50 source systems. You're not gonna try to do that by hand.
And the last one I would say is kind of the opposite of what I previously said: going to the other extreme and not having control over the fundamentals. And when I say fundamentals, I really mean fundamentals. When I said, what is data modeling? It's a shared concept of semantics, because semantics communicate a density of information that words, when there's lots of them, simply cannot. Those semantics could be visual. We're not recording video, but I like to show a slide of just a list of tables that you would see when you log into any database. You open the database, open the schema, you see a list of tables.
Even when it's like 10 tables and they're properly, cleanly named customer, product, sales, it might still take us, even an expert, some data wrangling to understand: okay, is a location a customer location, or is it a supplier location? A simple diagram with just boxes and lines can solve that problem visually, instantly. Other semantics are verbal. Earlier, we said the words "type two dimension." That's three words that to many of you likely symbolize so much more than what is in those three words. For example, you know you can expect the granularity of that table, that it's not just gonna be the business key. You understand that it's likely gonna have columns similar to from-date and to-date, if not others. Maybe you even know the pseudocode required to load one effectively.
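To make that shared expectation concrete, here is a minimal sketch of what loading a type two dimension implies: expire the current version of a row and append a new one with validity dates. The table layout and column names (`customer_id`, `valid_from`, `valid_to`) are illustrative assumptions, not a fixed standard.

```python
from datetime import date

# Sentinel for "still current"; a common convention, but an assumption here.
OPEN_END = date(9999, 12, 31)

def apply_scd2_change(rows, business_key, new_attrs, as_of):
    """Close the current version for the key and append a new version."""
    for row in rows:
        if row["customer_id"] == business_key and row["valid_to"] == OPEN_END:
            row["valid_to"] = as_of  # expire the old version
    rows.append({
        "customer_id": business_key,
        **new_attrs,
        "valid_from": as_of,
        "valid_to": OPEN_END,
    })
    return rows

# One customer row, then a city change arrives on 2024-06-01.
dim = [{"customer_id": 1, "city": "Madrid",
        "valid_from": date(2020, 1, 1), "valid_to": OPEN_END}]
apply_scd2_change(dim, 1, {"city": "Barcelona"}, date(2024, 6, 1))
```

The point is exactly the one above: the phrase "type two dimension" carries all of this structure implicitly for people who share the vocabulary.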
So these are the things that we expect, that when I say type two dimension, you don't just go scratching your head and building something random. The same thing goes for business concepts. When I say active customer, that's a popular one, because nobody has an accurate definition of what an active customer is, or there are probably multiple of them. And unless we know how to steward that knowledge, that context, to say: okay, we used to have just one definition, now we have three departments that have their own definitions, this is the company standard, this is the marketing standard, etcetera.
These are the things that, when two people are kind of looking at each other and nodding and not necessarily thinking the same thing, lead to problems. So back to the point I was making: the other issue is, again, a lack of fundamentals, the principles that drive success. They're just best practice patterns. You don't have to necessarily model in Kimball, but knowing what a fact versus a dimension is is very helpful just in trying to describe what type of information a certain table contains.
So some of these things are foundational, and a lot of times, like I said, they're absent from the vernacular. That's why I started with the very first question of what is data modeling. It's not necessarily an ERD diagram. It's the minimal effective dose of consensus that moves us forward as a team.
[00:31:28] Tobias Macey:
So we've been narrowing in from the overarching purpose, through the incorporation of the business concepts and doing the organizational modeling, to trying to map that to a technical representation, where the general idea in data warehousing is to serve as some sort of digital representation of the real world manifestation of the business and its processes, so that you can then do some analysis and reporting and maybe prediction on the overall functioning of the business. And that inevitably involves the technical details that we've been touching on, and those technical details can be the dimensional approach that we've been discussing. It can be the specific database engine. It can be the ways that the data is represented or stored, whether that is in a lake or a warehouse or as free text, etcetera.
And I'm curious what you're seeing as the most significant impacts that the technical decisions tend to have in either accelerating or hindering the capabilities of the organization to actually implement that digital twin representation of the business.
[00:32:49] Serge Gershkovich:
It's a mixed bag, because technology either accelerates our progress or just allows us to fail faster. If it's complicating more than it's adding value, then it's doing the opposite of what it's meant to do. There's plenty of use cases for everything that you mentioned, starting, let's say, with semi structured data, where relational kind of set the groundwork. But sometimes, especially when we're dealing with IoT, with sensor information, with logs, the data is semi structured, or we're actually using our data lakehouse or data lake to its full potential, which includes unstructured data.
These are just ways of encoding, of being able to quickly process the information as it's being loaded. But how do we ensure that the things inside these nested relationships are accurate? Our brains are not great at this. Just like we're not great at visualizing exponential growth, we're not great at visualizing nested data, especially once it goes beyond one or two levels. And, in fact, in one of the exercises in my book, I set out to prove why the schema-on-read approach failed. It's not because it's slower. It's not because modern systems can't handle semi structured data just as well as structured. It's because "is" does not imply "ought," fundamentally.
It's actually the same problem as with relational, but it further compounds with semi structured: once you tear it apart, you realize that there are multiple interpretations of it. You just have certain properties under parent items, but you don't necessarily know, are they their own entity, or how many of them you could possibly have. And if a specific feed, let's say, doesn't contain a certain property, you don't even know that the entity exists. So how are you meant to verify this with your business stakeholders just to get a conceptual grasp of what you're looking at? And when we're talking about everything that is possible today, there's so much stuff coming at us in terms of what we can possibly process and how we can process it. Given my experience and my background, I'm not a data engineer, so I'm not gonna weigh in on what is most efficient or most cost effective.
I kinda limit myself to going back to the understanding that we can process anything we set our minds to process, but how do we make sure that we're dealing with information that we understand and processing it in the correct way, no matter what the format? The example I used in my book is just pirates and pirate ships, things people love, so as not to get overly technical. It doesn't matter that it's in a JSON format. What does matter is I can't bring JSON to my business team, or my captain in this case, and say, hey, can you assess the crew off of this JSON file?
I need to show them something basic, just two boxes that say: this is your captain, this is your ship. A captain has a ship. A crew member can have these weapons or not, etcetera. Is this correct? That's something they can very quickly confirm or deny. And for me, I'll limit myself to just saying: use every tool in your toolkit, but make sure that you also have a reference point for understanding the information and sense checking it with people who are less technical than you might be.
[00:36:46] Tobias Macey:
And then inevitably, we have to address the elephant in the room that is AI and the impact that it is having on this overall practice of data modeling, technical representations of the business processes, etcetera. And given the broad applicability of LLMs, I can imagine them being used across each of the different stages of the data modeling life cycle, from translating documents, business policies, conversation transcripts, etcetera, into some summarized form of: these are the things that we care about, this is what the semantic representation should be for this particular process, to doing the actual SQL generation or schema generation or ERD generation, etcetera. And I'm curious how you're seeing the introduction of LLMs into this overall process and life cycle both accelerate and potentially derail that overall effort?
[00:37:50] Serge Gershkovich:
Alright. A lot to unpack there. Let's start, first of all, by acknowledging that, like you said, LLMs are ubiquitous. AI is here to stay, and if you're not keeping up, you're gonna get left behind like never before. Just as you mentioned, it's gonna be there at every step of your data journey, from the way data is generated, to the assistants that help you along the way, whether it's creating descriptions or creating the data model itself, to asking business questions and making predictions around that data. Let's go with the theme of the previous question of AI as a tool: how do we make sure that the tool is working with us and not against us?
The lessons of previous iterations and previous tech hypes should not be forgotten. We just hearkened back to schema on read, which was the debacle of the early twenty tens, where big data was supposed to solve everything. The data lake was gonna be the new best thing. We don't have to do any modeling. We don't even have to do any transformation. The system will figure it out. Well, we know how that ended. AI is also promising the system will figure it out, and in this case, it actually has a chance of succeeding at that. However, what is an LLM? What is AI at heart? It is a pattern based prediction generator.
I'm oversimplifying, but let's just run with this definition for now. If we're going to ask AI to answer a business question, to understand a whole volume of context all in a single go, wouldn't it make sense for us to make sure that it's doing the least amount of guessing and interpretation possible? Meaning, wouldn't we try to clean up the dataset before we tried to make inferences out of it? The difference between AI today and the things we were doing, let's say, in the early twenty tens is cost. Transactional costs have been falling, storage costs have been falling, that has been the trend. And AI has just flipped that on its head, meaning the cost of being wrong has just gone up. Reloading a multi terabyte table using ELT, that we can live with.
Retraining a multimillion dollar model, that we cannot live with. Heads will roll. Therefore, I would encourage people not to believe the hype, not to fall for the magical thinking of "I can just throw my hands up and AI will do it for me." In this regard, data modeling, again, not drawing entity relationship diagrams, but cleaning up your damn dataset, is more important than it's ever been. I've heard plenty of examples where people who have started doing semantic models and AI models around data immediately bump up against the fact that you can't build a semantic model without having a pretty good understanding of your conformed physical model.
Very quickly, it exposes the duplications, the misalignments, the double names, and it's all part of the same process. It's a spectrum of how detailed versus how general what you're talking about is, whether it's customer generally or customer ID specifically in your physical system. If you have customer in one table and client in another, what do you expect the result will be when you run AI analysis on that? These are the kind of basic things where you don't have to be an AI engineer to understand that you just have to go back to the drawing board and make sure that your fundamentals are correct. But now to the other question of where else AI comes in along that journey. Well, here we open up the doors to the realm of possibility.
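The customer-versus-client misalignment is exactly the kind of thing a mechanical scan surfaces the moment you try to build a semantic layer. A rough sketch of that check, with an invented synonym map and invented table layouts, might look like:

```python
# Flag columns across tables that likely name the same concept differently.
# The synonym map and table layouts below are illustrative assumptions.
SYNONYMS = {"client": "customer", "client_id": "customer_id"}

def find_misalignments(tables):
    """Return (table, column, canonical_name) for columns using a synonym."""
    hits = []
    for table, columns in tables.items():
        for col in columns:
            canonical = SYNONYMS.get(col.lower())
            if canonical:
                hits.append((table, col, canonical))
    return hits

tables = {"sales": ["customer_id", "amount"],
          "support": ["client_id", "ticket_id"]}
hits = find_misalignments(tables)
```

This is deliberately simplistic; the real work is agreeing on the canonical names in the first place, which is the data modeling Serge is describing.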
All of us have examples in our lives where AI has come in and fundamentally systematized or taken a lot of drudgery out of our day by making it quick and easy. So you might need AI to help you get to that point of conforming your dimensions, cleaning up your datasets, or even flagging the parts of the data landscape that are problematic. AI could be very good even in serving your AI strategy, kind of cleaning up after itself before running more system intensive and cost intensive processes. I'm not gonna list every step of the way where AI can step in and make your life easier, because those are infinite. What I will say is what I started with: you have to have, by this point, a reflexive instinct to see where AI can help.
If you're faced with a problem and your first instinct is "how do I manually figure this out?" rather than reflexively asking "where can AI make this easier?", you're going to get left behind. You're just gonna make yourself obsolete. But again, once that instinct is in place, how is AI gonna help you clean up your data? Then, once you've gotten to a place where you're comfortable throwing AI resources at something, you've already ensured success, basically, as much as you can, by making AI's job as easy as possible.
[00:43:55] Tobias Macey:
And to your point that you can't just throw an LLM at the problem and expect it to solve everything without it having some means of grounding itself in the context and semantics of the organization, you highlighted that trying to build that semantic representation for the LLM to feed off of, at least on the consumption side, can help highlight some of the shortcomings in your existing data estate. And one of the recent developments that has been accelerating rapidly is the introduction of tool use, and in particular model context protocol servers. And I'm wondering how you're seeing that inclusion of tool use as a means of building that grounding in the business context and business realities help to provide guardrails and improve the effectiveness of the incorporation of LLMs into each stage of the life cycle, from ideation and definition through to consumption and exploration?
[00:45:00] Serge Gershkovich:
That's a good question, because MCP is still relatively new, and data teams span the gamut. There are those that just operate on SQL. They haven't even really dipped into the Python realm of data analysis. And then there are modern teams that are comfortable with calling an LLM using an MCP server, but they're still relatively few. And especially in the data space, I would say we're still early. For example, if you're staying up to date with some of the innovations of the cloud data platforms like Snowflake and Databricks, they're introducing features that just boggle the mind, even for someone like me who has come up through the ranks of analysis and data architecture: you can effectively write a prompt inline in SQL. So you can select column one, concatenate it with column two, and then in your third column, put a prompt that says, analyze the sentiment of the string that I've concatenated. So for every row in your result, you're effectively seeing an LLM response.
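The exact syntax for an inline prompt varies by vendor, so rather than guess at any platform's SQL, here is a plain-Python stand-in for the per-row shape of that feature: one "model response" per input row, with a keyword heuristic faking the LLM call. The wordlists and reviews are illustrative.

```python
# Stand-in for a per-row LLM call: in a warehouse this would be an inline
# prompt function; here a keyword heuristic fakes the response so the
# row-by-row shape of the result is visible. Wordlists are illustrative.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"awful", "broken", "refund"}

def sentiment(text):
    """Crude keyword-count classifier standing in for a model response."""
    words = set(text.lower().replace(",", "").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = ["Great product, love it", "Arrived broken, want a refund"]
# One "LLM response" per row, like the SELECT-with-prompt example above.
results = [(r, sentiment(r)) for r in reviews]
```

The heuristic also illustrates Serge's next point: this is roughly the gnarly-regex approach the inline model call replaces, and the model version costs real tokens per row.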
And as we said, with great power comes great responsibility. You're not just gonna be spending your regular compute tokens on this. If you're making mistakes there, it's gonna get expensive. And the use cases are infinite. Just to get some kind of basic sentiment analysis, I might have had to do some gnarly regex just to tease out positive or negative keywords. Now AI can pretty confidently tell me: this is a positive review, this is a negative review. So it does make life easy. Some of it is provided by the platforms themselves.
A lot of teams, as you mentioned, are running their own models for security reasons, performance reasons, privacy reasons, models they can effectively hook into using MCP or just regular API calls, because they have use cases that demand it. Again, I'm not really the best person to ask at this stage in the game, just because I'm currently not hands on with any specific data problem. I'm working on a tool that accelerates the work of people who are. What I do hear about is specifically the advanced use cases from customers that I interact with. And for them, there's the technical challenge, kind of the same thing we just discussed, but there's also the challenge of making sure that once they write those pipelines, the core understanding of what it is they're analyzing is there for them.
So it's almost like the old is new again. Sure, you can have infinitely new and advanced processing systems on top of your old columns and rows, but the fundamental understanding of what is a customer, what is an active customer, where is the table, and what are the synonyms for customer, all of that has to be encoded, unfortunately. Otherwise, you're setting yourself up for failure, and a failure where you're gonna blame AI. You're gonna say it isn't smart enough, when your business users probably can't even figure it out if you have so many inconsistencies. The context has to live somewhere. We started with Joe Reis's famous quote: it's still a data model, albeit a crappy one. If it's not systematically encoded, let's say, in a semantic model, it's living in your head. And if it's living in your head, it's not available for AI to consume and orient itself. And that's gonna be a problem, and an expensive one.
[00:49:00] Tobias Macey:
In your work of helping organizations approach that overall challenge of data modeling as more than just a technical exercise, what are some of the most interesting or innovative or unexpected ways that you've seen them develop and implement the overall strategy for that?
[00:49:18] Serge Gershkovich:
Okay, that's a great question, because, again, I have a unique perspective. There are people who are hands on with data in the trenches, and there are people who are hands on with the engineering side of things. I am not a consultant, so I'm not coming in and working on what they build. By design, for privacy reasons, I'm actually not looking at what our customers are building unless they're willing to share that with me. What I do see are the frameworks and the technical demands around how they operate. And, again, if you're a consultant and you've seen hundreds of use cases, maybe some of this won't surprise you. But as I've seen SQL DBM grow from just a modeling tool to an enterprise modeling platform, I've seen the needs of the customers evolve from, again, just the basics: can I document something? Can I draw the relationships?
Do you support all of the database objects and properties natively? Once the tool evolved beyond that point, the types of requests we were getting were: it's not one project, it's multiple data domains, and we wanna communicate. We want to make sure that people are consuming our data products. So it's things like data discovery: how do I discover the fact that a certain data source or a certain data domain even exists? How do I reference it without copying it? The fact that we're an online platform means we make it very easy for users to take things like entities from one project and create a pointer, or a reference, or a read only copy in another project and model downstream from that. The other thing we're seeing, and I'm not exaggerating, is that we have data architects who oversee dozens, if not upwards of a hundred or more, data teams. So some of our features, like global standards, are ways where a single data architect or a data team can enforce things down to their child projects. What are the conventions we're gonna follow? What are the templates we're gonna use? How are we gonna name things consistently?
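The kind of global standard an architect might push down to child projects can be sketched as a simple automated check. The specific convention here, snake_case names with an approved layer prefix, is an invented example, not SQL DBM's actual rule set.

```python
import re

# Sketch of an enforced naming standard: tables must be snake_case and
# start with an approved layer prefix. The convention itself is invented.
APPROVED_PREFIXES = ("dim_", "fct_", "stg_")
NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def violations(table_names):
    """Return the names that break either the case rule or the prefix rule."""
    bad = []
    for name in table_names:
        if not NAME_RE.match(name) or not name.startswith(APPROVED_PREFIXES):
            bad.append(name)
    return bad

bad = violations(["dim_customer", "FactSales", "stg_orders", "tmp1"])
```

Encoding the rule as a check is what turns "suggested" into "enforced": a child project that fails the check simply doesn't pass review or CI.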
Not only is it important for enterprise teams to define that once and only once, but also to make sure it's not just suggested, but enforced. So we have tools and guardrails to make sure that happens. Once you get to a use case of that size, you would expect openness, but a lot of times, it's openness with guardrails. A lot of companies, especially when they provide data as a product or data as a service, have internal teams that are encouraged to share, but share to a point. So it's not even about making something accessible. It's making something accessible and making sure that people only see the parts of it that you want them to see. So there's also privacy and RBAC and restrictions even at the data model level. Another part is security and compliance.
With everything that we're building, it's easy to forget that it's not just data, but there are laws around the data. It's easy to forget until you get slapped with a million dollar fine. So things like that are also very important: not just the data itself, but the metadata, the tagging, the PII flags, the stewardship around, is this a trusted table, or is this a staging area? Can I consume it? The more mature an organization becomes, the more even these things cannot live in somebody's head. They need to be systematized. They have to be fed in through the same CI/CD process that deploys your objects, the same thing that stamps the metadata tag saying this can be promoted up to the reporting layer and this cannot, or this should have certain tags and policies applied to it to make sure that we're not accidentally exposing it to people who shouldn't have visibility.
So for me, the evolution goes from simply, how do I document one table? What are its columns? What do the columns mean? All the way up to, how do I search across an organization where we have upwards of a hundred data teams or more? Each of those data teams has projects or data domains with thousands of tables, and each of those has definitions and relationships. How do we secure it? How do we share it? How do we find it? How do we search across everything? Those are the main challenges, and those are unique and interesting challenges to solve. There's plenty of people working on the engineering side, and the data modeling side is equally important, especially once you get past a certain size as an organization.
[00:54:31] Tobias Macey:
In your experience of working in this space, talking to customers, coming to grips with all of the different details and nuance around data modeling as that socio technical exercise, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:54:47] Serge Gershkovich:
From my perspective, I think the most interesting thing is just the wide gamut of methodologies and standards that people employ. Even as an architect of nearly two decades in the industry, I've only seen what I've seen. I've been a consultant, so I've seen some of the accelerated pace of product development. I've worked at companies as a full time data architect, so I've gotten to know one industry, or several industries, in depth. But with dozens of people a week that I speak to on the client side, I have seen every type of standard imaginable.
Learning to design for the extremes, I guess, has been very important in my role. Not confining myself to the best practices, because we certainly want to guide people to best practices, but also respecting that they might have their own ways of doing things. They might also have their own set of exceptions, so even when we set rules, we wanna be able to grandfather certain legacy things into their design. That's number one. Number two is that the degree of data maturity really matters. As a data architect before joining SQL DBM, I think I might have been a little bit more stringent in the things I recommended, or, again, the best practices that I would tend to follow or recommend others follow.
I have since learned that it's very much relative to the data maturity of the team and the company as a whole, both for the consumers and the producers of the data. So sometimes I go all the way down to just one big table; I don't think you should complicate your life more than that. Other times, I see people effectively building Data Vault across dozens, if not upwards of a hundred, source systems and managing it effectively. I think that's often underappreciated, because nobody wants to admit: okay, as a team, maybe we're still fairly junior. Maybe we have a few rock stars, but as a team, we just can't lift this collectively.
And let's see if I can think of another that I haven't already mentioned.
[00:57:14] Tobias Macey:
And so as you continue to work in this space and explore the evolution of the marketplace and the general growth of data modeling as an established practice in organizations of varying size, because up until the past few years it was largely the domain of larger organizations, and smaller companies just did whatever scrappy thing they needed to get by and didn't necessarily adopt these formal practices, I'm curious what are some of the ways that you are staying up to speed, or particular aspects of the problem space that you're interested in exploring further?
[00:57:54] Serge Gershkovich:
Yeah. That's a great question. I think one obvious consideration is that as the ecosystem grows, the demands on the data modeling tool grow along with it. People aren't just limited to building a conceptual model or a physical model. They're gonna be using that model downstream, or rather upstream, of, let's say, dbt. So instead of just generating DDL, are we also able to generate dbt YAML and hand that off to an analytics engineer? As AI evolves, things like semantic models play a much bigger role. Nobody was asking for semantic models a year ago. Now they're a must have. So can we take that same information and generate it as a semantic model? What's really changing is the format. Most of the core of that information, whether it's a semantic model or a dbt model, is effectively the same information conveyed in a different format. In fact, if the words semantic model scare you, or you're not used to seeing a semantic model, it's almost identical to a glossary. If you go all the way to the other side of the spectrum, past logical, past conceptual, even simpler than that, a data glossary has the same fields as a semantic model. It's just that one is almost like index cards and the other is in YAML. But those are just some examples of more tools. You mentioned the MCP server earlier. Before, we had an API, and now we have an MCP server on top of that. They're still getting effectively the same context, the same information, the same nuance, using those methods.
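The "index cards versus YAML" point can be made concrete: the same glossary fields, rendered in a YAML-like semantic-model shape. The field names (`name`, `description`, `synonyms`) and the output layout are illustrative assumptions, not any specific tool's semantic-model spec.

```python
# The same fields as glossary "index cards", emitted in a YAML-like shape.
# Field names and layout are illustrative, not a real semantic-model spec.
glossary = [
    {"name": "active_customer",
     "description": "Customer with an order in the last 90 days",
     "synonyms": ["current customer"]},
]

def to_semantic_yaml(entries):
    """Render glossary entries as YAML-style text, field for field."""
    lines = ["entities:"]
    for e in entries:
        lines.append(f"  - name: {e['name']}")
        lines.append(f"    description: {e['description']}")
        lines.append("    synonyms:")
        lines.extend(f"      - {s}" for s in e["synonyms"])
    return "\n".join(lines)

print(to_semantic_yaml(glossary))
```

Nothing new is invented in the conversion; that is exactly the point Serge is making: the content was always the glossary, only the serialization changed.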
One, as I said, is keeping up with the industry demands: being able to do the same thing we do with relational objects using semi structured NoSQL, because that plays an ever bigger role in people's tech stacks. That's just the technology stack side. Then there's everything I said earlier about having a different set of challenges as organizations get bigger. So not interpreting one data model or one set of information, but sharing it across teams, making sure we're not duplicating things unnecessarily, making sure we're securing things, making sure that we're tracking data lineage, not just from source systems, but also across the company. All of these are issues that our tool helps solve. Again, I've worked at large companies. I've seen what I've seen. But having the privilege of talking to our customers and having them share their use cases is, to me, an accelerated education I couldn't have gotten anywhere else, short of joining one of them and spending a couple of years understanding the nitty gritty of everything they're working on. The scope of what they're building is enormous, and only a very disciplined and systematic approach is gonna help something like that succeed. So for me, it's also been a little humbling. Seeing what our customers are able to manage and build at scale, and how they're doing it, has been eye opening.
[01:01:20] Tobias Macey:
Are there any other aspects of this space of data modeling as a combination of organizational and technical exercise, your experience in the space, or the work that you're doing to help evangelize the best practices around how to implement it at organizational scale, that we didn't discuss yet that you'd like to cover before we close out the show?
[01:01:38] Serge Gershkovich:
Well, I guess I'll make a not so shameless plug and just recommend my book again. It's called Data Modeling with Snowflake. Don't get hung up on the Snowflake part of it. I just wanted to tie it to one specific system so that someone can actually iterate and try these things out as recipes, and Snowflake is as good a candidate as any, because you can literally get started for free with a trial. It doesn't cost anything, so everyone can follow along. And it's a really great resource for understanding why data modeling matters. It's a book that ties the business value to the technical side, which is what we tend to focus on, on the engineering side and the data side. And if there's one message that I would like to leave your listeners with, it's that data modeling is a team sport, and that team consists of business users and technical data users, data engineers, and data architects.
And it's folly to forget the business side. So to close out, I would just say: understand that a blueprint is important. Understand that your business is fundamental in helping shape that blueprint. Understand that they are not as technical as you are, and things that are obvious to you might not be so obvious to them. It's almost a beginner's mindset: why should somebody else care about the important work that I'm doing, and how do I get them involved and make it easy for them to contribute? Just starting there is a really good framework for making sure your data initiatives are off to a good start and have a higher chance of succeeding.
[01:03:28] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:46] Serge Gershkovich:
I would go with a lack of standards overall, which is not to say that we don't have standards. It's usually the opposite problem: we have too many, take your pick. And that makes it really difficult to standardize and make things effective. For example, you can ask any data engineer: if I wanted a description, where would I put it? In the comment of a column or a table. That much is clear. But beyond that, how do I tag this? Is it dbt table properties, or is it Snowflake tags, or take your system of choice? How do I expose this? We're starting to see things in the catalog space, but there's still no single standard. For you name it, there's no single standard to do it. And that can trip people up in terms of which tool to bring in if it's not standardized. Can we use this tool? How do we make sure that the tools talk to each other? It's hard and it's overwhelming.
There's no right answer, and it's holding the industry back, I think. On the other hand, we've seen these cycles in the data space: first we couple, then we decouple. And fortunately, when that happens, we at least tend to learn our lessons along the way about what works and what doesn't. So overall we tend to be moving in the right direction, but it still makes it very difficult to recommend something and say that it's a one-size-fits-all solution. So it's really about keeping up to date, staying current, and being able to mix and match when you need to.
[01:05:35] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and expertise on this overall practice of data modeling and some of the ways that it manifests in organizations and the role of technical teams in that overall process and how to best engage with the business. So I appreciate the time and effort you're putting into helping popularize that approach, and I hope you enjoy the rest of your day.
[01:06:01] Serge Gershkovich:
Thank you so much. It's been a pleasure. Great talking to you.
[01:06:11] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months-long migration nightmares into week-long success stories.
Enterprises today face an enormous challenge. They're investing billions into Snowflake and Databricks, but without strong foundations, those investments risk becoming fragmented, expensive, and hard to govern. And that's especially evident in large, complex enterprise data environments. That's why companies like DIRECTV and Pfizer rely on SQL DBM. Data modeling may be one of the most traditional practices in IT, but it remains the backbone of enterprise data strategy. In today's cloud era, that backbone needs a modern approach built natively for the cloud with direct connections to the very platforms driving your business forward. Without strong modeling, data management becomes chaotic, analytics lose trust, and AI initiatives fail to scale.
SQL DBM ensures that enterprises don't just move to the cloud, they maximize their ROI by creating governed, scalable, and business-aligned data environments. If global enterprises are using SQL DBM to tackle the biggest challenges in data management, analytics, and AI, isn't it worth exploring what it can do for yours? Visit dataengineeringpodcast.com/sqldbm, that's sqldbm, to learn more. Your host is Tobias Macey, and today I'm interviewing Serge Gershkovich about how and why data modeling is a socio-technical endeavor. So, Serge, can you start by introducing yourself?
[00:02:16] Serge Gershkovich:
Sure. Serge Gershkovich. I've been a data modeler and data architect by trade for most of my professional career, and I'm also the head of product at SQL DBM, which is an online database modeling tool. It's a SaaS product that runs in the browser. So data modeling is very near and dear to my heart. I'm also a Snowflake Data Superhero, a big fan of Snowflake, as you can tell, and everything that they're doing. And, yeah, that's me in a nutshell.
[00:02:53] Tobias Macey:
And do you remember how you first got started working in data?
[00:02:56] Serge Gershkovich:
Kind of by accident, I think, the way most people end up in data. Nobody tells their parents at a tender young age, I wanna be in data when I grow up. But I started with a technical degree, with a computer science background, landed my first internship on the data side of a company, and just kind of stayed there. It was interesting. It was challenging in a way that programming, I'm sure, has different challenges, but that's where I ended up. It was in the SAP ecosystem to start, and I spent over a decade in SAP: first BW, then HANA, with the business intelligence suite alongside that. Eventually I decided, once I saw Snowflake, that I never wanted to touch another SAP system again, and I've lived up to that promise.
I'm very, very proud of that. And, yeah, eventually I just kind of shifted gears to product management.
[00:04:01] Tobias Macey:
And so in terms of the overall space of data modeling, that is a term that can be used across a large number of contexts and use cases. And I'm wondering if you can just start by giving your definition of the activities that are involved in that overall practice of data modeling.
[00:04:23] Serge Gershkovich:
That's a great question, because I want to first start with what it isn't and maybe dispel some common myths that might persist for those that are unfamiliar with it, not used to doing it on a regular basis, or not doing it systematically as part of a project or the data strategy at their organization. If I could summarize it in one phrase or image, it would be a conversation between a business stakeholder or a subject matter expert and a data team: data engineer, data architect, business intelligence team, analyst.
It's understanding what the business does and encoding that semantic information in such a way that our data models and our data structures accurately capture those business requirements and the needs of the business. We can, of course, broaden the definition from there. But at the core, I think if people walk away with anything, it's being able to align with your business stakeholders so that everybody's on the same page, and to have some kind of semantics, whether those semantics are visual, verbal, technical, or preferably all three, that everybody can understand and share as a common reference point.
[00:05:52] Tobias Macey:
And on the flip side, you mentioned some of the things that it's not. And I'm wondering if you can talk to some of the overall buckets of misconceptions that you've seen people come into the overall practice of data modeling with where maybe they have a very narrow view of the set of tasks and responsibilities involved, or they have a different area of focus or a different set of objectives as far as what that data modeling practice is supposed to achieve and maybe some of the ways that that can derail the overall exercise.
[00:06:29] Serge Gershkovich:
Sure. I think the biggest misconception is that it doesn't exist or it's optional. Joe Reis famously said that the lack of a data model is still a data model, albeit a crappy one. This misconception probably has its roots in the fact that most of us that work with data do so on the analytics side. So we're looking at it not in the transactional system, but in the data warehouse, where you could make the claim that we're not, broadly speaking, formally modeling anything. We might just be getting data delivered, landed in our raw layer, and then working our way up through the transformations to reporting. So we're not formally creating ERDs. We're not formally discussing our designs and the relationships that our dimensions have with our facts. But by definition, we have a data model. It's just that we don't own it. It comes cascaded down to us from the source system, where, of course, you can't have the sort of chaos you might find in a data warehouse, because sales wouldn't get recorded, bookings wouldn't get made, etcetera. So there's absolutely a data model involved in everything you do. Our CRM systems, our master data management systems, they all have a data model. And even if that data model lives in your head, and even if your data landscape is small enough for you to effectively keep it there, that doesn't mean the data model doesn't exist. So that's the biggest misconception.
The other misconception is that a data model is secondary, something that we do after the fact. So we build first. And, as much as I love dbt, they have a lot of blame in this misconception: that you can build, you can write SQL, and you can effectively CTAS your way to a data model, and if you need to document something, you describe that DDL and off you go. This is why I started with fundamentals. It's much easier to check something out on paper before you actually build the thing, before you write lines of code and commit business logic, loading, testing, and all of those deployment resources, just to find out that you couldn't align on the fundamental formula or an understanding of something. The point I'm making is that data modeling is a sanity check. When it's done right, the order is: talk to the business, get the requirement, encode it in something very simple, just an ERD, for example, and then go off and build. Because the chance that you've made a mistake after two pairs of eyes, especially the business owners' or the subject matter experts', have had a look at it is dramatically smaller. And dbt even has guardrails in place where you can enforce data contracts and things like that, which prevent you from building something that doesn't meet your original spec of required data types, number of columns, references, and so on. The analogy I like to use is a map. A data model is effectively a map of where you're trying to go in your data strategy. And I'll come back to the transformational side of data modeling, let's put a pin in that one. But fundamentally, it's a map. And do you need a map? Well, the answer is sometimes, not always.
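The guardrail Serge describes, checking what you're about to build against an agreed spec before shipping it, can be pictured as a simple schema comparison. This is a minimal Python sketch of that idea; the table, column names, and helper function here are invented for illustration and are not dbt's actual contract implementation:

```python
# Minimal sketch of a data-contract check: compare the agreed spec
# against what was actually built, before anything ships.
# All names here are hypothetical.
contract = {
    "customer_id": "int",
    "customer_name": "str",
    "signup_date": "date",
}

built_table = {
    "customer_id": "int",
    "customer_name": "str",
    # "signup_date" was forgotten during the build
}

def check_contract(contract, built):
    """Return a list of violations between the spec and the implementation."""
    violations = []
    for col, dtype in contract.items():
        if col not in built:
            violations.append(f"missing column: {col}")
        elif built[col] != dtype:
            violations.append(
                f"type mismatch on {col}: expected {dtype}, got {built[col]}"
            )
    for col in built:
        if col not in contract:
            violations.append(f"unexpected column: {col}")
    return violations

print(check_contract(contract, built_table))
# A non-empty list means the build diverged from the agreed model.
```

The value is in when the check runs: catching the missing column at design or build time is the cheap version of the rebuild loop Serge describes later.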
If you're standing in front of, let's say, a shopping center and it's just a strip mall, you don't need to consult the directory, because you see all five stores and you walk into the one you're looking for. If you're in an enormous shopping center, then you'd probably use a map just to find your car in the parking lot. And you wouldn't just wander around aimlessly if you're looking for something very specific. It would be in your best interest to spend two seconds consulting the directory, walking directly to where you need to go, and getting that information.
Unfortunately, that's another of the drivers that make people overlook data modeling: the fact that, just like in this analogy, if you walk into a shopping center and you just start wandering around and asking strangers, you will eventually get to where you want to go. What this looks like in the business world, to take our previous example, is a data architect dealing with a data landscape that's effectively small enough to fit inside their head. Imagine it's a startup, a small company. But the company grows, your data sources grow, your data assets grow, your business units grow, your business teams and data teams grow.
The problem is you start to feel the symptoms, you start to feel the pain, before you've identified that a problem exists, because it feels quotidian. Before, I used to know everything. Today, I didn't know, so I asked the colleague next to me. Tomorrow, he didn't know either, so we called a meeting with the business stakeholders. The day after, the business stakeholders couldn't really give us an answer either, so we guessed and we built something. Turns out it was wrong, so we went back to design and rebuilt it. All good. But the problem is, it's like a frog in boiling water: everything was okay, and the pain points are all incremental. So it feels fine, nothing's broken. But fast forward a year, a year and a half: data quality issues, lack of trust. We're blaming and getting blamed instead of collaborating with those business stakeholders like we're meant to. We're not delivering value, because even when we get it right, the data is suspect.
And how did we get here? It wasn't overnight; it was gradual. This is another pain point where, if you know to look for the red flags and the warning signs, you can avoid this doomsday scenario. Unfortunately, it's one that I think anybody who has worked in data has seen some version of, whether full blown or just starting, because, again, no data strategy is perfect. But these are just tools that help people orient, help them understand what the company is doing and what it is we're capturing about what the company is doing, and then present that to the decision makers.
[00:12:58] Tobias Macey:
Another point that you raised in this overall space of data modeling is the source systems that you're pulling data from. But data modeling is also not something that is the sole domain or dominion of the data warehouse environment. You also need to incorporate data modeling in exercises such as streaming ingest, or if you're doing any data distribution to applications or other contexts, or if you're trying to incorporate unstructured or semi-structured data assets into a warehouse environment or an application context. You still need to understand the actual purpose and semantics of that data.
And I'm wondering how you see that recognition of the other modalities and localities of data being brought into that overall conversation of data modeling, and what bearing does this piece of information or this collection of information have on the overall business domain model?
[00:14:03] Serge Gershkovich:
Great question, because I would say that I don't see it, or at least not as much as I would like to. Just like myself, many people end up in the data world without a formal background. I've seen many people come even from the business side: they were the data expert on their team because they were managing the big Excel that nobody else knew how to make sense of. They eventually get formalized into the data team, get taught the fundamentals of SQL, and then they're off and running. But many people don't know, for example, that even what we call a database today is founded in relational theory, which is mathematics.
So the technology that we're using, of course, all of the cloud compute and caching and storage, there have been massive innovations there. But fundamentally, why do we call it a relational database? Because it is grounded in relational theory, which goes back to the sixties and seventies with Edgar Codd, who codified all of this, and that eventually gave rise to the database as we know it today. Concepts like rows and columns were revolutionary at some point. And when all of this was getting codified, transformational modeling, and like I said, I'll get back to what I think that term should encompass, was nascent or almost didn't exist. The fact that you would be analyzing data across systems, and not just within one system, was not yet a thing. And today we're looking at the contrast with a transactional system, where all of our models, largely speaking, can be expressed in that classical definition of entities, relationships, dependencies, primary and foreign keys.
That is not so much the case on the warehousing side, where the OLAP system has a different type of dependency. It's not so much a formal parent-child relationship, but it is formal in the sense that if the table I'm pulling from tomorrow is missing the column I'm trying to pull, then the pipeline is gonna break. So this is a transformational dependency. It's not that the dimension is a conformed dimension from parent to child, but you have a data pipeline that depends on all of these things coming together. So what are you trying to solve on the OLTP side versus the OLAP side? On the OLTP side, you're just trying to make sure that everything is as close to normalized as possible and that your data is consistent and not duplicated.
And you're able to maintain it in such a way that avoids all types of dependency and update errors, anomalies, etcetera. On the data warehouse side, we can almost guarantee that if we built something upstream, our data is pretty much clean when it lands. Of course, that's not always the case, but let's just go with it. On the data warehouse side, as you mentioned, you have multiple systems. You have conformed dimensions by definition. You have transformational dependencies. So let's unpack that. What does that mean? When the data lands, it's quite dirty. It needs to be renamed, cleaned up, made consistent. Then there are all the different methodologies you could be using, and we'll park that for a later question, but let's just go with the general pattern, whether you call it medallion or staging and reporting or whatever you wanna use. You have a raw layer that is just data as is. You have a staging layer where it's a little more conformed and normalized, but still not overly aggregated. And then you have a reporting layer that is maybe denormalized completely, into one big table if you wanna go to that extreme. Regardless, you have pipelines that transform the data across those three layers. So you have dependencies around selection, and you have quality issues in terms of whether you're calling the same field by different names along different parts of that process.
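The three layers Serge walks through, raw data as is, staging renamed and cleaned, reporting aggregated, can be sketched as plain functions over rows. The column names and cleanup rules below are invented purely for illustration:

```python
# Illustrative medallion-style pipeline: raw -> staging -> reporting.
# Source column names (CUST_NM, AMT, DT) are hypothetical.
raw = [
    {"CUST_NM": "  Acme ", "AMT": "100.50", "DT": "2024-01-03"},
    {"CUST_NM": "Beta",    "AMT": "25.00",  "DT": "2024-01-04"},
    {"CUST_NM": "  Acme ", "AMT": "75.25",  "DT": "2024-02-01"},
]

def to_staging(rows):
    """Rename cryptic source columns and fix types; no aggregation yet."""
    return [
        {
            "customer_name": r["CUST_NM"].strip(),
            "amount": float(r["AMT"]),
            "order_date": r["DT"],
        }
        for r in rows
    ]

def to_reporting(rows):
    """Denormalize and aggregate: total amount per customer."""
    totals = {}
    for r in rows:
        totals[r["customer_name"]] = totals.get(r["customer_name"], 0.0) + r["amount"]
    return totals

staging = to_staging(raw)
reporting = to_reporting(staging)
print(reporting)  # {'Acme': 175.75, 'Beta': 25.0}
```

The transformational dependencies Serge mentions are visible even here: if the source drops `AMT` or renames `CUST_NM`, `to_staging` breaks, even though no formal parent-child relationship exists between the layers.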
You have the fact that you can have multiple processes working on that data and creating, let's say, the sales view and the marketing view of our trading information. Now, by definition, the core is duplicated, and some of the logic around it is specific to a certain department. Which one is right? Can I pull yearly totals from either one, or have they been distorted by business logic that's specific to each business unit? So now you run into all sorts of problems that didn't exist on the transactional side. So how do we manage that? A lot of these things, for example, if you take a view, what is a view doing? It's applying SQL logic to a dataset and giving you a different kind of result.
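Serge's "which yearly total is right?" problem can be made concrete: two views share a core dataset, but each applies department-specific logic, so their totals silently diverge. The business rules and fields here are made up for the sake of the example:

```python
# Two hypothetical department "views" over the same core sales data.
# Each applies its own business logic, so totals diverge silently.
core_sales = [
    {"order_id": 1, "amount": 100.0, "channel": "web",    "returned": False},
    {"order_id": 2, "amount": 200.0, "channel": "retail", "returned": False},
    {"order_id": 3, "amount": 50.0,  "channel": "web",    "returned": True},
]

def sales_view(rows):
    """Finance-style total: exclude returned orders."""
    return sum(r["amount"] for r in rows if not r["returned"])

def marketing_view(rows):
    """Marketing-style total: web channel only, returns included."""
    return sum(r["amount"] for r in rows if r["channel"] == "web")

# Same "yearly total" question, two different answers:
print(sales_view(core_sales))      # 300.0
print(marketing_view(core_sales))  # 150.0
```

Neither number is wrong per se; the problem is that nothing about the views themselves tells a consumer which definition they embody, which is exactly the stewardship gap Serge comes back to with the "active customer" example.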
Unfortunately, the word logical has already been co-opted. In formal modeling, generally, we start with a conceptual model, which is very basic. Then we go to the logical layer, which is similar to physical but not tied to any particular physical database, which makes it easy to understand and then transform into a physical database, or multiple physical instantiations. And then physical, where you're actually describing the particulars like clustering and partitions and column data types, things that are gonna help you effectively store the data. On the logical side, you're not so much concerned about storage. You're concerned about nuances, business nuances: is this a subtype of that other type? How does a customer place an order? Can a customer, must a customer, place an order to even be a customer? Things like that that you're trying to iron out.
And what is a good word to call everything that happens in a data warehouse? I am trying to go with transformational. In my intro, I forgot to mention that I'm also an author. I have a book called Data Modeling with Snowflake, born out of many conversations with folks, given the nature of my role at SQL DBM. People kind of understand the need, the importance, or they've just heard the term data modeling, and they want a little bit of a foothold, because, like we said earlier, maybe they didn't have a formal background in it. People would come to me and say, can you recommend some reading material? And, honestly, I was struggling. I wasn't gonna throw C.J. Date at them when they just came from the business world. So I thought, let me write something that is practical and to the point about what it is, why do it, how it's gonna help you, and here are some basic formulas. That's why I wrote Data Modeling with Snowflake specifically: so that users can not just read about a type two dimension, but actually take a script that is optimized for Snowflake architecture, create one, and see what it looks like. And of course, I didn't stop there, because even in a type two dimension, there are transformations taking place. So it's important to tie this all back together. It's important to understand that the transformational component of modeling is real, is important to document, to understand, to codify, even if it looks more like data lineage than an entity relationship diagram.
But at the heart of it, it's still one entity being connected to another entity, not as a relationship, but as a dependency, for example. Because those are the types of questions that come up in the data warehousing context: I'm seeing this pretty dashboard, but where is this information coming from? How do I trust it? How do I trace it back to the source? And of course, there's a whole ecosystem of monitoring tools, lineage tools, catalogs. And it can be a little overwhelming, first of all, if it's done at all. Second of all, if it's done, is it being done on somebody's prerogative of let's just buy something so I can claim we have an installation of such a thing, or is it actually being implemented to answer practical business questions of lineage? And is it up to date, etcetera? So a very long answer to a very complicated problem, but I hope I covered the edges.
[00:22:48] Tobias Macey:
No, I think that's a great summary of that problem space, digging into some of that technical transformational aspect where you were mentioning things like type two dimensions, conformed dimensions, dimensional modeling. When you're in that context of building the warehouse as effectively the digital twin of the organization, one of the stumbling blocks or points of analysis paralysis that can often come up is deciding on the dimensional approach to take. Am I going to do the Kimball-style star schema? Am I going to do data vault? Am I going to go and explore anchor modeling? Do I just throw up my hands and do the one-big-table approach? And I'm wondering how you're seeing teams try to leapfrog or short-circuit some of that analysis paralysis and just start doing some building and discovery to then be able to build the feedback loops that help inform exactly which approach to take, or exactly how to implement whichever approach they're leaning towards.
[00:23:54] Serge Gershkovich:
That's a great question, and I think I have a very unique perspective on it that could be helpful, because in my role as head of product at SQL DBM, not only do I work on the product side, but I naturally also meet with our customers. And not just customers, but people who are early in the funnel, maybe just trialing the tool. So I get to see not just one use case, as I would as an architect working with a company long term, or even as a consultant working on a project over the course of six months to a year. I get to speak with dozens of people a week, a month, and I get to touch base with them over those increments of first week, first month, first six months, a year. And I think the failures, the use cases that didn't pan out, are just as telling and informative as the ones that did. So one of the counterintuitive things I saw from this vantage point is that technical prowess is not a guarantor of success for data initiatives.
Very early on, and this caught me off guard, I would be speaking to people who, some of them, would literally brag about having met Kimball and Inmon and having been taught by them. So initially, naively, I would think, surely they know what they're doing. Then fast forward six months, and I see that they're not really using the tool. Fast forward a year, and they're not renewing, because their plan basically fell apart. And what went wrong is they effectively said, well, I'm the architect. I'm the one that's gonna be doing the heavy lifting, so I'll just need a license for myself, and I'll get it done. Problem: they are not involving the business.
They're not asking for viewer licenses. They're not asking for an embed of this into their wiki or their Confluence so that, as they're building, the business can look on and sense-check what they're doing. So that's number one. The number one biggest mistake I see, the biggest red flag for a project that is likely to flounder, is a lack of understanding that the data team does not own the concept of what it is they're building. They own its implementation. They don't necessarily know, or by design, they don't know as well as their business colleagues, the intricacies of the things they're supposed to be capturing.
Number two, and I think you kind of hinted at this, is that a lot of times people want the awards. They want the accolades on their resume, the stripes on their epaulettes of I implemented data vault, because they heard about it or saw it succeed in another context, or I did anchor modeling. There are people who understand anchor modeling, who are comfortable in sixth normal form, and God bless them. And there are people who just kind of read the textbook and are ready to go all in and risk their company's data strategy on it. That's the problem of wanting to do something for the sake of doing it, which I would say is the second biggest reason I see implementations fail: the need just isn't there to justify that amount of complexity.
And sometimes it's an ego problem. You might have a really talented data architect, a seasoned veteran, certified, what have you, but not their team. Their team might be relatively junior to them, so they can't maintain it, or they make mistakes that the architect wouldn't have made. It's holding them back more than it's accelerating what they're doing. Of course, at a certain point, you can't survive without Data Vault or something like it to handle 50 source systems. You're not gonna try to do that by hand.
And the last one I would say is kind of the opposite of the previous example: going to the other extreme of not having control over the fundamentals. And when I say fundamentals, I really mean fundamentals. When I said, what is data modeling? It's a shared concept of semantics, because semantics communicate a density of information that words, sometimes, when there's lots of them, simply cannot. Those semantics could be visual. We're not on video, but I like to show a slide of just a list of tables that you would see when you log into any database. You open the database, open the schema, and you see a list of tables.
Even when it's like 10 tables and they're properly, cleanly named, customer, product, sales, it might still take us, even an expert, some data wrangling to understand: okay, is a location a customer location, or is it a supplier location? A simple diagram with just boxes and lines can solve that problem visually, instantly. Other semantics are verbal. Earlier, we said the words type two dimension. Those are three words that to many of you likely symbolize so much more than what is in those three words. For example, you know what to expect of the granularity of that table: it's not just gonna be the business key. You understand that it's likely gonna have columns similar to from and to dates, if not others. Maybe you even know the pseudo code required to load one effectively.
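The shared expectations Serge describes, a business key with many versions bounded by from and to dates, and the pseudo code to load one, can be sketched generically. This is a hedged illustration of a type two slowly changing dimension in plain Python, not the Snowflake-optimized script from his book; the table and column names are invented:

```python
from datetime import date

# Illustrative type 2 slowly changing dimension: each business key can
# have many rows, with valid_from/valid_to bounding each version.
HIGH_DATE = date(9999, 12, 31)  # sentinel "to" date for the current version

dim = [
    {"customer_id": 42, "city": "Madrid",
     "valid_from": date(2023, 1, 1), "valid_to": HIGH_DATE},
]

def apply_change(dim, customer_id, new_city, as_of):
    """Close the current version and open a new one if the attribute changed."""
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] == HIGH_DATE:
            if row["city"] == new_city:
                return dim  # no change, nothing to do
            row["valid_to"] = as_of  # close out the old version
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": as_of, "valid_to": HIGH_DATE})
    return dim

apply_change(dim, 42, "Lisbon", date(2024, 6, 1))
# dim now holds two versions: Madrid (closed) and Lisbon (current).
```

This is the point of the shared vocabulary: saying "type two dimension" implies all of this structure, the extra rows per business key, the date bounds, the close-and-insert load pattern, without anyone having to spell it out.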
These are the things we expect, so that when I say type two dimension, you don't just go scratching your head and building something random. The same thing goes for business concepts: when I say active customer, that's a popular one, because nobody has an accurate definition of what an active customer is, or there are probably multiple of them. And unless we know how to steward that knowledge, that context, we can't say: okay, we used to have just one definition; now we have three departments with their own definitions; this is the company standard, this is the marketing standard, etcetera.
When two people are looking at each other and nodding while not necessarily thinking the same thing, that's what leads to problems. So, back to the point I was making, the other issue is a lack of fundamentals, the principles that drive success. They're just best-practice patterns. You don't necessarily have to model in Kimball style, but knowing what a fact versus a dimension is is very helpful just in trying to describe what type of information a certain table contains.
Some of these things are foundational, and a lot of times, like I said, they're absent from the vernacular. That's why I started with the very first question of what data modeling is. It's not necessarily an ERD. It's the minimal effective dose of consensus that moves us forward as a team.
[00:31:28] Tobias Macey:
So we've been narrowing in from the overarching purpose through the incorporation of the business concepts and doing the organizational modeling, trying to map that to a technical representation where the general idea in data warehousing is to serve as some sort of digital representation of the real world manifestation of the business and its processes so that you can then do some analysis and reporting and maybe prediction on the overall functioning of the business. And that inevitably involves the technical details that we've been touching on and that those technical details can be the dimensional approach that we've been discussing. It can be the specific database engine. It can be the ways that the data is represented or stored, whether that is in a lake or a warehouse or as free text, etcetera.
And I'm curious what you're seeing as the most significant impacts that the technical decisions tend to have in either accelerating or hindering the capabilities of the organization to actually implement that digital twin representation of the business.
[00:32:49] Serge Gershkovich:
It's a mixed bag, because technology either accelerates our progress or just allows us to fail faster. If it's complicating more than it's adding value, then it's doing the opposite of what it's meant to do. There are plenty of use cases for everything that you mentioned, starting, let's say, with semi-structured data, where we've seen relational set the groundwork. But sometimes, especially when we're dealing with IoT, with sensor information, with logs, the data is semi-structured, or we're actually using our data lakehouse or data lake to its full potential with unstructured data.
These are just ways of encoding, of being able to quickly process the information as it's being loaded. But how do we ensure that things inside these nested relationships are accurate? Our brains are not great at this. Just like we're not great at visualizing exponential growth, we're not great at visualizing nested data, especially once it goes beyond one or two levels. In fact, one of the exercises in my book is that I set out to prove why the schema-on-read approach failed. It's not because it's slower. It's not because modern systems can't handle semi-structured data just as well as structured. It's because, fundamentally, "is" does not imply "ought."
It's actually the same problem as with relational, but it further compounds with semi-structured: once you tear it apart, you realize that there are multiple interpretations of it. You just have certain properties under parent items, but you don't necessarily know whether they are their own entity or how many you could possibly have. And if a specific feed doesn't contain a certain property, you don't know that the entity even exists. So how are you meant to verify this with your business stakeholders just to get a conceptual grasp of what you're looking at? And when we're talking about everything that is possible today, there's so much stuff coming at us in terms of what we can possibly process and how we can process it. By my experience and my background, I'm not a data engineer, so I'm not gonna weigh in on what is most efficient or most cost-effective.
I kinda limit myself to going back to the understanding that we can process anything we set our minds to process, but how do we make sure that we're dealing with information that we understand and processing it in the correct way, no matter what the format? The example I used in my book is just pirates and pirate ships, things people love, so as not to get overly technical. It doesn't matter that it's in a JSON format. What does matter is that I can't bring JSON to my business team, or my captain in this case, and say, hey, can you assess the crew off of this JSON file?
I need to show them something basic, just two boxes that say: this is your captain, this is your ship. A captain has a ship. A crew can have these weapons or not, etcetera. Is this correct? That's something they can very quickly confirm or deny. And I'll limit myself to just saying: use every tool in your toolkit, but make sure that you also have a reference point for understanding the information and sense-checking it with people who are less technical than you might be.
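Serge's point about nested data hiding structure can be sketched in a few lines. The feed below is hypothetical, in the spirit of the book's pirate example but with field names invented here, not taken from the book: walking the JSON and emitting plain parent-has-child statements produces exactly the kind of two-box summary a stakeholder (or captain) can confirm or deny at a glance.

```python
import json

# Hypothetical nested feed; field names are invented for illustration.
feed = json.loads("""
{
  "ship": {
    "name": "Black Pearl",
    "captain": {"name": "Jack"},
    "crew": [
      {"name": "Gibbs", "weapons": ["cutlass"]},
      {"name": "Cotton", "weapons": []}
    ]
  }
}
""")

def summarize(node, parent="root"):
    """Walk nested JSON and emit parent-has-child facts that a
    business stakeholder can confirm or deny without reading JSON."""
    facts = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)):
                facts.append(f"a {parent} has {key}")
                facts.extend(summarize(value, key))
    elif isinstance(node, list):
        for item in node:
            facts.extend(summarize(item, parent))
    return facts

# dict.fromkeys deduplicates while preserving order (Python 3.7+).
for fact in dict.fromkeys(summarize(feed["ship"], "ship")):
    print(fact)
```

Note what the walk cannot tell you, which is Serge's deeper point: whether `weapons` is its own entity, how many a crew member may have, or whether an absent key means the entity doesn't exist. Those questions still need a human conversation.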
[00:36:46] Tobias Macey:
And then inevitably, we have to address the elephant in the room that is AI and the impact that it is having on this overall practice of data modeling, technical representations of the business processes, etcetera. Given the broad applicability of LLMs, I can imagine them being used across each of the different stages of the data modeling life cycle: translating documents, business policies, conversation transcripts, etcetera, into some summarized form of these are the things that we care about, this is what the semantic representation should be for this particular process; doing the actual SQL generation or schema generation or ERD generation, etcetera. I'm curious how you're seeing the introduction of LLMs into this overall process and life cycle both accelerate and potentially derail that overall effort?
[00:37:50] Serge Gershkovich:
Alright, a lot to unpack there. Let's start, first of all, by acknowledging that, like you said, LLMs are ubiquitous. AI is here to stay, and if you're not keeping up, you're gonna get left behind like never before. Just as you mentioned, it's gonna be there at every step of your data journey, from the way data is generated, to the assistants that help you along the way, whether that's creating descriptions or creating the data model itself, to asking business questions and making predictions around that data. Let's go with the theme of the previous question of AI as a tool. How do we make sure that the tool is working with us and not against us?
The lessons of previous iterations and previous business hypes should not be forgotten. We just harked back to schema-on-read, which was the debacle of the twenty tens or so, when big data was supposed to solve everything. The data lake was gonna be the new best thing. We don't have to do any modeling. We don't even have to do any transformation. The system will figure it out. Well, we know how that ended. AI is also promising that the system will figure it out, and in this case, it actually has a chance of succeeding at that. However, what is an LLM? What is AI at heart? It is a pattern-based prediction generator.
I'm oversimplifying, but let's just run with this definition for now. If we're going to ask AI to answer a business question and understand a whole volume of context all in a single go, wouldn't it make sense for us to make sure that it's doing the least amount of guessing and interpretation possible? Meaning, wouldn't we try to clean up the dataset before we tried to make inferences from it? The difference between AI today and the things we were doing, let's say, in the early twenty tens is cost. Transactional costs have been falling, storage costs have been falling; that has been the trend. AI has just flipped that on its head, meaning the cost of being wrong has just gone up. Reloading a multi-terabyte table using ELT is something we can live with.
Retraining a multimillion-dollar model is something we cannot live with. Heads will roll. Therefore, I would encourage people not to believe the hype, and not to fall for the magical thinking of "I can just throw my hands up and AI will do it for me." So in this regard, data modeling, again, not drawing entity relationship diagrams, but cleaning up your damn dataset, is more important than it's ever been. I've heard plenty of examples where people who have started doing semantic models and AI models around their data immediately bump up against the fact that you can't build a semantic model without having a pretty good understanding of your conformed physical model.
Very quickly, it exposes the duplications, the misalignments, the double names, and it's all part of the same process. It's a spectrum of how detailed versus how general you're talking, whether it's customer generally or customer ID specifically in your physical system. If you have customer and you have client across two different tables, what do you expect the result will be when you run AI analysis on that? These are the kind of basic things where you don't have to be an AI engineer to understand that you just have to go back to the drawing board and make sure that your fundamentals are correct. But now to the other question of where else AI comes in along that journey. Well, here we open up the doors to the realm of possibility.
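The customer-versus-client problem Serge describes can be made concrete with a toy example. This is a minimal sketch using SQLite, with invented table and column names: two teams modeled the same real-world entity under different names, so a naive count over both tables double-counts, while a conformed view over the agreed business key does not.

```python
import sqlite3

# Hypothetical schema: the same entity modeled twice under
# different names, a classic conformance problem.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT);
    CREATE TABLE client   (client_id   INTEGER, name TEXT);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO client   VALUES (1, 'Acme'), (3, 'Initech');
""")

# A naive "how many customers do we have?" double-counts Acme,
# because nothing declares the two tables as one entity.
naive = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT name FROM customer UNION ALL SELECT name FROM client
    )
""").fetchone()[0]

# A conformed view deduplicates on the agreed business key.
conformed = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT name FROM customer UNION SELECT name FROM client
    )
""").fetchone()[0]

print(naive, conformed)  # 4 3
```

Any AI layer querying this schema faces the same ambiguity the naive query does, which is why conforming the model comes before throwing an LLM at it.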
All of us have examples in our lives where AI has come in and fundamentally systematized or taken a lot of drudgery out of our day by making things quick and easy. So use AI to help you get to that point of conforming your dimensions, cleaning up your datasets, or even flagging the parts of the data landscape that are problematic. AI could be very good even in serving your AI strategy, cleaning up its own inputs before running more system-intensive and cost-intensive processes. Again, I'm not gonna list every step of the way where AI can step in and make your life easier, because those are infinite. What I will say is what I started with: you have to have, by this point, a reflexive instinct to see where AI can help.
If you're faced with a problem and your first instinct is "how do I manually figure this out?" instead of reflexively asking "where can AI make this easier?", you're going to get left behind. You're just gonna make yourself obsolete. But again, once that is in place, how is AI gonna help you clean up your data? Then, once you've gotten to a place where you're comfortable throwing AI resources at something, you've already ensured success, basically, as much as you can, by making AI's job as easy as possible.
[00:43:55] Tobias Macey:
And to your point that you can't just throw an LLM at the problem and expect it to solve everything without it having some means of grounding itself in the context and semantics of the organization, you highlighted that trying to build that semantic representation for the LLM to feed off of, at least on the consumption side, can help highlight some of the shortcomings in your existing data estate. One of the recent developments that has been accelerating rapidly is the introduction of tool use, and in particular model context protocol servers. I'm wondering how you're seeing that inclusion of tool use, as a means of building that grounding in the business context and business realities, helping to provide guardrails and improve the effectiveness of the incorporation of LLMs into each of the stages of the life cycle, from ideation and definition through to consumption and exploration?
[00:45:00] Serge Gershkovich:
That's a good question, because MCP is still relatively new and data teams span the gamut. There are those that just operate on SQL and haven't even really dipped into the Python realm of data analysis. And then there are modern teams that are comfortable with calling an LLM using an MCP server, but they're still relatively few, and especially in the data space, I would say we're still early. For example, if you're staying up to date with some of the innovations of the cloud data platforms like Snowflake and Databricks, they're introducing features that just boggle the mind, even for someone like me who has come up through the ranks of analysis and data architecture: you can effectively write a prompt inline in SQL. So you can select column one, concatenate it with column two, and then in your third column put a prompt that says, analyze the sentiment of the string that I've concatenated. So effectively, for every row in your result, you're seeing an LLM response.
And as we said, with great power comes great responsibility. You're not just gonna be spending your regular compute tokens on this; if you're making mistakes there, it's gonna get expensive. And the use cases are infinite. Just to get some kind of basic sentiment analysis, I might have had to write some gnarly regex to tease out positive and negative keywords. Now AI can pretty confidently tell me, this is a positive review, this is a negative review. So it does make life easy. Some of it is provided by the platforms themselves.
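The "gnarly regex" approach Serge contrasts with per-row LLM sentiment might look like the sketch below. The keyword lists are invented and deliberately crude; the point is how brittle hand-rolled keyword matching is compared to simply asking a model, which is what inline-prompt SQL features now let you do per row.

```python
import re

# Pre-LLM sentiment: hand-maintained keyword regexes.
# These word lists are illustrative only and far from complete.
POSITIVE = re.compile(r"\b(great|love|excellent|happy)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(terrible|hate|broken|refund)\b", re.IGNORECASE)

def keyword_sentiment(review: str) -> str:
    """Classify by counting keyword hits; ties fall back to neutral."""
    pos = len(POSITIVE.findall(review))
    neg = len(NEGATIVE.findall(review))
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(keyword_sentiment("I love this, it works great"))  # positive
print(keyword_sentiment("Terrible. I want a refund."))   # negative
print(keyword_sentiment("It arrived on Tuesday."))       # neutral
```

Sarcasm, negation ("not great"), and vocabulary outside the lists all defeat this approach, which is exactly the drudgery Serge is saying the platform-provided LLM functions remove, at a correspondingly higher per-row cost.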
A lot of teams, as you mentioned, are running their own models, for security reasons, performance reasons, privacy reasons, that they can effectively hook into using MCP or just regular API calls, because they have use cases that demand it. Again, I'm not really the best person to ask at this stage in the game, just because I'm currently not hands-on with any specific data problem. I'm working on a tool that accelerates the work of people who are. What I do hear are the advanced use cases from customers that I interact with. And for them, there's the technical challenge, the same thing we just discussed, but there's also the challenge of making sure that once they write those pipelines, the core understanding of what it is they're analyzing is there for them.
So it's almost like the old is new again. Sure, you can have infinitely new and advanced processing systems on top of your old columns and rows, but the fundamental understanding of what a customer is, what an active customer is, where the table is, and what the synonyms for customer are, all of that has to be encoded. Otherwise, you're setting yourself up for failure, and you're gonna blame AI. You're gonna say it isn't smart enough, when your business users probably can't even figure this out if you have that many inconsistencies. The context has to live somewhere. We started with Joe Reis's famous quote: the data model is still a data model, albeit a crappy one. If it's not systematically encoded, let's say, in a semantic model, it's living in your head. And if it's living in your head, it's not available for AI to consume and orient itself. And that's gonna be a problem, and an expensive one.
[00:49:00] Tobias Macey:
In your work of helping organizations approach that overall challenge of data modeling as more than just a technical exercise, what are some of the most interesting or innovative or unexpected ways that you've seen them develop and implement the overall strategy for that?
[00:49:18] Serge Gershkovich:
Okay, that's a great question because, again, from my unique perspective, there are people who are hands-on with data in the trenches, and there are people who are hands-on with the engineering side of things. I am not a consultant, so I'm not coming in and seeing that directly, and just by design, for privacy reasons, I'm actually not looking at what our customers are building unless they're willing to share it with me. What I do see are the frameworks and the technical demands around how they operate. And again, if you're a consultant and you've seen hundreds of use cases, maybe some of this won't surprise you. But as I've seen SQL DBM grow from just a modeling tool to an enterprise modeling platform, I've seen the needs of the customers evolve from, again, just the basics: can I document something? Can I draw the relationships?
Do you support all of the database objects and properties natively? Once the tool evolved beyond that point, the types of requests we were getting were: it's not one project, it's multiple data domains, and we wanna communicate; we want to make sure that people are consuming our data products. So it's things like data discovery. How do I discover the fact that a certain data source or a certain data domain even exists? How do I reference it without copying it? The fact that we're an online platform means we make it very easy for users to take things like entities from one project and create a pointer, or a reference, or a read-only copy in another project, and model downstream from that. The other thing we're seeing, and I'm not exaggerating, is that we have data architects who oversee dozens, if not upwards of a hundred or more, data teams. So some of our features, like global standards, are ways where a single data architect or a data team can enforce things down to their child projects. What are the conventions we're gonna follow? What are the templates we're gonna use? How are we gonna name things consistently?
Not only is it important for enterprise teams to define that once and only once, but also to make sure it's not just suggested, but enforced. So we have guardrails to make sure that happens. Once you get to a use case of that size, you would expect openness, but a lot of times it's openness with guardrails. A lot of companies, especially when they provide data as a product or data as a service, have internal teams that are encouraged to share, but share to a point. So it's not even about making something accessible; it's making something accessible and making sure that people only see the parts of it that you want them to see. So there are also privacy and RBAC restrictions even at the data model level. Another part is security and compliance.
With everything that we're building, it's easy to forget that it's not just data; there are laws around the data. It's easy to forget until you get slapped with a million-dollar fine. So things like that are also very important: not just the data itself, but the metadata, the tagging, the PII flags, the stewardship. Is this a trusted table, or is this a staging area? Can I consume it? The more mature an organization becomes, the more even these things cannot live in somebody's head. They need to be systematized. They have to be fed in through the same CI/CD process that deploys your objects, the same thing that stamps the metadata tag saying this can be promoted up to the reporting layer and this cannot, or this should have certain tags and policies applied to it to make sure that we're not accidentally exposing it to people who shouldn't have visibility.
So for me, the evolution goes from simply, how do I document one table? What are its columns? What do the columns mean? All the way up to, how do I search across an organization where we have upwards of a hundred or more data teams? Each of those data teams has projects or data domains with thousands of tables, and each of those has definitions and relationships. How do we secure it? How do we share it? How do we find it? How do we search across everything? Those are the main challenges, and they are unique and interesting challenges to solve. There are plenty of people working on the engineering side, and the data modeling side is equally important, especially once you get past a certain size as an organization.
[00:54:31] Tobias Macey:
In your experience of working in this space, talking to customers, coming to grips with all of the different details and nuance around data modeling as that socio-technical exercise, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:54:47] Serge Gershkovich:
From my perspective, I think the most interesting thing is just the wide gamut of methodologies and standards that people employ. Even as an architect of nearly two decades in the industry, I've only seen what I've seen. I've been a consultant, so I've seen some of the accelerated pace of product development. I've worked at companies as a full-time data architect, so I've gotten to know one industry, or several industries, in depth. But with dozens of people a week that I speak to from our client side, I have seen every type of standard imaginable.
Learning to design for the extremes, I guess, has been very important in my role. Not confining myself to the best practices, because we certainly want to guide people to best practices, but also respecting that they might have their own ways of doing things. They might also have their own set of exceptions, so even when we set rules, we wanna be able to grandfather certain legacy things into their design. That's number one. Number two is that the degree of data maturity really matters. As a data architect before joining SQL DBM, I think I might have been a little more stringent in the things I recommended, or the best practices that I would tend to follow or recommend others follow.
I have since learned that it's very much relative to the data maturity of the team and the company as a whole, both for the consumers and the producers of the data. So sometimes I go all the way down to just one big table; I don't think you should complicate your life more than that. Other times, I see people effectively building Data Vault across dozens, if not upwards of a hundred, source systems and managing it effectively. I think that's often underappreciated, because nobody wants to admit that, okay, as a team, maybe we're still fairly junior. Maybe we have a few rock stars, but as a team, we just can't lift this collectively.
And let's see if I can think of another that I haven't already mentioned.
[00:57:14] Tobias Macey:
And so as you continue to work in this space and explore the evolution of the marketplace and the general growth of data modeling as an established practice in organizations of varying size, because up until the past few years it was largely the domain of larger organizations, and smaller companies just did whatever scrappy thing they needed to get by and didn't necessarily adopt these formal practices, I'm curious what are some of the ways that you are staying up to speed, or particular aspects of the problem space that you're interested in exploring further?
[00:57:54] Serge Gershkovich:
Yeah, that's a great question. I think one obvious consideration is that as the ecosystem grows, the demands on the data modeling tool grow along with it. People aren't just limited to building a conceptual model or a physical model. They're gonna be using that model downstream, or rather upstream, of, let's say, dbt. So instead of just generating DDL, are we also able to generate dbt YAML and hand that off to an analytics engineer? As AI evolves, things like semantic models play a much bigger role. Nobody was asking for semantic models a year ago; now they're a must-have. So can we take that same information and generate it as a semantic model? What's really changing is the format. Most of the core of that information, whether it's a semantic model or a dbt model, is effectively the same information conveyed in a different format. In fact, if the term semantic model scares you, or you're not used to seeing one, it's almost identical to a glossary. If you go all the way to the other side of the spectrum, past logical, past conceptual, even simpler than that, a data glossary has the same fields as a semantic model. It's just that one is almost like index cards, and the other is in YAML. But those are just some of the examples of more tools. You mentioned the MCP server earlier. Before, we had an API, and now we have an MCP server on top of that. They're still getting effectively the same context, the same information, the same nuance using those methods.
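The glossary-to-semantic-model equivalence can be illustrated with a small sketch. The entry and field names below are hypothetical and don't follow any particular tool's schema; the point is that the same index-card information re-serializes as YAML-style semantic model text with no new content.

```python
# A glossary entry: index-card-style fields a business user maintains.
# The fields and values are invented for illustration.
glossary = [
    {
        "term": "customer",
        "definition": "A party that has placed at least one order",
        "synonyms": ["client", "account"],
        "table": "dim_customer",
    },
]

def to_semantic_yaml(entries):
    """Re-emit glossary entries as YAML-style semantic-model text.
    Hand-formatted to stay stdlib-only; a real tool would use a
    YAML library and its own schema."""
    lines = ["entities:"]
    for e in entries:
        lines.append(f"  - name: {e['term']}")
        lines.append(f"    description: {e['definition']}")
        lines.append(f"    table: {e['table']}")
        lines.append("    synonyms:")
        lines.extend(f"      - {s}" for s in e["synonyms"])
    return "\n".join(lines)

print(to_semantic_yaml(glossary))
```

Every line of the output traces back to a glossary field, which is the "same information, different format" point: if the glossary is maintained, the semantic model is a serialization exercise, not a modeling one.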
One part, as I said, is keeping up with the industry demands: being able to do the same thing we do with relational objects using semi-structured NoSQL, because that plays an ever bigger role in people's tech stacks. That's just the technology stack side. Then there's everything I said earlier about having a different set of challenges as organizations get bigger. Not interpreting one data model or one set of information, but sharing it across teams, making sure we're not duplicating things unnecessarily, making sure we're securing things, making sure that we're tracking data lineage, not just from source systems, but also across the company. All of these are issues that our tool helps solve. Again, I've worked at large companies; I've seen what I've seen. But having the privilege of talking to our customers and having them share their use cases is, to me, an accelerated education I couldn't have gotten anywhere else short of joining one of them and spending a couple of years understanding the nitty-gritty of everything they're working on. The scope of what they're building is enormous, and only a very disciplined and systematic approach is gonna help something like that succeed. So for me, it's also been a little humbling to see what our customers are able to manage and build at scale, and how they're doing it has been eye-opening.
[01:01:20] Tobias Macey:
Are there any other aspects of this space of data modeling as a combination of organizational and technical exercise, your experience in the space, or the work that you're doing to help evangelize the best practices around how to implement it at organizational scale, that we didn't discuss yet that you'd like to cover before we close out the show?
[01:01:38] Serge Gershkovich:
Well, I guess I'll make a not-so-shameless plug and just recommend my book again. It's called Data Modeling with Snowflake. Don't get hung up on the Snowflake part of it. I just wanted to tie it to one specific system so that someone can actually iterate and try these things out as recipes, and Snowflake is as good a candidate as any, because you can literally get started for free with a trial. It doesn't cost anything, so everyone can follow along. It's a really great resource for understanding why data modeling matters, a book that ties the business value to the technical side, which is what we on the engineering and data side tend to focus on. And if there's one message that I would like to leave your listeners with, it's that data modeling is a team sport, and that team consists of business users and technical data users, data engineers, and data architects.
It's folly to forget the business side. So to close, I would just say: understand that a blueprint is important. Understand that your business is fundamental in helping shape that blueprint. Understand that they are not as technical as you are, and things that are obvious to you might not be so obvious to them. It's almost a beginner's mindset: why should somebody else care about the important work that I'm doing, and how do I get them involved and make it easy for them to contribute? Just starting there is a really good framework for making sure your data initiatives are off to a good start and have a higher chance of succeeding.
[01:03:28] Tobias Macey:
Alright, well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:46] Serge Gershkovich:
I would go with a lack of standards overall, which is not to say that we don't have standards. It's usually the opposite problem: we have too many, take your pick. And that makes it really difficult to standardize and make things effective. For example, you can ask any data engineer: if I wanted a description, where would I put it? In the comment of a column or a table. That much is clear. But beyond that, how do I tag this? Is it dbt table properties, or is it Snowflake tags, or take your system of choice? How do I expose this? We're starting to see things in the catalog space, but there's still no single standard, for that or for almost anything else you could name. And that can trip people up in terms of which tool to bring in if it's not standardized. Can we use this tool? How do we make sure that the tools talk to each other? It's hard and it's overwhelming.
There's no right answer, and it's holding the industry back, I think. On the other hand, we've seen these cycles in the data space: first we couple, then we decouple. Fortunately, when that happens, we at least tend to learn our lessons along the way about what works and what doesn't. So overall we tend to be moving in the right direction, but it still makes it very difficult to recommend something and say that this is a one-size-fits-all solution. So it's really about keeping up to date, staying current, and being able to mix and match when you need to.
[01:05:35] Tobias Macey:
Alright, well, thank you very much for taking the time today to join me and share your thoughts and expertise on this overall practice of data modeling, some of the ways that it manifests in organizations, the role of technical teams in that overall process, and how to best engage with the business. I appreciate the time and effort you're putting into helping popularize that approach, and I hope you enjoy the rest of your day.
[01:06:01] Serge Gershkovich:
Thank you so much. It's been a pleasure. Great talking to you.
[01:06:11] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Serge Gershkovich and Data Modeling
Understanding Data Modeling: Definition and Misconceptions
Challenges in Data Modeling and Business Alignment
Technical Prowess and Data Modeling Success
Impact of Technical Decisions on Data Strategy
AI's Role in Data Modeling
Tool Use and Model Context Protocol Servers
Innovative Strategies in Data Modeling
Lessons Learned in Data Modeling
Future Directions and Challenges in Data Modeling