Data Modeling That Evolves With Your Business Using Data Vault

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

When When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out our friends at Linode.

With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform.

And for your machine learning workloads, they've got dedicated CPU and GPU instances.

Go to data engineering podcast.com/linode,

that's l I n o d e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And setting up and managing a data warehouse for your business analytics is a huge task.

Integrating real time data makes it even more challenging, but the insights you obtained could make or break your business growth.

You deserve a data warehouse the

open

the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights.

And Altenity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics.

Go to data engineering podcast.com/altinity.

That's a l t I n I t y for a free consultation to find out how they can help you today.

And you listen to this show to learn and stay up to date with what's happening in databases,

streaming platforms, big data, and everything else you need to know about modern data management.

For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.

We have partnered with organizations such as O'Reilly Media, Tereneum Global Intelligence,

ODSC,

and Data Council.

Upcoming events include the Software Architecture Conference in New York, Strata Data in San Jose, and PyCon US in Pittsburgh.

Go to data engineering podcast.com/conferences

to learn more about these and other events and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macy. And today, I'm interviewing Kent Granziano again, this time about the data vault modeling approach and the role that it plays in the current data landscape. So, Kent, can you start by introducing yourself? Sure. I'm, Kent Graziano. I am the chief technical evangelist

for Snowflake,

and I'm also a blogger. My blog, the data warrior.

Over the years, I have been involved in data management

from,

these traditional

relational databases and doing OLTP

type modeling in

getting into data warehousing in the mid nineties.

Was a a student of Bill Inman and Claudia Imhof. Got my early start working with them and coauthoring

a book with with Bill on standard data models. Then,

eventually, I did run into this

method and approach called Data Vault,

in the early 2000

and

became sort of a, I guess, a a disciple or an acolyte

of that particular

approach to data warehousing and modeling

and

have been working with that now since, again, the early 2000. And do you remember how you first got involved in the area of data management?

Yeah. Actually, I started off as actually a scientific

programmer doing basic and Fortran

and was doing

contract work with the US Department of Interior out in Denver,

shortly after I got out of college.

And I happened to get into

a position

with working with the Bureau of Mines.

And the

manager for the consultancy that I was working with came to me 1 day and said, hey, I just came back from

a seminar put on by IBM about this thing called relational

database

design. Would you be interested in in learning about that?

And, you know, sounded sounded pretty interesting. I'd been working with data 1 way or another for a number of years at that point, and he then went through sort of a tutorial with me about what relational,

was all about, what the relational theory was all about, and

then had me do a little POC with a a small database.

After going through that, we decided that that was the approach to solve 1 of the problems for, the bureau

and decided to go pick up

a early copy of Oracle. It was actually Oracle version 5.

I got to install it, configure it, and design the very first system that

the Department of Interior used

on a Oracle relational database.

And it was kind of, you know,

almost love at first sight. For some reason, the other relational

concepts and the modeling approach just appealed to me

and that the rest as they say is history. I mean, that was 19

a long time ago

in the last century.

And,

I've I've been at it ever since. So, really, over

over 30 years

working specifically in in data management

and, you know, 25 years now in data warehousing. So I I rapidly transitioned from the OLTP

world

to, data warehousing and business intelligence as soon as I was introduced to And for people who are coming from the

online transaction processing with relational schemas, they're probably at least somewhat familiar with the idea of 3rd normal form and having to break apart the different records in order to be able to keep them discreet and manageable and then joining across multiple tables.

Whereas in the data warehouse landscape, there are different approaches where you have different constraints and optimizations that you're trying to work for where you might denormalize things a little bit. And I know that 1 of the more popular approaches are the star and snowflake schema, which is, you know, snowflake being a slightly more special case version of the star schema as I understand it. And this is where data vault modeling in particular comes into play is in this data warehouse landscape. So I'm wondering if you can just give a bit of an overview

of what data vault modeling is and some of the ways that it differs from those other approaches such as 3rd normal form or the star and snowflake schemas. Sure. So the modeling technique itself

is is a pattern.

In Databolt, we have

several types of objects that we model,

hubs, links, and satellites.

Hubs are a really a list of what we call business keys or natural keys as anyone who's done 3rd normal form might might have referred to it as a natural key. And people who are familiar with dimensional modeling will know it as a durable

key. And so the hub

tables are just very narrow tables where we're just really listing out the the key business

identifiers for the data that we're interested in tracking in our data warehouse.

The links are relationships between those. Again, for folks who are familiar with

relational

design concepts,

it's often called a many to many intersection entity.

And so any relationship between

key objects that you're

seeking to track

are instantiated

in a linked table

inside of the Data Vault

approach.

And then the satellites

hang off of either hubs or links and those are all

descriptive attributes

that are directly related to either a hub or a link.

And the Data Vault approach was designed

very carefully in keeping with

Codd's relational theories

about

how you structure the data.

So in

normalization

theory, we talk about having

the data

be related

specifically to a key.

So the hub being the values

of all the keys for the objects

and then the satellites having the attributes that are fully dependent on that key. So

as an as an example, we'll say for Kent Graziano.

As as an individual, we'll say that that that's the key for a a person hub. My birth date is fully dependent on me, Kent Graziano,

not on anything else. Right? It's,

not it's not associated with, say, an order that I bought. Now they may a company I purchased something from may want to know what my birth date is for demographic tracking, but my birth date isn't really related to the order. It's related strictly to me.

So attributes like that will be in

in the in the satellite hanging off the hub,

where things that are related specifically

to the order, an order is gonna be a link in Data Vault where it's a relationship between a person and a product

and perhaps a company

and maybe even a salesperson. There may be multiple relationships involved that define an order. Well, attributes that are gonna be in a satellite off of the order are things like the quantity, the any special discounts that were specific to that order,

the taxes that were collected

on that particular

object that was sold. And so it all is in keeping with relational theory. So for people who are familiar with third normal form, Data Vault

is in many ways a higher level of normalization.

I had a conversation

with CJ

Date

who is

very famous for having written a number of books along with

EF Codd, the creator of the relational design. CJ wrote a lot of books with him and independently from him about relational design.

And we had a conversation 1 time about, well, when you look at Data Vault, what is it in terms of normalization? And he said, well, really, when he looks at it, hubs and links fall into the category of what's known as 6th normal form because they really are just keys. They're key values.

And then the satellites are 3rd normal form. And so it really is,

I'll say, an extension of 3rd normal form, but it's really, really further normalization

to to break the structures down to a more granular level, where on the dimensional side or the star schema side, we tend to do denormalizations,

where you've got facts, which are the metrics, the numbers that we're trying to do calculations against perhaps, and the dimensions

are all the descriptions about it. So you would have a customer dimension,

a product dimension, things like that. And within those dimensions, we tend to

collect a lot of attributes together

that are

important for analysis, and we refer to this sometimes as denormalization.

So it's not 3rd normal form. It's actually in many cases,

it's

not even first normal form. It would be it would be denormalized or non normalized,

and 1 of the characteristics of that is repeating values.

So say we have a

product dimension and some of the attributes of the product change over time,

it's possible you may have multiple rows in a table in the database where that product is there

3 or 4 times and different attributes have changed. But other attributes are exactly the same, and so it's been repeated. From a relational

perspective, the original design of 3rd normal form was to eliminate,

the potential for what we called update anomalies.

And that happened when you were, say, updating records

and you changed 1 value,

but say there were 5 other rows that that value was also in and you didn't get it. And that's in the denormalized

mode. So when things were in 3rd normal form,

the goal was to have each and every value

attribute only recorded once so that when somebody was looking at the system, there was no risk that there was gonna be conflicts.

Happens a lot in address matching. You know, if you've got the address stored in 5 different places in your system

and you only update 1 of them, well, then things don't come out right. Well, in data warehousing,

we try to collect all those changes, and so we're not as concerned about update anomalies.

We wanna collect the change, and in the Data Vault world, we actually track those changes. So we can say, oh, here. The address for Kent last year was this. The address for Kent this year is is something different. And we could see changes over time,

which allow us to do things like analytics on the data and do trend analysis. Yeah. Definitely. And particularly, the aspect of being able to track the history of values over time is something that often comes up in transactional databases as well because of wanting to maintain some sort of audit history.

And so it's worked around by having some sort of history table where you might have a trigger that actually replicates the row into the history table at the time that it gets updated, so you can try and do some tracking over time of how this value changed. Where in in a data warehouse, it's a much better place for actually keeping track of that information, but some places might not have a data warehouse, and so this is just a workaround to handle that aspect. Exactly. And then another piece that I find is interesting

is the further,

normalization

down to the key value pair so that you do have that granular look at some of these attributes for being able to be much more flexible on how you join across things

versus in 3rd normal form where it's a little bit coarser grand, and so you join

across a fewer number of tables, but you might be pulling in information that you don't actually care about for the purpose of that analysis. And so I'm curious how things like that actually impact some of the performance

and flexibility of the types of queries that you're able to execute in a Data Vault model as opposed to something like their normal form. Right. Now what you're touching on is exactly some of the reasons that Data Vault evolved was, you know, by definition, I'll say, you know, Bill Inman in his original definitions of data warehouse,

we talked about it being time variant. And that meant we would be able to see changes over time. And like you said, people were, many cases, were trying to do that in their operational systems and having performance issues because the operational system wasn't designed to handle all that data when you tried to run a report. If you wanted the history, you were having to join together multiple tables.

So we started evolving in the Datavault world to store all of that. And what happened is often people took the 3rd neural form model and put it into a data warehouse, which really that early days meant they put it on a different server in a different database and started collecting the history. Well, how are you gonna collect the history without

overwriting

things constantly? Well, they discovered, and this was Bill Inman's approach, was put a snapshot date on it. So now you had maybe a a customer table

that would have the same customer in it multiple times but with different snapshot dates. And so now you can start doing that trend analysis.

Well, over time,

you know, a number of things happen with that, from a performance perspective is depending on the platform you're on. You started kinda hitting the max capability to

aggregate and read all of that data

and got into a lot of, you know, say, fancy performance tuning and optimizations that the various database vendors

added in to handle this because most of the databases, well, they all started out as OLTP

and eventually they started adopting

and adapting for data warehousing to do different kinds of queries. So as the data got bigger and bigger and bigger, queries got slower, slower, and slower. So there there was that aspect to it. Then you started getting more specialized

technologies,

and appliances that perform better, were designed for this type of query. Now what has complicated it is then when you start having to make changes, and this is kinda where Data Vault comes in, is with the 3rd normal form approach with snapshot dates, the queries get very complicated. So if you had a master detail detail type of association

and you've got snapshot dates at all levels and data is changing at different rates in the different tables,

trying to put together the query to get a single point in time view. You know, what did that look like last week? You know, some of the data's changed. Some of the data hasn't changed. A lot of subqueries having to go on

to try to line that all up in betweens, and and that gets problematic, and it's very hard to maintain. The the other issue you should get is if you try to integrate multiple source systems with that kind of model, it's very hard to get a common model and get it aligned. And then the, you know, the worst case scenario is your source system changes,

And you've got this, you know, 5 levels deep with snapshot dates and and the source changes,

whether it's new attributes or new tables. And refactoring that model can be very expensive and very time consuming. So this was 1 of the problems

that Data Vault was really looking to solve. Then on the other side, you had in order to try to get performance better,

we have the dimensional modeling approach. And 1 of the things that I've seen over the, you know, bunch of years that I've been doing this is the dimensional model or star schemas as they're they're often called are really a, to me,

an articulation

of a pivot table in a database format. And many business analysts, they they think very well in pivot tables, and it's a very great paradigm for doing business analytics.

Well,

trying to do that on a 3rd normal form model just didn't work

in the older technologies.

So people started using dimensional models and doing the denormalization,

some pre aggregation, so we run a process

to put all this data together

in a form that's easier for a business user to consume, takes up a little more space because now we're introducing redundancies

in the dataset, but would often perform better because the queries are less complicated.

You know, a few fewer joins,

you know, in the process of building the dimensional model, you deal with all these snapshot dates and things like that. So downside of that was that you've got a fixed model. You've got a fact with, say, 5 dimensions. Now the same problem happens here. You get a new source system

and say your customer dimension has 20 attributes in it already.

It's just 1 source system. Now you get a second source system that has maybe 40 attributes,

some of which are

similar or equivalent to the 20 you already have and the other are brand new. Well, 2 things now. You have to if you need all that information for your analysis,

you need

to reengineer

or refactor that customer dimension, and that means adding a bunch of attributes. Well, that's great. So the the net new attributes coming from source system 2,

add them in,

modify your ETL process to populate those. Well, what about the first 20 that came from source system 1 that have equivalents in source system 2? Now you have to deal with

an order of precedence,

and I I did this in 1 of the very large data warehouses I worked on for HP years ago.

We had to decide, and it wasn't just 2, we had 5 source systems.

So we had to say for column 1, customer name, we can get it from 5 different source systems. Which name do we want in our reporting system? And so we had to write code

that said if source system 1 has the name, then we use it. If for some reason it's null in source system 2, then we will look at the we look at number 2, and then we look at number 3, and we look at number 4, and we look at number 5. And, of course, that's a very simple example. It was never actually that simple. But so there's a lot of coding involved

and having to not only refactor the design, but also refactor the load process. Lot

of

you know, terabytes of data and billions of rows of data. It takes a lot of time and a lot of cost to reengineer that. The approach with Data Vault, as you said, getting down to a more granular level, was designed to

eliminate that refactoring,

to be able to integrate new sources much quicker, but also have a a model that would be resilient, that it's gonna be business oriented,

business focused, so the business keys or natural keys, and integrating across those because you hope that, you know, across your your 5 source systems,

that there's a common set of attributes that you can align on. Now from a performance perspective,

1 of the things that that means what early days of Data

Vault, people looked at it and went, wow. That's a lot more tables,

especially if they were used to dimensional models. Dimensional models are are very clean. You know, fact table, 5 dimensions. Great.

You break that out into a data vault model. It could be I did did 1 of these 1 times where I had a dimension

that had 3 layers of hierarchy in the dimension. And by the time we finished the data vault model

that was behind that dimension, we had 25 tables. And so depending on the database engine you're working on, that may look like, well, I don't know how we're ever gonna get a query to perform to get a report out. And part of the Data Vault, and I'll say methodology now rather than modeling technique, but the Data Vault methodology

was that the Data

Vault is the core of your enterprise data warehouse. It really is the the history

of all the

data, time variant that you're you're trying to store. But for reporting, we still would tend to project those into dimensional models because most of the BI tools tend to think in terms of facts and dimensions. In the early days, again,

because performance was potentially an issue on that many joins. They're simple joins. They're not like the complex joins and trying to do the snapshot in their normal form, but there's a lot of them. And depending on the platform,

you might be able to make those work really well and just do views. This is particularly true in Snowflake.

We have a lot of customers that are doing that. They're able to just put a view on a bunch of Datavault tables, and it looks like a dimensional model. That's what gets exposed to the reporting

side. Previously, we had to we would just, you know, write a little ELT process that would take and turn that into the dimensional model and project that data from the data vault into a dimensional model again in order to get the performance. And 1 of the conclusions that I've come to over the years where people ask, like, well, you know, why did we have dimensional models? And it was twofold. 1 is a business perspective because it puts the data in a format that is easy to consume

by most business analysts. And the other was a performance 1, as as you kind of indicated earlier, is we needed to be able to have the queries be fast on large sets of data.

And it was simply because the underlying technology

couldn't handle the complex joins. With the lower level of granularity that you get with 3rd normal form and then even lower with Data Vault, it allows you

more flexibility

in the questions that you ask of the data model and your data warehouse.

You don't have to know all the questions to design the schema. You have to know the business and the business semantics and the context of the data to start building a data vault. Where with a dimensional model, you kinda need to know what questions are gonna be asked. So if somebody says, I need reports. I wanna know sales by

region,

by customer,

and by product.

Well, that's a particular dimensional model effect in a couple of dimensions.

So if you go to building all of that out and then they come back and say, oh, I also need to know the sales rep. Well, that's a different dimension, and that's potentially a different level of granularity in the fact table, and that means building or rebuilding that fact table and adding dimension to it. So it's, again, it's a refactoring or reengineering

that can be very time consuming,

especially if this is the only place you've stored the data. In the Data Vault world, you simply go back to your Data Vault and say, okay. Where's our information on sales rep? And then we can add that

dynamically into the model

and into the design of the reporting layer.

Again, giving us that much more flexibility to be responsive

to

changing business requirements.

And if we don't happen to have the sales rep information

because of the granularity

at which we model in Data Vault, we go get that information, and we model the hubs and links appropriate

to connect all that data together. And so even if we had requirements that we didn't know about,

we can evolve the model organically

without having to reengineer what we already did. And that's actually 1 of the huge benefits that I saw in in Data Vault,

especially as we moved into trying to do more agile approaches

to data warehousing,

and we've got, you know, pressure to deliver faster. How do we do that? And the data vault modeling technique is

the only 1 that I'd seen

that that really could accommodate that, that could allow you to

model in a very incremental manner,

be business aligned,

and and deliver and but not boxing yourself into a corner that's gonna require a lot of work if you didn't quite get it right the first time. Yeah. There's definitely a lot of great information to pull out from what you were just saying. So some of that being things like

in the dimensional approach, particularly in the, you know, early to mid nineties and maybe even into the early 2000 where a lot of the processes

to get data into data warehouses was heavily around these ETL workflows where there was some amount of data destruction that happened in order to be able to denormalize

and,

sort of canonize different records to load it into the database and things like master data management for being able to resolve some of the situations that you were talking about with having customer information across 5 different databases and which 1 is the right 1, which 1 do we want to pull it from,

And then also things like, how do we handle integrating

additional data sources

in a fashion that we're able to do it rapidly to be able to answer questions at the time that they're being asked rather than

5 months from now once I've actually finally gotten the transformations in place to load in those additional fact tables and build the additional reporting on top of that.

And so I can definitely see where

Data Vault is a

methodology or approach that has a lot of value for the current landscape of data where we do have an explosion of different sources that we're dealing with, not even all of them being native to within the business that we're working on where we have different SaaS vendors that we wanna be able to pull data out of to be able to enhance our reporting capabilities.

And so, I'm wondering if you can maybe give a bit more of the historical context of how Data Vault came to be and sort of the existing state of the art at the time and some of the pressures that led to the 2 0 version of the Data Vault methodology

spend so much of your time and focus and energy on actually helping to promote and teach data vault modeling. Sure. Yeah. It's always good to have historical perspective to know, you know, how how did we get here?

Well, the way we got here was really was, Dan Linstead, who's the inventor of of this approach

in the nineties.

He was doing big data before anybody knew what big data was. He was challenged with building

a large enterprise data warehouse for a US defense contractor.

He knew Bill Inman.

He was adept at both dimensional modeling and 3rd normal form modeling, but the

cost

of the refactoring

as the requirements

rapidly evolved

was problematic. And he was dealing with you know, he was in the terabyte range of data back then, and so trying to reload a dimension that's a terabyte was just took way, way, way too long. So he couldn't afford the the reengineering

that was required as his requirements were changing. So he set about trying to figure out, you know, mathematically,

what would be a different way to model these things that would give us that flexibility?

And

through 10 years of research and development from 1990 to 2000,

he evolved this approach that we now call Data Vault. And then in 2001,

he published a series of blog articles on

tdan.com,

introducing the world to Data Vault. I was pretty heavy into data warehousing at that time. I had, as I said, coauthored a book with Bill Inman.

I had been teaching Inman's

methodology

for creating enterprise data warehouse models. I had had experience with doing the Kimbell dimensional approach as well for various customers. I, you know, at the time, was a, was a consultant with with a a data consultancy

in Colorado.

And so I had experience with all both of the other approaches,

and I had the opportunity

to attend a lunch and learn that Dan gave introducing the Databall concepts.

And I could see what he was doing. He did a great job of explaining

the pros and cons of his approach versus 3rd normal form versus dimensional,

much of which I've already recounted to you. And so that that got me interested,

did a little more research,

asked him a lot of questions,

found out that he had been in contact with Bill Inman, so I had some conversations with Bill. And Bill, who really kind of invented the original way of doing data warehousing,

said, hey. I think Dan's got something here that this looks like it it does indeed solve the problems that, you know, people are running into,

particularly at scale. And in the early 2000, we were just starting

to to get into that

area, which we called VLDB,

very large databases,

which then eventually became very large data warehouses.

And then fast forward another decade, we're at,

big data. Right? We're talking about big data, and we've gone from talking about gigabytes and 100 of gigabytes of data to terabytes of data and then 100 of terabytes of data.

And lucky for, you know, many of us as practitioners, Dan

had you know, was in a in a industry that had that kind of data long before most of us ever saw it. And so it really did seem to solve solve these problems in a flexible way and really was pre agile. You know? Dan was trying to solve an engineering problem

to be cost effective and deliver business results. Well, 2001 comes along, and we have the Agile manifesto, and that's exactly what the agile software movement was all about. And so Dan was kind of ahead of the game with,

the thoughts

about how you deliver these things faster from a data perspective.

And so the 2 kind of blended together. And then I happen to be starting a new project at the time, so I had the opportunity to build a greenfield data warehouse

using this technique under Dan's mentorship with him checking my models,

checking our processes,

helping,

teach my team how to go about doing this, and it proved to be very, very effective. And because it was

so effective

and we became in the IT department I was in at the time, my team was the most productive team in the department,

We were delivering

changes on a weekly basis

to

a enterprise data warehouse

in, in the public sector, and so that certainly made a believer out of me. And in the process of that, you know, became friends with Dan

and

and helped him then write the first 2 books on Datavault because I did become, you know, a believer in this approach having done the other 2 prior to that

and seeing the architecture and the methodology

and and then seeing how it blended,

get introduced to agile at the same time and seeing how the 2 kind of blended together that we could actually build

an agile data engineering framework

using the Datapault approach.

And, you know, to your point about, you know, why have I invested so much time and energy into promoting the Data Vault approach? That's why. Because it's really about

changing

the perceptions of data warehousing and analytics as being these monolithic

long term projects that

really you know, depending on the statistics you've read over the years, you're talking

50 to 80% of those projects are deemed to fail. Well, the reason they mostly fail is because they're not delivering the business value

or they're not delivering the business value in a timely manner. So seeing any approach

that, you know, has been 6 I've been successful with for over a decade now in delivering quickly

and and being to able to adjust to change. That's why I talk about it, and I I tell people about it when they ask. I did do a lot of it even, you know, in my role at Snowflake. I'm I'm finding more and more of our customers

are looking for a better way,

And I ask them on a regular basis when somebody calls and says, hey, Kent. We we need you to talk to this team about Data Vault. I always start off with, well, why are you looking at Data Vault? What's brought you to this point? You know, often I'm talking to people who, like myself, may have a decade or more experience in doing data warehousing,

have had successful careers. They're at successful companies. And, you know, now they're moving to the cloud and they're gonna move to Snowflake, and they're looking at

Data Vault. It's like, why?

And then they recount to me many of the things that we talked about earlier as to where the issues they had in maintaining their their dimensional models or their 3rd normal form models and how fragile an ecosystem it was and how how difficult it was to make changes and apply changes. And the reality that everyone sees today is the world, the business world, in particular, is is changing rapidly. The data sources are changing. There's new data sources

every day. And companies that are in the forefront

of their industry,

they wanna be able to take advantage of that data,

which means they meet the from a data warehousing perspective and an analytics perspective, they need to be able to onboard that data quickly and get it into a form that people can access and do the analytics on. And they want it, of course, to not be completely independent

of the data they already have. They want to integrate it. And so the structure with hubs and links

allows companies to do that very quickly. And so I'm seeing a huge, I mean,

massive increase in interest in Data Vault in the last 3 years.

This past year, I have spoken with

over a 100 organizations

that some of which are already doing Data Vault in their current environments,

others who are investigating it because of the problems they're seeing, and they wanna take the opportunity

as they enhance and modernize their data e data ecosystem

to also enhance and modernize

their data engineering and their data architecture approach.

And 1 of the things that is interesting in what you mentioned in there earlier is the fact that you had the opportunity to build a greenfield data warehouse using Data Vault. And I'm curious what your experience has been in terms of trying to bring this methodology into an existing data warehouse that maybe already has dimensional model set up and then being able to

retrofit some of these hub and link tables and satellite tables into the overall table structure, and then maybe gradually replace some of those dimensional tables with views on top of the

underlying information that's held in the underlying tables that are modeled more in the Databolt methodology.

Yeah. It's, I'm seeing more and more of that. I actually

had a email last night from somebody

asking exactly about that, about a a customer that's now looking at taking their traditional data warehouse model

and and moving into a Data Vault style warehouse and ask, you know, what's the best way to do that? And what I've seen people do, you know, you can start with, you know, the dimensional model. If you understand the

Data Vault

concepts, you know,

you you really understand

the concept of business keys, you can

effectively reverse engineer a dimensional model into a Datavall model, and I've gone through exercises

like that. And in,

Dan's original coursework, the the first class I ever took from him back in the early 2000,

That was 1 of the exercises we did in the classes, you know, because people understood dimensional models. It's like, well, let's take this dimensional model and turn it into a Data Vault model so you can see the correlation. And, of course, if you can reverse engineer it from the dimensional to Data Vault, then projecting a dimensional on top of the Data Vault now becomes fairly obvious. And so that is 1 of the approaches people will will often take. I've also been in organizations that just went with,

let's start over and like to do a business model first and drive it truly from the business side without even looking at any of the source systems. And then we start doing the mapping of the source systems into the Data Vault. So it really kinda depends on, I'll say, the where the organization is in their life cycle and in their maturity. Now 1 of the keys though is there are some nuances

to data, and every system's a little bit different.

And I have found though that as simple as Data Vault looks on the surface, when you start getting into integrating multiple systems,

and if you have an organization that is not very mature in their data management and don't have, say, like, a a good semantic

dictionary

of business terms, that they they sometimes struggle in coming up with the right Data Vault model. They have an existing model. It's a little bit easier to back into the Data Vault model. But there's definitely challenges there where I find people really do need to go take Data Vault boot camp classes, get certified,

work with a Data Vault mentor,

to review it like I did. I mean, that's truly

I attribute the success I had

completely to the fact that I was able to have Dan as a mentor and a coach

working with my team. Now that means that the organization that you're working with has to be willing to fund the the ongoing training and professional development, coaching, and all of that. But that really is, you know, kind of the key to success with anything new, any new technology. I mean, you can figure out some of it on your own, but it's often good to have someone looking over your shoulder to make sure you don't make any any mistakes.

And then another interesting element

to explore is, for the most part, we've been focusing our conversation

on applying these methodologies

specifically to data warehouses and relational engines.

And the past decade plus, you know, maybe the past 20 years at this point, there has been a lot of churn in terms of what people are viewing as the state of the art, particularly for handling large volumes of data, but sometimes even small where things like Hadoop and data lakes have been coming more to the forefront, and there's been a bit of ebb and flow there.

There's been,

rise and somewhat fall in popularity of non relational engines or the so called no SQL movement,

and then things like semi structured or unstructured data being landed in different storage locations to then be analyzed or reformatted later and things like some of the new cloud data warehouses allowing for

ingesting some of the semi or unstructured data and then being able to

either do the transformations with the database engine using an ELT type approach or just building views on top of that data. And so I'm curious,

from a high level, how you view that as having impacted the overall

approach to data modeling and the state of the art in terms of how people think about data modeling,

but also if there is any

useful application of the Datavault methodologies

to something like a data lake or a non relational engine. Yeah. So, yeah, there's a couple of parts to that question.

Unfortunately,

with the advent of NoSQL about a decade ago,

a a little bit of a movement started to say that we don't need to do data modeling anymore. And, unfortunately, I'd say that that set us back a little bit, especially in the, analytics space. People started, you know, as soon as soon as a manager

hears that, oh, we don't need to do something, they tend to latch on to that because, oh, that means we can do we can move faster. Right? We can we can get there faster because I can eliminate all of this work we were doing.

And, unfortunately, the reality was is people were following a waterfall method rather than an agile approach

and taking way too long to do data modeling. The concept of schema on read for semi structured data in particular

is really it's a it's a great concept. And for application developers, it allowed them to be much more productive, deliver software faster because they're writing out a, you know, a, basically, a key value pair tag JSON document

into a document store

of some sort. And so they didn't have to do a lot of modeling upfront. People refer to it as upfront modeling. They could start writing their applications without having to go through trying to do a third normal form model, things like that, like you would have done in a relational database. So it was great great productivity boost. The downside of that is it's schema on read. Is when you go to

read the data and you wanna write reports on the data, you have to understand the design. So at some level, you just have to start interpreting that schema,

and so the the modeling aspect, if you will, got pushed down downstream to the output side. And, of course, data warehousing is all about the output and the analytics.

So somewhere along the line is a concept of a model,

and maybe it is a conceptual model, sometimes it's a physical model, has to be laid on top of that semi structured data. Now as you said, newer technologies have now evolved to do things like be able to write a SQL statement against something like that. And that's something that Snowflake is known for with our variant data type, and that does allow people to load

semi structured data in a schema less fashion

into

a table that has 1 column in it. Right? So there's not a lot of modeling involved there. You create a table, put a column in it, load the data in. But then to get it out, we write SQL against that to represent the data in a form that's consumable

by,

you know, either a, an analytics user or potentially by a data scientist

or, to be fed into a dashboard. And depending on what technology you're working with, you have a variety of of tools available to do that. So

the impact on modeling from this

is that we're now coming back around

to understanding that, well, modeling and the skill to do that is still valuable and is still needed. And in fact, even at a recent data modeling conference I was at, we had someone from Amazon talking about their move

from a structured relational database

to a a NoSQL style database for their operational systems.

And even the the lead architect who was talking about that said because of the nature

of the document store, he discovered he still needed to model. He still needed to think about how he structured those documents and how he organized them in the NoSQL in order to get the optimal performance for, say, the order entry system and certainly for doing the reporting afterwards,

that there was a lot of dependencies on how he organized the data as to how well it actually performed. So we're really back to, you know, a fit for purpose type of data model that he was having to do. So it is starting to make

a resurgence. This is the first time, in a decade that I've heard somebody talk about that, but I've heard 3 such talks in the last year now where they're saying, yeah. We really need to do some in this case, it's physical modeling in order to optimize that. Now where that flows over into this whole data lake concept and data vaults and where that's all related, I have seen,

and as recently as this week, talked with a prospect

that has implemented

a Data Vault methodology

in a new SQL type environment.

I even saw 1 at the Data Vault Conference,

in Germany end of last year. Somebody doing Data Vault in MongoDB and so it is still transferable because they're concepts, and it's about how you structure the data. So the ideas of hubs and links and how you do it physically

ends up being the question. Over, I'll say, 2 probably 2 years ago, at least that was more than 2 years ago.

Even at the Data Vault Conference, the annual Data Vault Conference that Dan runs,

there's been discussions

and demonstrations

of doing Data Vault on Hadoop. And I actually know of 1 customer, very large customer in Australia,

that half of their Data Vault is in Hadoop and half of their Data Vault is in Teradata. Now you earlier asked a question about Data Vault 2 and where did that came from. And that particular scenario is in part where Data Vault 2 came from. The idea of using

a hashed primary

key or primary key that's generated by applying a hash function

to the business key columns in the Data Vault model came about in that scenario because they had twofold, 2 issues. 1, in Datavault 1, we were using surrogate keys very similar to dimensional models, and they were having,

throughput issues with their sequence generator. They were loading data so fast that the bottleneck

was the sequence generator coming up with the next number. And so they got to thinking, well, how can we parallelize

this? Because we're using an MPP machine and we wanna use multiple threads and multiple CPUs. We need a different way to calculate the primary keys. But in tandem with that, they were also having some of the data was over in Hadoop. Well, Hadoop doesn't have a sequence generator anyway. And what they came around to was using an MD 5 hash because they could execute an industry standard MD 5 hash function on a set of attributes

in the relational side and on the Hadoop side and get exactly the same value. So they were able to form a logical join between the data on the hoop dupe side and the data in the

relational side. And that was 1 of the the part of the rationale for the evolution of Datavault 2 in particular

using the hash keys is now we could load very rapidly in parallel, basically, calculating the key rather than waiting for for a sequence generator to pop it up.

So we're now calculating a unique key on the fly, and that means we can load hubs, links, and satellites all in parallel rather than having to load all the hubs first to get the sequence, and then you can load the the subordinate

satellites

and the associated links. So you always had to have a hub first. Now since we're calculating the key with a hash function, we can load all these these things in parallel, which means you can take huge advantage of an MPP

style approach. So I have seen it done in in in both worlds,

adopting the sort of the, the Hadoop world. People were doing that because of the expense

of storing all that data in a relational system. So this is 1 way they offloaded the expense.

Now today,

with the evolution of Blob storage,

I think we've solved the expense problem. And Snowflake in particular because underlying is

is Blob storage and with our compression,

we've eliminated that particular barrier that there's no reason

to be building your data vault outside of a relational

environment like Snowflake because we have solved the cost problem. So that's come down. On the semi structured data side, you know, getting back to the the variant data type that we have in in Snowflake and how that relates to Data Vault, We've got some really interesting things going on with a couple of our customers.

Dan and I came up with this idea. I was like,

how do we deal with semi structured data in a Data Vault environment? Exactly the question you're asking. And our conclusion was, well, since all the dependent attributes go into a satellite, can we just put the JSON document in the satellite? And And so we still have a little bit of structure and we say, you know, you still have the primary key to the hub,

a load date, and a metadata tag for where the data came from. And then in the Snowflake world, we could put a variant column in there and load a JSON document into there. So as part of the ELT process, we'll pull out the business keys by querying that JSON and get the business keys out of that, put that in the hub so that we can easily find things, and that that really is your kind of, data vault index, if you will, on where all the data is is in the hubs

and joining down to the satellite.

And 1 of the customers that I'm working with has found that they have been able to, you know, accelerate the onboarding of data by at least tenfold because they're able to

they simply bring these JSON into a

what they call it, a data lake, a raw data lake inside of Snowflake

where they're just loading that data into

tables with the raw data with a variant column, and they load that in there. And then they go through their process of pulling out the the business keys, building hubs, and building links,

and then dropping the entire JSON document they'll drop into an attribute satellite, and they call that their raw data vault. And then they write views

on top of that for the business vault. So you may have, say, a JSON document that has a 100 keys in it. Well, the current requirement

in the current sprint only requires 3 of those attributes. So instead of trying to ETL all those at the 100 keys out into columns and do the transformations

sort of in a batch process, they just write views. And then the views

are dynamic. And so they've got requirements for 3 of those columns. The view only has 3 columns in it. Next sprint comes around, priorities change, whatever it is, and now we need 10 columns out of that JSON. Well,

their process now is modify the view and execute, and and they're good to go. So the turnaround time

to deliver value from the data with changing requirements

is is gone down from what probably was weeks or even months, I'll say in the old world, down to a couple of hours that they can do that so rapidly. And not having to parse all the data out on the way in means their ingestion

is very rapid, so they've got a lot of data streaming in. They're able to do something

they haven't been able to do before. They're able to load in near real time into the Data Vault. And since there's views sitting on top of the Data Vault, as fast as that data is getting loaded into the Data Vault, the next query that runs is able to see that data. So they've eliminated the latency in their data pipeline as well

and really accelerated their ability to, to deliver the data to the business users in a a much more timely manner and really be responsive to the the changing business environment.

So it's definitely easy to,

as we talk about this, start thinking that Datavault is the solution to all of my problems in terms of being able to handle

modeling and accessing and storing all of my data in a very agile fashion to get quick time to value. But what are some of the cases where the Data Vault approach doesn't really fit the needs of an organization or a use case, or

it's,

unnecessarily cumbersome

because of the size and maturity of the data or the, institution that's trying to implement it. Yeah. I mean, it's like everything in consulting

consulting. We always say it depends. Right? If you you know, smaller, less mature organizations

that don't have

a competency

in data architecture and data modeling certainly doesn't wanna start here.

If they don't understand their data,

you know, it's it's really hard to

come up with a good data vault model if you don't understand the data.

So starting with more raw data

and, you know, even putting views on that, pointing reporting tools at it is often a better way. If you don't have many data sources, if you've got 1 or 2 data sources

and your organization

is very familiar and comfortable with dimensional modeling,

then there's there's nothing wrong with doing that because it is all about delivering the value. And if you're able to deliver the value

using your current approach, then there's no reason to change. Because I I have seen, unfortunately, you know,

a number of data vaults that didn't work out when I talk to people, it usually is because

they really didn't understand the approach,

or they they took shortcuts because they didn't understand the approach, and they actually ended up in a in a worse off situation. They are truly trying to solve the problems that Data Vault was solved, but they didn't quite get there. And they were maybe a little premature in adopting that approach.

If you're dealing with things like

let's call it, you know,

true near real time

analytics where you've got IoT data and streaming data coming in and even as we talk about edge computing,

where you really you're you just gotta mine a couple of bits out of that data that's coming in and it's transient. You're not even gonna store it. Right? We certainly wouldn't wanna put a data vault in the middle of that, that that's gonna take time to design,

when you know it's only okay. We've got a a stream coming in from an IoT device and we know what we need off of that. And we need to be visualizing it as it comes in because it's it's critical data, and we're doing, like, real time monitoring.

Now doing doing a data vault in that case is not good. Now you may, on the other side of that, if there's data that's coming through that near real time stream, that you want to be able to do trend analysis

over time. So then you might consider building a Data Vault model on the other end, but not in the middle of the stream, if that makes sense. Right?

So, yeah, there's there's certainly situations where it's not gonna make sense.

There are, you know, any number of organizations out there that are making transitions again from on prem to the cloud. You know, my my

my basic stance is if it if it ain't broke, don't fix it. If the environment you're working in is serving the business needs

and you have

an SDLC process

in place that's agile enough for your organization, you're meeting your SLAs, great.

The reason you're moving to the cloud is to because it's growing and you need more space. You want the elasticity of the cloud.

You want the dynamic nature of something like Snowflake.

You wanna

want it as a service. You're you're you're wanting to eliminate some of the overhead of the management.

Well, that doesn't need to necessarily affect your data modeling technique. So that's always a you know, it's a it's a balance,

right,

of how are things working today,

where do you need to go tomorrow,

Is Data Vault the right approach to help you get there? And, you know, there's nothing wrong with saying no, that it's not,

but it is it is definitely use case specific. And,

you know, there are certain scenarios where, yeah, it's it's probably it's not a good idea. It may be over engineering

for some circumstances.

But as soon as you start to get to, we want to persist data over time, we know we've got 2 source systems. We know we're going to be getting new source systems. We know we're going to be getting mergers, and that's going to grow. And, systems. We know we're gonna be getting mergers,

and that's gonna grow. And if you've already experienced

issues with maintaining the approach that you're currently using, whether it be 3rd normal form, dimensional,

or completely denormalized,

well, then you would take a look at Databall and say, well, will this help us solve our problems, and what's a good road map for us to get there? Bill Inman has always said, no. Don't do big bang when he talks about data warehousing. And I'd say the same thing about Data Vault. You know? Don't don't try to change everything at once. Right? Don't don't

have too many moving parts. Do it incrementally.

Allow yourself time to to learn and grow

into the approach. Yeah. Boiling the ocean is never a, a good strategy.

Yeah.

Yeah. If you you know, like I said, I I was lucky in that the first endeavor I had with it was greenfield.

There was an ODS model, which I had designed myself. It was an integrated ODS,

model. It was a source system agnostic. So I already had the business keys.

We already figured that out in building out the ODS,

because we knew our source systems were gonna be changing. So we build an abstracted

ODS model.

So

having that and then being able to build a data warehouse on top of that using the Data Vault approach

was

was was great. I mean, it just worked out quite well to be able to do that, and I've been able to do that in a couple of industries over over my career.

So I know that that approach works, but I've also had to retrofit

that it is,

it's timing, and it's setting the right priorities from a business perspective as to what you're trying to accomplish.

And so

in terms of the foundational skills and experience and knowledge that are necessary for effective data modeling, what have you found to be some of the core elements?

And for listeners who want to learn more about data modeling in general or Data Vault specifically, what are some of the references or exercises that you recommend?

Sure. So, I I mean,

core fundamentals

in in in data modeling in normalization.

So understanding the concepts of normalization

are are great value.

I found that people who

have only done dimensional modeling sometimes struggle with going over State of Vault because

it flies to the face of what they've been trained

to do. They've been trained to do denormalizations,

and they think in terms of facts and dimensions. Like I said, it's very valuable from a business perspective to understand the data that way and present it that way.

Folks who came from,

the OLTP

background and have a background in their normal form,

Data Vault makes sense to them because it is just a slightly deeper level of normalization.

So for people who are trying to get into this,

there's any number of data modeling books out there you can find on Amazon

about

database design. David Hay has a great book. But,

there there's lots of books out there on on things like

data modeling. On on the data vault side, then you've got, supercharger data warehouse, which I I co wrote with,

with Dan Linstead.

And then

there is a

a Data Vault 2 dot o book as well.

And

these books, you can actually find links to them on my blog,

kentgraziano.com,

and you can find

links right to them. So supercharge your data warehouse and building scalable data warehouse with Data Vault 2 dot o.

And then I have a book that I did, which is an introduction to agile day data engineering

using Data Vault 2.0, which is available in either a Kindle format or a paperback format now. So those are good places to start. If you want just an overview of industry models, things like that, there is

the data model resource book by Lynn Silverstone.

I coauthored the first edition of that book with him and Bill Inman,

and that gives some, I'll say,

generic model examples

for things like people

in organizations,

you know, sales related, things like that, and has a a number of chapters then that walk through converting that into a a data warehouse model. In particular, it shows examples

of different levels of granularity, star schemas. If you're, you know, on the dimensional modeling side, there is, of course,

doctor Ralph Kimball's book, the data data warehouse toolkit for, you know, general introductions to the the data warehousing world, Bill Inman's original

book called building the data warehouse. It's probably in 7th edition at this point. On the out in the blogosphere,

you can, look up, Dan Linstead himself.

He's on Twitter, on LinkedIn.

He he has a blog. There is a a site dedicated to Data Vault and all called the Data Vault Alliance,

and that's

a online forum. There is some videos on there, some introductory videos, couple introductory articles, and some blog posts. Plus, there is a,

a forum you can subscribe to to get into deeper conversations

with practitioners.

And then a number of companies offer data vault training. So Dan Linstead, of course, himself

does

do data vault certification training.

Oftentimes, he does private classes with corporations who have,

large teams that they they wanna get started in Data Vault. There

is 1 of his partner companies called Performance

G2,

and they offer public Data Vault classes

about every other month at various places in the United States.

And anyone who follows me on Twitter,

will see me post when those come up. I I try to keep keep the community updated on on things like that.

And over in Europe, the company,

is Scale Free It's the name of the company that runs Databall classes over in Europe. That's also

a partner with, with Dan.

And then his certified training partner in Australia is Certus,

c

e r t u s. And all 3 of these companies, Scalefree,

Performance g 2, and Certus are also

Snowflake partners as well. So if you happen to be looking at Datavault

and Datavault in a in a Snowflake environment,

they can all help you with with that as well. And, of course, Dan is around around the world globally, so

he is

always available for,

consulting,

coaching, and mentoring on all things Data Vault. So are there any other aspects

of the Data Vault methodology

or data warehouse modeling or anything,

tangential to what we talked about today that you think we should discuss further? There is And

the

Data

Vault

world, and

in the Data Vault world. And you mentioned this very early on in our talk here about, you know, how we transform all the data and do all these ETL processes and things like that. In the Data Vault world is we we load the data raw. We take it from the source system and and put it into what we would call a raw vault in order to maintain that traceability and auditability. But because of the patterns involved in Data Vault, there are

data warehouse automation tools on the market that some of which are specialized

specifically in generating Data Vault style warehouses,

Others,

that do 3rd normal form, dimensional, and Data Vault. So that's another area for people to kinda

investigate is, data warehouse automation

and the, you know, the I'll say the for fear of leaving somebody out, number of companies that that have these things are Warescape.

Erwin

acquired a company in a year ago that has

data vault generation capabilities built into it. It's a data intelligence tool. There's a company called Vault Speed

and another 1 called Data Vault Builder. And then, over in Australia,

Verigence has a company has a product called Bimmelflex

that, generates

Data Vault style schemas as well. And there's others coming up all the time, consulting firms around the world who are doing data vault implementations.

Many of them are are developing their own tooling to, accelerate the time time to value of implementing

a Data Vault style warehouse.

So definitely worth looking into, especially if, you know, you're looking at Data Vault to try to be more agile

and more flexible in your approach, then throwing an automation tool or automation framework on top of that, just makes sense to me. Alright. Well, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, the the biggest gap right now is really on the on the data governance side and the data lineage side, especially with all of the

privacy regulations that are coming out, CCPA in California, GDPR over in Europe,

We have, I'll say, fairly

well known best practices for data governance and data management

that have been around for a decade or more.

People implementing

it has become the challenge and now people have to implement that,

trying to put together a full end to end tool set. You know, there's a couple of companies. Erwin Erwin has some data governance tools. There's a number of others out there that are there are stuff like partners as well. There's data cataloging from Alation. But trying to put that all together into a

solid end to end program. So, yeah, it's great that we're generating Data Vault models, and,

we've retained all the data.

We have all that data available for the lineage and traceability.

You know, putting the right tooling in place around that

so it makes it easy for people to find that data and ask the question and say, where did this data come from? Where where did the data go? You know, all of those sorts of things. And that's a challenge today, and I'm getting a lot of questions about that these days. How do people go about doing that?

It's something that's a problem that's being solved. It's been solved in part over the years, but now people really have to start paying attention to it and making sure it's part of their overall,

data management program. Well, thank you very much for taking the time today and sharing your experience and expertise on data vault modeling. It's definitely very interesting and useful approach to handling data, particularly in the current day and age of having so many different data sources that we have to deal with. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day. Alright. Thank you very much. Thanks for having me.

Listening. Don't forget to check out our other show, podcast.init@pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com

with your story. And to help other people find the show, please leave a review on Itunes and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Links