Data Modeling That Evolves With Your Business Using Data Vault - Episode 119


Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place, without requiring destructive transformations or massive up-front design, to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow-moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!

Enabling real-time analytics is a huge task. Without a data warehouse that outperforms the demands of your customers at a fraction of cost and time, this big task can also prove challenging. But it doesn’t have to be tiring or difficult with ClickHouse, an open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable revenue. And Altinity is the leading ClickHouse software and service provider on a mission to help data engineers and DevOps managers. Go to to find out how with a free consultation.


  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to for a free consultation to find out how they can help you today.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape


  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?
    • What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?
    • How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?
  • What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or outright project failure?
  • What are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses?
    • How has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?
    • Is there any utility in data vault modeling in a data lake context (S3, Hadoop, etc.)?
  • What are the steps for establishing and evolving a data vault model in an organization?
    • How does that approach scale from one to many data sources and their varying lifecycles of schema changes and data loading?
  • What are some of the changes in query structure that consumers of the model will need to plan for?
  • Are there any performance or complexity impacts imposed by the data vault approach?
  • Can you talk through the overall lifecycle of data in a data vault modeled warehouse?
    • How does that compare to approaches such as audit/history tables in transaction databases or slowly changing dimensions in a star or snowflake model?
  • What are some cases where a data vault approach doesn’t fit the needs of an organization or application?
  • For listeners who want to learn more, what are some references or exercises that you recommend?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?


The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. And setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost you might expect. You deserve ClickHouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity, that's A-L-T-I-N-I-T-Y, for a free consultation to find out how they can help you today. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers.
You don't want to miss out on this year's conferences, and we have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in New York, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Kent Graziano again, this time about the data vault modeling approach and the role that it plays in the current data landscape. So Kent, can you start by introducing yourself?
Kent Graziano
Sure. I'm Kent Graziano. I am the Chief Technical Evangelist for Snowflake, and I'm also a blogger, on my blog, The Data Warrior. Over the years I have been involved in data management, from the traditional relational databases doing OLTP-type modeling, to getting into data warehousing in the mid 90s. I was a student of Bill Inmon and Claudia Imhoff, got my early start working with them and co-authoring a book with Bill on standard data models. Then, eventually, I did run into this method and approach called data vault in the early 2000s, and became sort of, I guess, a disciple or an acolyte of that particular approach to data warehousing and modeling, and have been working with that now since, again, the early 2000s.
Tobias Macey
And do you remember how you first got involved in the area of data management?
Kent Graziano
Yeah, actually, I started off as a scientific programmer doing BASIC and FORTRAN, and was doing contract work with the US Department of the Interior out of Denver shortly after I got out of college. I happened to get into a position working with the Bureau of Mines, and the manager for the consultancy that I was working with came to me one day and said, hey, I just came back from a seminar put on by IBM about this thing called relational database design. Would you be interested in learning about that? And, you know, it sounded pretty interesting. I had been working with data one way or another for a number of years at that point. And he then went through sort of a tutorial with me about what relational theory was all about, and then had me do a little POC with a small database. After going through that, we decided that that was the approach to solve one of the problems for the Bureau, and decided to go pick up an early copy of Oracle, it was actually Oracle version five, and I got to install it, configure it, and design the very first system that the Department of the Interior used on an Oracle relational database. And it was kind of, you know, almost love at first sight. For some reason or other, the relational concepts and the modeling approach just appealed to me. And the rest, as they say, is history. I mean, that was a long time ago, in the last century, and I've been at it ever since. So really, over 30 years working specifically in data management and, you know, 25 years now in data warehousing. I rapidly transitioned from the OLTP world to data warehousing and business intelligence as soon as I was introduced to it.
Tobias Macey
And for people who are coming from online transaction processing with relational schemas, they're probably at least somewhat familiar with the idea of third normal form and having to break apart the different records in order to be able to keep them clean and manageable, and then joining across multiple tables. Whereas in the data warehouse landscape, there are different approaches where you have different constraints and optimizations that you're trying to work for, where you might denormalize things a little bit. And I know that one of the more popular approaches is the star and snowflake schema, with, you know, snowflake being a slightly more special-case version of the star schema, as I understand it. And this is where data vault modeling in particular comes into play, is in this data warehouse landscape. So I'm wondering if you can just give a bit of an overview of what data vault modeling is, and some of the ways that it differs from those other approaches, such as third normal form or the star and snowflake schemas?
Kent Graziano
Sure. So the modeling technique itself is a pattern. In data vault we have several types of objects that we model: hubs, links, and satellites. Hubs are really a list of what we call business keys, or natural keys, as anyone who's done third normal form might have referred to it. And people who are familiar with dimensional modeling will know it as a durable key. And so the hub tables are just very narrow tables where we're really just listing out the key business identifiers for the data that we're interested in tracking in our data warehouse. The links are relationships between those. Again, for folks who are familiar with relational design concepts, it's often called a many-to-many intersection entity. And so any relationship between key objects that you're seeking to track is instantiated in a link table inside of the data vault approach. And then the satellites hang off of either hubs or links, and those are all the descriptive attributes that are directly related to either a hub or a link. And the data vault approach was designed very carefully in keeping with
Codd's relational theory about how you structure the data. So in normalization theory, we talk about having the data be related specifically to the key. So the hub being the values of all the keys for the objects, and then the satellites having the attributes that are fully dependent on that key. As an example, we'll say, for Kent Graziano as an individual, that's the key for a person hub. My birthdate is fully dependent on me, Kent Graziano, not on anything else, right? It's not associated with, say, an order that I placed. Now, a company I purchased something from may want to know what my birthdate is for demographic tracking, but my birthdate isn't really related to the order, it's related strictly to me. So attributes like that will be in the satellite hanging off the hub. Whereas for things that are related specifically to the order, an order is going to be a link in data vault, where it's a relationship between a person and a product, and perhaps a company, and maybe even a salesperson; there may be multiple relationships involved that define an order. The attributes that are going to be in a satellite off of the order are things like the quantity, any special discounts that were specific to that order, the taxes that were collected on that particular object that was sold. And so it all is in keeping with relational theory. So for people who are familiar with third normal form, data vault is in many ways a higher level of normalization. I had a conversation with C.J. Date, who is very famous for having written a number of books along with E.F. Codd, the creator of relational design; C.J. wrote a lot of books with him and independently from him about relational design. And we had a conversation one time about, well, when you look at data vault, what is it in terms of normalization?
And he said, well, really, when he looks at it, hubs and links fall into the category of what's known as sixth normal form, because they really are just keys, they're key values. And then the satellites are third normal form. And so it really is, I'll say, an extension of third normal form, but it's really further normalization, to break the structures down to a more granular level. Where on the dimensional side, the star schema side, we tend to do denormalization, where you've got facts, which are the metrics, the numbers that we're trying to do calculations against, perhaps, and the dimensions are all the descriptions about it. So you would have a customer dimension, a product dimension, things like that. And within those dimensions, we tend to collect a lot of attributes together that are important for analysis. And we refer to this sometimes as denormalization. So it's not third normal form; in many cases, it's not even first normal form, it would be denormalized, or non-normalized. And one of the characteristics of that is repeating values. So say we have a product dimension and some of the attributes of the product change over time. It's possible that you may have multiple rows in a table in the database, where that product is there three or four times and different attributes have changed, but other attributes are exactly the same, and so they've been repeated. From a relational perspective, the original design of third normal form was to eliminate the potential for what we call update anomalies. And that happened when you were, say, updating records and you changed one value in one row, but missed, say, the remaining five other rows that that value was also in. And that's in the denormalized mode. So when things were in third normal form, the goal was to have each and every value of an attribute only recorded once.
So that when somebody was looking at the system, there was no risk that there was going to be conflicts. This happens a lot in address matching: you know, if you've got the address stored in five different places in your system, and you only update one of them, well, then things don't come out right. Well, in data warehousing, we try to collect all those changes, and so we're not as concerned about update anomalies. We want to collect the change. And in the data vault world, we actually track those changes. So we can say, oh, here, the address for Kent last year was this, the address for Kent this year is something different. And we can see changes over time, which allows us to do things like analytics on the data and do trend analysis.
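The hub, link, and satellite structures Kent describes can be sketched with a few tables. This is a minimal illustration (using SQLite via Python for concreteness), not official Data Vault DDL; all table and column names here (hub_person, sat_person, and so on) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hub: a narrow table of business (natural) keys only.
    CREATE TABLE hub_person (
        person_hk TEXT PRIMARY KEY,  -- surrogate/hash key used for joins
        person_bk TEXT NOT NULL,     -- the business key itself
        load_date TEXT NOT NULL
    );
    CREATE TABLE hub_product (
        product_hk TEXT PRIMARY KEY,
        product_bk TEXT NOT NULL,
        load_date  TEXT NOT NULL
    );
    -- Link: a many-to-many intersection of hub keys (e.g. an order
    -- relating a person and a product).
    CREATE TABLE link_order (
        order_hk   TEXT PRIMARY KEY,
        person_hk  TEXT NOT NULL REFERENCES hub_person(person_hk),
        product_hk TEXT NOT NULL REFERENCES hub_product(product_hk),
        load_date  TEXT NOT NULL
    );
    -- Satellite: descriptive attributes fully dependent on the hub's key.
    -- Changes arrive as new rows; nothing is updated in place.
    CREATE TABLE sat_person (
        person_hk TEXT NOT NULL REFERENCES hub_person(person_hk),
        load_date TEXT NOT NULL,
        address   TEXT,
        PRIMARY KEY (person_hk, load_date)
    );
""")
conn.execute("INSERT INTO hub_person VALUES ('hk1', 'KENT-G', '2019-01-01')")
# Two satellite rows for the same person: the address changed between loads,
# so the full history is preserved for trend analysis.
conn.execute("INSERT INTO sat_person VALUES ('hk1', '2019-01-01', 'Denver, CO')")
conn.execute("INSERT INTO sat_person VALUES ('hk1', '2020-01-01', 'Houston, TX')")
history = conn.execute(
    "SELECT load_date, address FROM sat_person ORDER BY load_date").fetchall()
print(history)
```

Querying the satellite shows both address versions side by side, which is exactly the "address for Kent last year versus this year" comparison described above.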
Tobias Macey
Yeah, definitely. And particularly the aspect of being able to track the history of values over time is something that often comes up in transactional databases as well, because of wanting to maintain some sort of audit history. And so it's worked around by having some sort of history table, where you might have a trigger that actually replicates the row into the history table at the time that it gets updated, so you can try to do some of that tracking over time of how the value changed. Where a data warehouse is a much better place for actually keeping track of that information, but some places might not have a data warehouse, and so this is just a workaround to handle that aspect. Exactly. And then another piece that I find is interesting is the further normalization down to the key-value pairs, so that you do have that granular look at some of these attributes, for being able to be much more flexible in how you join across things. Versus in third normal form, where it's a little bit coarser grained, and so you join across a fewer number of tables, but you might be pulling in information you don't actually care about for the purpose of that analysis. And so I'm curious how things like that actually impact some of the performance and flexibility of the types of queries that you're able to execute in a data vault model, as opposed to something like third normal form.
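The trigger-based audit-history workaround mentioned here can be sketched as follows. This is a hypothetical example in SQLite syntax; the table names (customer, customer_history) and the trigger are invented for illustration, and real transactional databases each have their own trigger dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id      INTEGER PRIMARY KEY,
        address TEXT
    );
    CREATE TABLE customer_history (
        id         INTEGER,
        address    TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Copy the old row into the history table whenever the live row changes.
    CREATE TRIGGER customer_audit
    BEFORE UPDATE ON customer
    BEGIN
        INSERT INTO customer_history (id, address)
        VALUES (OLD.id, OLD.address);
    END;
""")
conn.execute("INSERT INTO customer VALUES (1, 'Denver, CO')")
conn.execute("UPDATE customer SET address = 'Houston, TX' WHERE id = 1")
# The pre-update value was captured automatically by the trigger.
audit_rows = conn.execute(
    "SELECT id, address FROM customer_history").fetchall()
print(audit_rows)
```

The live table only ever holds the current value, while the history table accumulates superseded versions, which is the workaround being contrasted with a proper time-variant warehouse.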
Kent Graziano
Right, what you're touching on is exactly some of the reasons that data vault evolved. By definition, say, you know, Bill Inmon, in his original definitions of a data warehouse, talked about it being time-variant, and that meant we would be able to see changes over time. And like you said, people in many cases were trying to do that in their operational systems and having performance issues, because the operational system wasn't designed to handle that data, and when you tried to run a report, if you wanted the history, you were having to join together multiple tables. So we started evolving, in the data warehousing world, to store all of that, and what happened is often people took the third normal form model and put it into a data warehouse, which really, in the early days, meant they put it on a different server in a different database and started collecting the history. Well, how are you going to collect the history without overwriting things constantly? Well, they discovered, and this was Bill Inmon's approach, to put a snapshot date on it. So now you had maybe a customer table that would have the same customer in it multiple times, but with different snapshot dates, and so now you can start doing that trend analysis. Well, over time, you know, a number of things happened with that. From a performance perspective, depending on the platform you're on, you started kind of hitting the max capability to aggregate and read all of that data, and got into a lot of, I'll say, fancy performance tuning and optimizations that the various database vendors added in to handle this. Because most of the databases, well, they all started out as OLTP, and eventually they started adopting and adapting for data warehousing to do those kinds of queries. So as the data got bigger and bigger and bigger, queries got slower and slower and slower.
So there was that aspect to it. Then you started getting more specialized technologies and appliances that performed better and were designed for this type of query. Now, what has complicated it, and this is kind of where data vault comes in, is with the third normal form approach with snapshot dates, the queries get very complicated. So if you had a master-detail-detail type of association, and you've got snapshot dates at all levels, and data is changing at different rates in the different tables, trying to put together the query to get a single point-in-time view, you know, what did that look like last week? You know, some of the data has changed, some of the data hasn't changed, there are a lot of subqueries having to go on to try to line that all up, and BETWEENs, and that gets problematic, and it's very hard to maintain. The other issue you get is if you try to integrate multiple source systems with that kind of model, it's very hard to get a common model and get it aligned. And then, you know, the worst-case scenario is your source system changes, and you've got this model, you know, five levels deep with snapshot dates, and the source changes, whether it's new attributes or new tables, and refactoring that model can be very expensive and very time consuming. So this was one of the problems that data vault was really looking to solve. Then on the other side, in order to try to get performance better, we have the dimensional modeling approach. And one of the things that I've seen over the, you know, bunch of years that I've been doing this is that the dimensional model, or star schemas as they're often called, really is, to me, an articulation of a pivot table in a database format. And many business analysts think very well in pivot tables, and it's a very great paradigm for doing business analytics. Well, trying to do that on a third normal form model just didn't work in the older technologies.
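To make the point-in-time complexity concrete, here is a small sketch of the kind of query that snapshot-dated tables force on you. It is a hypothetical two-table example (customer and orders, with invented data), again using SQLite through Python; each table needs its own correlated subquery to find the latest snapshot at or before the as-of date, because the tables change at different rates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER, name TEXT, snapshot_date TEXT);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER,
                         status TEXT, snapshot_date TEXT);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "Acme Corp",     "2020-01-01"),
    (1, "Acme Corp Ltd", "2020-03-01"),   # the name changed later
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (100, 1, "open",    "2020-01-15"),
    (100, 1, "shipped", "2020-02-15"),
])
# "What did this look like as of 2020-02-20?" Every snapshot-dated table in
# the join needs its own MAX(snapshot_date) subquery, and with five levels
# of master-detail this gets very hard to write and maintain.
as_of = "2020-02-20"
row = conn.execute("""
    SELECT c.name, o.status
    FROM customer c
    JOIN orders o ON o.cust_id = c.cust_id
    WHERE c.snapshot_date = (SELECT MAX(snapshot_date) FROM customer
                             WHERE cust_id = c.cust_id
                               AND snapshot_date <= ?)
      AND o.snapshot_date = (SELECT MAX(snapshot_date) FROM orders
                             WHERE order_id = o.order_id
                               AND snapshot_date <= ?)
""", (as_of, as_of)).fetchone()
print(row)
```

With only two tables this is already verbose; the maintenance burden grows with every additional snapshot-dated table in the join, which is the pain point the interview describes.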
So people started using dimensional models and doing the denormalization and some pre-aggregation. So we run a process to put all this data together in a form that's easier for a business user to consume. It takes up a little more space, because now we're introducing redundancies in the data set, but it would often perform better, because the queries are less complicated, you know, fewer joins, and in the process of building the dimensional model you deal with all the snapshot dates and things like that. So the downside of that was that you've got a fixed model: you've got a fact and, let's say, five dimensions. Now the same problem happens here. You get a new source system, and say your customer dimension has 20 attributes in it already, from just one source system. Now you get a second source system that has maybe 40 attributes, some of which are similar or equivalent to the 20 you already have, and the others brand new. Well, two things: if you need all that information for your analysis, you need to re-engineer or refactor that customer dimension, and that means adding a bunch of attributes. Well, that's great; so for the net new attributes coming from source system two, you add them in and modify your ETL process to populate those. Well, what about the first 20 that came from source system one that have equivalents in source system two? Now you have to deal with an order of precedence. And I did this in one of the very large data warehouses I worked on for HP years ago. We had to decide, and it wasn't just two, we had five source systems. So we had to say, for column one, customer name, we can get it from five different source systems; which name do we want in our reporting system? And so we had to write code that said, if source system one has the name, then we use it. If for some reason it's null in source system one, then we look at number two, and then we look at number three, and we look at number four, and we look at number five.
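The order-of-precedence rule just described boils down to taking the first non-null value across source systems, which SQL expresses with COALESCE. This is a simplified sketch with an invented staging table (stg_customer) holding one name column per source system, not the actual HP implementation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One staging row per customer, with the name as seen by each source system.
conn.execute("""
    CREATE TABLE stg_customer (
        cust_id INTEGER,
        name_src1 TEXT, name_src2 TEXT, name_src3 TEXT,
        name_src4 TEXT, name_src5 TEXT
    )
""")
# Source systems 1 and 2 have no name for this customer; 3, 4, and 5 disagree.
conn.execute("INSERT INTO stg_customer VALUES "
             "(1, NULL, NULL, 'Acme Corp', 'ACME', 'Acme Inc')")
# Order of precedence: use source 1 if present, else 2, else 3, and so on.
name = conn.execute("""
    SELECT COALESCE(name_src1, name_src2, name_src3, name_src4, name_src5)
    FROM stg_customer WHERE cust_id = 1
""").fetchone()[0]
print(name)
```

The real version described in the interview was considerably messier, since equivalences between columns across systems rarely line up this cleanly, but the precedence cascade itself is exactly this shape.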
And of course, that's a very simple example, it was never actually that simple. But so there's a lot of coding involved, and having to not only refactor the design, but also refactor the load process. And that's, again, very time consuming. And as you're getting into bigger and bigger data sets, you start talking about, you know, terabytes of data and billions of rows of data, and it takes a lot of time and a lot of cost to re-engineer that. The approach with data vault, as you said, getting down to a more granular level, was designed to eliminate that refactoring, to be able to integrate new sources much quicker, but also have a model that would be resilient, that's going to be business oriented, business focused, so the business keys or natural keys, and integrating across those, because you hope that, you know, across your five source systems there's a common set of attributes that you can align on. Now, from a performance perspective, one of the things that means is, in the early days of data vault, people looked at it and went, wow, that's a lot more tables, especially if they were used to dimensional models. Dimensional models are very clean, you know, a fact table, five dimensions, great. You break that out into a data vault model, and it can be... I did one of these one time where I had a dimension that had three layers of hierarchy in the dimension, and by the time we finished the data vault model that was behind that dimension, we had 25 tables. And so depending on the database engine you're working on, that may look like, well, I don't know how we're ever going to get a query to perform to get a report out. And part of the data vault, and I'll say methodology now rather than modeling technique, but part of the data vault methodology, was that the data vault is the core of your enterprise data warehouse. It really is the history of all the data, time variant, that you're trying to store.
But for reporting, we still tend to project those into dimensional models, because most of the BI tools tend to think in terms of facts and dimensions, and in the early days, again, performance was potentially an issue with that many joins. They're simple joins, they're not like the complex joins in trying to do the snapshots in third normal form, but there are a lot of them. And depending on the platform, you might be able to make those work really well and just do views. This is particularly true in Snowflake; we have a lot of customers that are doing that. They're able to just put a view on a bunch of data vault tables, and it looks like a dimensional model. That's what gets exposed to the reporting side. Previously, we would just, you know, write a little ELT process that would turn that into the dimensional model, and project that data from the data vault into a dimensional model, again, in order to get the performance. And one of the conclusions that I've come to over the years, when people ask, well, you know, why did we have dimensional models? It was twofold. One is a business perspective, because it puts the data in a format that is easy to consume by most business analysts. And the other was a performance one, as you kind of indicated earlier: we needed to be able to have the queries be fast on large sets of data. And it was simply because the underlying technology couldn't handle the complex joins with the lower level of granularity that you get with third normal form, and then even lower with data vault. Data vault allows you more flexibility in the questions that you ask of the data model and your data warehouse. You don't have to know all the questions to design the schema; you have to know the business and the business semantics and the context of the data to start building a data vault. Where with a dimensional model, you kind of need to know what questions you're going to be asked.
So if somebody says, I need reports, I want to know sales by region, by customer, and by product, well, that's a particular dimensional model, a fact and a couple of dimensions. So you go and build all of that out, and then they come back and say, oh, I also need to know the sales rep. Well, that's a different dimension, and that's potentially a different level of granularity in the fact table, and that means building or rebuilding that fact table and adding a dimension to it. So again, it's a refactoring or re-engineering that can be very time consuming, especially if this is the only place you've stored the data. In the data vault world, you simply go back to your data vault and say, okay, where's our information on sales rep? And then we can add that dynamically into the model and into the design of the reporting layer, again, giving us that much more flexibility to be responsive to changing business requirements. And if we don't have the sales rep information, because of the granularity at which we model in data vault, we go get that information, and we model the hubs and links appropriately to connect all that data together. And so even if we had requirements that we didn't know about, we can evolve the model organically, without having to re-engineer what we already did. And that's actually one of the huge benefits that I saw in data vault, especially as we moved into trying to do more agile approaches to data warehousing. And we've got, you know, pressure to deliver faster. How do we do that? And the data vault modeling technique is the only one that I'd seen that really could accommodate that, that could allow you to model in a very incremental manner, stay aligned with the business, and deliver, but not box yourself into a corner in a way that's going to require a lot of work if you didn't quite get it right the first time.
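The view-over-the-vault pattern Kent mentions, projecting hub and satellite tables into something that looks like a dimension for the BI tools, can be sketched like this. The names (hub_customer, sat_customer, dim_customer) are invented, and this uses SQLite for the demonstration rather than Snowflake, but the shape of the view is the same idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY,
                               customer_bk TEXT);
    CREATE TABLE sat_customer (customer_hk TEXT, load_date TEXT,
                               name TEXT, city TEXT);
    -- Project the vault tables into something that looks like a dimension:
    -- join hub to satellite and keep only the most recent satellite row.
    CREATE VIEW dim_customer AS
    SELECT h.customer_bk AS customer_key, s.name, s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hk = h.customer_hk
    WHERE s.load_date = (SELECT MAX(load_date) FROM sat_customer
                         WHERE customer_hk = h.customer_hk);
""")
conn.execute("INSERT INTO hub_customer VALUES ('hk1', 'CUST-42')")
conn.execute("INSERT INTO sat_customer VALUES "
             "('hk1', '2019-06-01', 'Acme Corp', 'Denver')")
conn.execute("INSERT INTO sat_customer VALUES "
             "('hk1', '2020-01-01', 'Acme Corp', 'Houston')")
# The reporting side sees a flat, current-state dimension; the vault
# underneath still holds the full history.
dim = conn.execute("SELECT * FROM dim_customer").fetchall()
print(dim)
```

The underlying satellite keeps every version, while consumers of dim_customer see a single current row per customer, so the BI tool never has to know the vault exists. Whether this performs well enough as a view, or needs to be materialized by an ELT process, is the platform-dependent trade-off discussed above.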
Tobias Macey
Yeah, there's definitely a lot of great information to pull out from what you were just saying. So some of that being things like, in the dimensional approach, particularly in the, you know, early to mid 90s, and maybe even into the early 2000s, a lot of the processes to get data into data warehouses were heavily oriented around these ETL workflows, where there was some amount of data destruction that happened in order to be able to denormalize and sort of canonicalize different records to load into the database, and things like master data management for being able to resolve some of the situations that you were talking about, with having customer information across five different databases, and which one is the right one, which one do we want to pull it from. And then also things like, how do we handle integrating additional data sources in a fashion where we're able to do it rapidly, to be able to answer questions at the time that they're being asked, rather than five months from now, once I've actually finally gotten the transformations in place to load in those additional fact tables and build the additional reporting on top of that. And so I can definitely see where data vault is a methodology or approach that has a lot of value for the current landscape of data, where we do have an explosion of different sources that we're dealing with, not even all of them being native to the business that we're working in, where we have different vendors that we want to be able to pull data out of to be able to enhance our reporting capabilities.
And so I'm wondering if you can maybe give a bit more of the historical context of how data vault came to be, the existing state of the art at the time, and some of the pressures that led to the 2.0 version of the data vault methodology, and a bit more about your investment in this approach and why you're motivated to spend so much of your time and focus and energy on helping to promote and teach data vault modeling. Sure,
Kent Graziano
yeah. It's always good to have historical perspective to know, you know, how did we get here? Well, the way we got here was really Dan Linstedt, who's the inventor of this approach. In the 90s he was doing big data before anybody knew what big data was. He was challenged with building a large enterprise data warehouse for a US defense contractor. He knew Bill Inmon. He was adept at both dimensional modeling and third normal form modeling, but the cost of the refactoring, as the requirements rapidly evolved, was problematic. He was in the terabyte range of data back then, and trying to reload a dimension that's a terabyte just took way, way too long. So he couldn't afford the reengineering that was required as his requirements were changing. So he set about trying to figure out, mathematically, what would be a different way to model these things that would give us that flexibility. And through nearly a decade of research and development, from 1992 to 2000, he evolved this approach that we now call data vault, and then in 2001 he published a series of articles on TDAN.com introducing the world to data vault. I was pretty heavy into data warehousing at that time. As I said, I had co-authored a book with Bill Inmon, I'd been teaching Inmon's methodology for creating enterprise data warehouse models, and I had experience with doing the Kimball dimensional approach as well for various customers. At the time I was a consultant with a data consultancy in Colorado, so I had experience with both of the other approaches. And I had the opportunity to attend a lunch and learn that Dan gave introducing the data vault concepts, and I could see what he was doing. He did a great job of explaining the pros and cons of his approach versus third normal form versus dimensional, much of which I've already recounted to you.
And so that got me interested. I did a little more research, asked him a lot of questions, and found out that he had been in contact with Bill Inmon. So I had some conversations with Bill, and Bill, who really kind of invented the original way of doing data warehousing, said, hey, I think Dan's got something here. And it does indeed look like it solves all of the problems that people were running into, particularly at scale. In the early 2000s we were just starting to get into that area, which we called VLDB, very large databases, which then eventually became very large data warehouses. And then fast forward another decade, we're at big data, right? We're talking about big data, and we've gone from talking about gigabytes and hundreds of gigabytes of data, to terabytes of data, and then hundreds of terabytes of data. And lucky for many of us as practitioners, Dan was in an industry that had that kind of data long before most of us ever saw it. And so it really did seem to solve these problems in a flexible way. It really was pre-agile; Dan was trying to solve an engineering problem to be cost effective and deliver business results. Well, 2001 comes along, we have the Agile Manifesto, and that's exactly what the agile software movement was all about. So Dan was kind of ahead of the game with the thoughts about how you deliver these things faster from a data perspective, and the two kind of blended together. Then I happened to be starting a new project at the time, so I had the opportunity to build a greenfield data warehouse using this technique under Dan's mentorship, with him checking my models, checking our processes, and helping teach my team how to go about doing this. And it proved to be very, very effective. Because it was so effective, in the IT department I was in at the time, my team was the most productive team in the department.
We were delivering changes on a weekly basis to an enterprise data warehouse in the public sector. And so that certainly made a believer out of me. In the process of that, you know, I became friends with Dan and helped him write the first two books on data vault, because I did become a believer in this approach, having done the other two prior to that and seeing the architecture and the methodology. I got introduced to agile at the same time, and saw how the two kind of blended together, that we could actually build an agile data engineering framework using the data vault approach. And, you know, to your point about why I have invested so much time and energy into promoting the data vault approach: that's why. Because it's really about changing the perceptions of data warehousing and analytics as being these monolithic, long-term projects that, depending on the statistics you've read over the years, 50 to 80% of the time are deemed to fail. Well, the reason they mostly fail is because they're not delivering the business value, or they're not delivering the business value in a timely manner. So seeing an approach that I've been successful with for over a decade now in delivering quickly and being able to adjust to change, that's why I talk about it. And I tell people about it when they ask, and do a lot of that even in my role at Snowflake. I'm finding more and more of our customers are looking for a better way, and when somebody calls and says, hey, we need you to talk to this team about data vault, I always start off with: well, why are you looking at data vault? What's brought you to this point? Often I'm talking to people who, like myself, may have a decade or more of experience in doing data warehousing, have had successful careers, and are at successful companies.
And you know, now they're moving to the cloud, and they're going to move to Snowflake, and they're looking at data vault, like, why? And then they recount to me many of the things that we talked about earlier as to the issues they have in maintaining their dimensional models or their third normal form models, how fragile an ecosystem it was, and how difficult it was to make changes and apply changes. And the reality that everyone sees today is that the world, the business world in particular, is changing rapidly. The data sources are changing. There are new data sources every day, and companies that are at the forefront of their industry want to be able to take advantage of that data. Which means, from a data warehousing perspective and an analytics perspective, they need to be able to onboard that data quickly and get it into a form that people can access and do the analytics on. And of course they don't want it to be completely independent of the data they already have; they want to integrate it. And so the structure with hubs and links allows companies to do that very quickly. So I'm seeing a huge, massive increase in interest in data vault in the last three years. This past year, I have spoken with over 100 organizations, some of which are already doing data vault in their current environments, others who are investigating it because of the problems they're seeing. And they want to take the opportunity, as they enhance and modernize their data ecosystem, to also enhance and modernize their data engineering and their data architecture approach.
Tobias Macey
And one of the things that is interesting in what you mentioned earlier is the fact that you had the opportunity to build a greenfield data warehouse using data vault. And I'm curious what your experience has been in terms of trying to bring this methodology into an existing data warehouse that maybe already has dimensional models set up, and then being able to retrofit some of these hub and link tables and satellite tables into the overall table structure, and then maybe gradually replace some of those dimensional tables with views on top of the information that's held in the underlying tables that are modeled more in the data vault methodology.
Kent Graziano
Yeah, I'm seeing more and more of that. I actually had an email last night, somebody asking exactly about that, about a customer that's now looking at taking their traditional data warehouse model and moving into a data vault style warehouse, and asking, you know, what's the best way to do that. And what I've seen people do is, you can start with the dimensional model. If you understand the data vault concepts, if you really understand the concept of business keys, you can effectively reverse engineer a dimensional model into a data vault model. And I've gone through exercises like that. In Dan's original coursework, the first class I ever took from him back in the early 2000s, that was one of the exercises we did in the class: because people understood dimensional models, it was, well, let's take this dimensional model and turn it into a data vault model, so you can see the correlation. And of course, if you can reverse engineer from the dimensional to data vault, then projecting a dimensional model on top of the data vault becomes fairly obvious. So that is one of the approaches people will often take. I've also been in organizations that just went with, let's start over: do a business model first and drive it truly from the business side without even looking at any of the source systems, and then start doing the mapping of the source systems into the data vault. So it really kind of depends on where the organization is in their lifecycle and in their maturity. Now, one of the keys, though, is there are some nuances to data vault, and every system is a little bit different.
And I've found that, as simple as data vault looks on the surface, when you start getting into integrating multiple systems, organizations that are not very mature in their data management, that don't have, like, a good semantic dictionary of business terms, sometimes struggle in coming up with the right data vault model. If they have an existing model, it's a little bit easier to back into the data vault model. But there are definitely challenges there, where we find people really do need to go take data vault boot camp classes, get certified, and work with a data vault mentor to review their work, like I did. I mean, truly, I attribute the success I had completely to the fact that I was able to have Dan as a mentor and a coach working with my team. Now, that means the organization you're working with has to be willing to fund the ongoing training and professional development and the coaching and all of that, but that really is kind of the key to success with anything new, any new technology. I mean, you can figure out some of it on your own, but it's often good to have someone looking over your shoulder to make sure you don't make any mistakes.
Tobias Macey
And then another interesting element to explore: for the most part, we've been focusing our conversation on applying these methodologies specifically to data warehouses and relational engines. In the past decade plus, maybe even the past 20 years at this point, there has been a lot of churn in terms of what people are viewing as the state of the art for handling large volumes of data, but sometimes even small, where things like Hadoop and data lakes have been coming more to the forefront, and there's been a bit of ebb and flow there. There's been a rise, and somewhat a fall, in popularity of non-relational engines, or the so-called NoSQL movement. And then things like semi-structured or unstructured data being landed in different storage locations to then be analyzed or reformatted later, and some of the new cloud data warehouses allowing for ingesting that semi-structured or unstructured data and then either doing the transformations within the database engine using an ELT type approach, or just building views on top of that data. And so I'm curious, from a high level, how you view that as having impacted the overall approach to data modeling and the state of the art in terms of how people think about data modeling, but also whether there is any useful application of the data vault methodology to something like a data lake or a non-relational engine?
Kent Graziano
Yeah, there's a couple of parts to that question. Unfortunately, with the advent of NoSQL about a decade ago, a little bit of a movement started to say that we don't need to do data modeling anymore. And unfortunately, I'd say that set us back a little bit, especially in the analytics space. As soon as a manager hears that, oh, we don't need to do something, they tend to latch onto it, because, whoa, that means we can move faster, right? We can get there faster, because I can eliminate all of this work we were doing. And unfortunately, the reality was that people were following a waterfall method rather than an agile approach, and taking way too long to do data modeling. The concept of schema on read for semi-structured data in particular is really a great concept. For application developers, it allowed them to be much more productive and deliver software faster, because they're writing out, basically, a key-value pair tagged JSON document into a document store of some sort. So they didn't have to do a lot of modeling up front; they could start writing their applications without having to go through trying to do a third normal form model, things like that, like you would have done in a relational database. So it was a great productivity boost. The downside of schema on read is that when you go to read the data, and you want to write reports on the data, you have to understand the design. So at some level, you have to start interpreting that schema, and so the modeling aspect, if you will, got pushed downstream to the output side. And of course, data warehousing is all about the output and the analytics. So somewhere along the line, the concept of a model, maybe a conceptual model, sometimes a physical model, has to be laid on top of that semi-structured data.
Now, as you said, newer technologies have evolved to do things like being able to write a SQL statement against data like that, and that's something that Snowflake is known for with our VARIANT data type. That does allow people to load semi-structured data in a schema-less fashion into a table that has one column in it, right? So there's not a lot of modeling involved there: you create a table with a variant column in it and load the data in. But then to get it out, we'll write SQL against that to represent the data in a form that's consumable by either an analytics user, or potentially a data scientist, or to be fed into a dashboard. And depending on what technology you're working with, you have a variety of tools available to do that. So the impact on modeling from this is that we're now coming back around to understanding that modeling, and the skill to do it, is still valuable and still needed. In fact, at a recent data modeling conference I was at, we had someone from Amazon talking about their move from a structured relational database to a NoSQL-style database for their operational systems. And even the lead architect who was talking about that said, because of the nature of the document store, he discovered he still needed to model. He still needed to think about how he structured those documents and how he organized them in the NoSQL database in order to get optimal performance for, say, the order entry system, and certainly for doing the reporting afterwards. There were a lot of dependencies on how he organized the data as to how well it actually performed. So it really came back to a fit-for-purpose type of data model that he was having to do. Modeling is starting to make a resurgence; that was the first time in a decade that I'd heard somebody talk about that.
But I've heard three such talks in the last year now, where they're saying, yeah, we really need to do some, in these cases, physical modeling in order to optimize. Now, where that flows over into this whole data lake concept and data vault, and where that's all related: as recently as this week I talked with a prospect that has implemented a data vault methodology in a NoSQL-type environment. I even saw, at the data vault conference in Germany at the end of last year, somebody doing data vault in MongoDB. So it is still transferable, because these are concepts about how you structure the data; the ideas of hubs and links carry over, and how you do it physically ends up being the question. Going back at least two years, probably more, even at the annual data vault conference that Dan runs, there have been discussions and demonstrations of doing data vault on Hadoop. And I actually know of one very large customer in Australia where half of their data vault is in Hadoop and half of their data vault is in Teradata. Now, you earlier asked a question about Data Vault 2.0 and where that came from. That particular scenario is in part where Data Vault 2.0 came from. The idea of using a hashed primary key, a primary key that's generated by applying a hash function to the business key columns in the data vault model, came about in that scenario because they had two issues. One, in Data Vault 1.0, we were using surrogate keys, very similar to dimensional models, and they were having throughput issues with their sequence generator. They were loading data so fast that the bottleneck was the sequence generator coming up with the next number. And so they got to thinking, well, how can we parallelize this? Because we're using an MPP machine and we want to use multiple threads and multiple CPUs, we need a different way to calculate the primary keys.
But in tandem with that, some of the data was over in Hadoop, and Hadoop doesn't have a sequence generator anyway. What they came around to was using an MD5 hash, because they could execute an industry-standard MD5 hash function on a set of attributes on the relational side and on the Hadoop side and get exactly the same value. So they were able to form a logical join between the data on the Hadoop side and the data on the relational side. And that was part of the rationale for the evolution of Data Vault 2.0, in particular using the hash keys: now we could load very rapidly in parallel, basically calculating the key rather than waiting for a sequence generator to produce it. We're now calculating a unique key on the fly, and that means we can load hubs, links, and satellites all in parallel, rather than having to load all the hubs first to get the sequence numbers and then load the subordinate satellites and the associated links, where you always had to have a hub first. Since we're calculating the key with a hash function, we can load all these things in parallel, which means you can take huge advantage of an MPP-style approach. So I've seen it done in both worlds. In the Hadoop world, people were doing that because of the expense of storing all that data in a relational system, so this was one way they offloaded the expense. Now, today, with the evolution of blob storage, I think we've solved the expense problem, and Snowflake in particular, because underlying it is blob storage, and with our compression, has eliminated that particular barrier. There's no reason to be building your data vault outside of a relational environment like Snowflake, because we have solved the cost problem.
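To make the hash key idea concrete, here is a minimal sketch of the kind of key derivation described above. The normalization rules (trimming, upper-casing, a "||" delimiter between columns) are common Data Vault 2.0 conventions but are illustrative assumptions here, not a specification from the episode:

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic primary key by hashing the business key
    columns. Values are normalized (trimmed, upper-cased) and joined
    with a delimiter, so the same business key always yields the same
    hash on any platform (relational or Hadoop alike)."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hub, link, and satellite loaders can each compute the key
# independently and run in parallel: no shared sequence generator.
customer_hk = hash_key("C-1001")            # hub key from one business key
order_link_hk = hash_key("C-1001", "ORD-42")  # link key from two keys
```

Because the function is pure, two systems loading the same business key in parallel derive the identical surrogate key, which is what makes the logical join across platforms possible.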
So that cost has come down. On the semi-structured data side, getting back to the VARIANT data type that we have in Snowflake and how that relates to data vault, we've got some really interesting things going on with a couple of our customers. Dan and I came up with this idea of how you deal with semi-structured data in a data vault environment, exactly the question you're asking. And our conclusion was, well, since all the dependent attributes go into a satellite, can we just put the JSON document in the satellite? So we still have a little bit of structure: you still have the primary key to the hub, a load date, and a metadata tag for where the data came from, and then, in the Snowflake world, we could put a variant column in there and load a JSON document into it. As part of the ELT process, we'll pull out the business keys by querying that JSON, and put them in the hub so that we can easily find things. That really is kind of your data vault index, if you will; where all the data is, is in the hubs, joining down to the satellites. And one of the customers I'm working with has found that they have been able to accelerate the onboarding of data by at least tenfold, because they simply bring these JSON documents into what they call a raw data lake inside of Snowflake, where they're just loading the raw data into tables with a variant column. Then they go through their process of pulling out the business keys, building hubs and building links, and then dropping the entire JSON document into an attribute satellite, and they call that their raw data vault. Then they write views on top of that for the business vault. So you may have, say, a JSON document that has 100 keys in it, and the current requirement in the current sprint only requires three of those attributes.
So instead of trying to ETL all 100 of those keys out into columns and do the transformations in a batch process, they just write views, and the views are dynamic. If they've got requirements for three of those columns, the view only has three columns in it. The next sprint comes around, priorities change, whatever it is, and now we need ten columns out of that JSON; their process now is to modify the view, execute it, and they're good to go. So their turnaround time to deliver value from the data with changing requirements has gone down from what was probably weeks or even months in the old world, down to a couple of hours. And being able to do that so rapidly, and not having to parse all the data out on the way in, means their ingestion is very fast. So they've got a lot of data streaming in, and they're able to do something they haven't been able to do before: they're able to load in near real time into the data vault. And since there are views sitting on top of the data vault, as fast as that data is getting loaded, the next query that runs is able to see it. So they've eliminated the latency in their data pipeline as well, and really accelerated their ability to deliver the data to the business users in a much more timely manner, and really be responsive to the changing business environment.
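The raw-vault pattern described here (land the whole document, hash the business key, and project only the attributes the current sprint needs) can be sketched roughly in Python. All field names are hypothetical; in Snowflake the payload would live in a VARIANT column and the projection would be a SQL view rather than a function:

```python
import hashlib
import json
from datetime import datetime, timezone

def load_raw_satellite(doc: dict, business_key_field: str, source: str) -> dict:
    """Land a JSON document as a raw-vault satellite row: a hash key
    derived from the business key, a load timestamp, a record source,
    and the untouched document. Field names are illustrative."""
    bk = str(doc[business_key_field]).strip().upper()
    return {
        "hub_hash_key": hashlib.md5(bk.encode("utf-8")).hexdigest(),
        "load_dts": datetime.now(timezone.utc).isoformat(),
        "record_source": source,
        "payload": json.dumps(doc),  # full document, parsed only on read
    }

def project_view(row: dict, wanted: list) -> dict:
    """Schema on read: the business-vault 'view' pulls just the
    attributes the current sprint needs. Widening it later is a
    one-line change, not a reload of the satellite."""
    doc = json.loads(row["payload"])
    return {k: doc.get(k) for k in wanted}

# Land a document once; project different attribute sets over time.
row = load_raw_satellite(
    {"customer_id": "c-7", "name": "Acme", "tier": "gold", "city": "Denver"},
    "customer_id",
    "CRM",
)
this_sprint = project_view(row, ["name", "tier"])
next_sprint = project_view(row, ["name", "tier", "city"])
```

Nothing is re-ingested between the two projections; only the view definition changes, which is the source of the turnaround-time improvement Kent describes.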
Tobias Macey
So it's definitely easy, as we talk about this, to start thinking that data vault is the solution to all of my problems in terms of being able to handle modeling and accessing and storing all of my data in a very agile fashion to get quick time to value. But what are some of the cases where the data vault approach doesn't really fit the needs of an organization or a use case, or where it's unnecessarily cumbersome because of the size and maturity of the data or the institution that's trying to implement it?
Kent Graziano
Yeah, I mean, like everything in consulting, we always say it depends, right? Smaller, less mature organizations that don't have a competency in data architecture and data modeling certainly don't want to start here. If they don't understand their data, it's really hard to come up with a good data vault model. So starting with more raw data, putting views on that, and pointing reporting tools at it is often a better way. If you don't have many data sources, if you've got one or two data sources and your organization is very familiar and comfortable with dimensional modeling, then there's nothing wrong with doing that, because it is all about delivering the value, and if you're able to deliver the value using your current approach, then there's no reason to change. Because unfortunately, I have seen a number of data vaults that didn't work out, and when I talked to people, it usually was because they really didn't understand the approach, or they took shortcuts because they didn't understand the approach, and they actually ended up in a worse-off situation. They were truly trying to solve the problems that data vault solves, but they didn't quite get there, and they were maybe a little premature in adopting the approach. If you're dealing with, I'll call it true near-real-time analytics, where you've got IoT data and streaming data coming in, and even edge computing, where you really just have to mine a couple of bits out of that data that's coming in, and it's transient, you're not even going to store it, right? We certainly wouldn't want to put a data vault in the middle of that; that's going to take time to design, when you know it's just, okay, we've got a stream coming in from an IoT device, we know what we need off of it, and we need to be analyzing it as it comes in.
Because it's critical data and we're doing real-time monitoring, doing a data vault in that case is not good. Now, on the other side of that, if there's data coming through that near-real-time stream that you want to do trend analysis on over time, then you might consider building a data vault model on the other end, but not in the middle of the stream. That makes sense, right? So there are certainly situations where it's not going to make sense. There are any number of organizations out there that are making transitions from on-prem to the cloud, and my basic stance is, if it ain't broke, don't fix it. If the environment you're working in is serving the business needs, and you have an SDLC process in place that's agile enough for your organization, and you're meeting your SLAs, great. Maybe the reason you're moving to the cloud is because your data is growing and you need more capacity; you want the elasticity of the cloud, you want the dynamic nature of something like Snowflake, you want it as a service, you're wanting to eliminate some of the overhead of the management. Well, that doesn't necessarily need to affect your data modeling techniques. So it's always a balance, right? How are things working today? Where do you need to go tomorrow? Is data vault the right approach to help you get there? And, you know, there's nothing wrong with saying no, that it's not. It is definitely use case specific, and there are certain scenarios where, yeah, it's probably not a good idea. It may be over-engineering for some circumstances. But as soon as you start to get to: we want to persist data over time.
We know we've got two source systems, we know we're going to be getting new source systems, we know we're going to be having mergers, and it's going to grow. And if you've already experienced issues with maintaining the approach that you're currently using, whether it be third normal form, dimensional, or completely denormalized, well, then you would take a look at data vault and say, will this help us solve our problems, and what's a good roadmap for us to get there? Bill Inmon has always said don't do big bang when he talks about data warehousing, and I say the same thing about data vault: don't try to change everything at once, right? Don't have too many moving parts. Do it incrementally, and allow yourself time to learn and grow into the approach.
Tobias Macey
Yeah, boiling the ocean is never a good strategy. Yeah,
Kent Graziano
Yeah, you know, like I said, I was lucky in that the first endeavor I had with it was greenfield. There was an ODS model, which I had designed myself. It was an integrated ODS model, sort of system agnostic, so I already had business keys; we'd already figured that out in building out the ODS, because we knew our systems were going to be changing, so we built an abstracted ODS model. Having that, and then being able to build a data warehouse on top of it using the data vault approach, worked out quite well. And I've been able to do that in a couple of industries over my career, so I know that approach works. But I've also had to retrofit. It is timing, and it's setting the right priorities from a business perspective as to what you're trying to accomplish.
Tobias Macey
And so, in terms of the foundational skills and experience and knowledge that are necessary for effective data modeling, what have you found to be some of the core elements? And for listeners who want to learn more about data modeling in general, or data vault specifically, what are some of the references or exercises that you recommend? Sure, so I mean,
Kent Graziano
core fundamentals in data modeling and normalization, so understanding the concepts of normalization, are of great value. I've found that people who have only done dimensional modeling sometimes struggle with me going over data vault, because it flies in the face of what they've been trained to do. They've been trained to do denormalization, and they think in terms of facts and dimensions. Like I said, that's very valuable from a business perspective, to understand the data that way and present it that way. For folks who came from the OLTP background and have a background in third normal form, data vault makes sense to them, because it is just a slightly deeper level of normalization. So for people who are trying to get into this, there are any number of data modeling books out there you can find on Amazon about database design. David Hay has a great book, and there are lots of books out there on things like data modeling. On the data vault side, there's Super Charge Your Data Warehouse, which I co-wrote with Dan Linstedt, and then there is a Data Vault 2.0 book as well, Building a Scalable Data Warehouse with Data Vault 2.0. You can find links to both of them on my blog, kentgraziano.com. And then I have a book that I did, which is an introduction to agile data engineering using Data Vault 2.0, available in either Kindle or paperback format now. So those are good places to start. If you want just an overview of industry models, things like that, there is The Data Model Resource Book by Len Silverston. I co-authored the first edition of that book with him and Bill Inmon, and that gives some, I'll say, generic model examples for things like people and organizations, sales, things like that, and it has a number of chapters that
walk through converting those into a data warehouse model, and in particular shows examples of star schemas at different levels of granularity. If you're on the dimensional modeling side, there is of course Dr. Ralph Kimball's book, The Data Warehouse Toolkit. For a general introduction to the data warehousing world, there's Bill Inmon's original book, Building the Data Warehouse, which is probably in its, you know, seventh edition at this point. Out in the blogosphere, you can look up Dan Linstedt himself. He's on Twitter and LinkedIn, and he has a blog. There is a site dedicated to Data Vault now called the Data Vault Alliance, which is an online community. There are some introductory videos on there, a couple of introductory articles, and some blog posts, plus a forum you can subscribe to, to get into deeper conversations with practitioners. And then a number of companies offer Data Vault training. Dan Linstedt himself, of course, does Data Vault certification training; oftentimes he does private classes with corporations who have large teams that they want to get started on Data Vault. There is one of his partner companies called Performance G2, and they offer public Data Vault classes about every other month at various places in the United States. Anyone who follows me on Twitter will see me post when those come up; I try to keep the community updated on things like that. Over in Europe, Scalefree is the name of the company that runs Data Vault classes, also a partner with Dan, and then his certified training partner in Australia is Certus. All three of these companies, Scalefree, Performance G2, and Certus, are Snowflake partners as well, so if you happen to be looking at Data Vault in a Snowflake environment, they can all help you with that too. And of course Dan works around the world globally, so he is always available for consulting, coaching, and mentoring on all things Data Vault.
Tobias Macey
So, are there any other aspects of the Data Vault methodology, or data warehouse modeling, or anything tangential to what we talked about today, that you think we should discuss further?
Kent Graziano
There is another aspect of this, because Data Vault is pattern based, and there is a methodology that goes with it and an approach to loading the data that's very straightforward. You mentioned this very early on in our talk, about how we transform all the data and do all these ETL processes and things like that. In the Data Vault world, we load the data raw: we take it from the source system and put it into what we would call a raw vault, in order to maintain that traceability and auditability. And because of the patterns involved in Data Vault, there are data warehouse automation tools on the market, some of which specialize specifically in generating Data Vault style warehouses, and others that do third normal form, dimensional, and Data Vault. So that's another area for people to investigate: data warehouse automation. And, you know, at the fear of leaving somebody out, a number of companies have these things. There's WhereScape. erwin acquired a company about a year ago that has Data Vault generation capabilities built into it; it's a data intelligence tool. There's a company called VaultSpeed, and another one called Datavault Builder. And then over in Australia, Varigence has a product called BimlFlex that generates Data Vault style schemas as well. And there are others coming up all the time: consulting firms around the world who are doing Data Vault implementations, many of whom are developing their own tooling to accelerate the time to value of implementing a Data Vault style warehouse. So it's definitely worth looking into. Especially if you're looking at Data Vault to try to be more agile and more flexible in your approach, then throwing an automation tool or automation framework on top of that just makes sense to me.
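[Editor's note: to make the "pattern based" loading Kent describes concrete, here is a minimal sketch of the repeatable hub/satellite load pattern that Data Vault automation tools generate. All table and column names (`hub_customer_hk`, `load_dts`, etc.) are illustrative assumptions, not taken from the conversation or from any specific tool.]

```python
# A minimal sketch of the Data Vault raw-load pattern: the source record is
# loaded as-is, split into a hub (business key only) and a satellite
# (descriptive attributes), each stamped with load metadata for auditability.
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Deterministic surrogate hash key derived from one or more business keys."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def to_hub_row(business_key: str, record_source: str) -> dict:
    """Hub row: just the business key, its hash key, and load metadata."""
    return {
        "hub_customer_hk": hash_key(business_key),
        "customer_id": business_key,
        "load_dts": datetime.now(timezone.utc),
        "record_source": record_source,
    }

def to_satellite_row(business_key: str, attrs: dict, record_source: str) -> dict:
    """Satellite row: descriptive attributes keyed by the hub's hash key,
    plus a hash diff used to detect changes between loads."""
    hash_diff = hash_key(*[str(v) for _, v in sorted(attrs.items())])
    return {
        "hub_customer_hk": hash_key(business_key),
        "load_dts": datetime.now(timezone.utc),
        "record_source": record_source,
        "hash_diff": hash_diff,
        **attrs,
    }

# Raw source record, loaded without transformation (the "raw vault" idea).
source = {"customer_id": "C-1001", "name": "Acme Corp", "region": "EMEA"}
hub = to_hub_row(source["customer_id"], "CRM")
sat = to_satellite_row(
    source["customer_id"],
    {"name": source["name"], "region": source["region"]},
    "CRM",
)
```

Because every source delivers rows through the same three steps (hash the key, insert the hub if new, insert the satellite if the hash diff changed), the load is mechanical, which is exactly why automation tools can generate it.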
Tobias Macey
All right. Well, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Kent Graziano
You know, the biggest gap right now is really on the data governance side and the data lineage side, especially with all of the privacy regulations that are coming out: CCPA in California, and GDPR over in Europe. We have, I'll say, fairly well known best practices for data governance and data management that have been around for a decade or more; implementing them has become the challenge. Now people have to implement that, trying to put together a full end-to-end toolset. You know, there's a couple of companies: erwin has some data governance tools, and there are a number of others out there that are Snowflake partners as well. There's data cataloguing from Alation. But trying to put that all together into a solid end-to-end program is the hard part. So, yeah, it's great that we're generating Data Vault models and we've retained all the data, so we have all that data available for lineage and traceability, but putting the right tooling in place around that, so it makes it easy for people to find that data and ask the questions, "Where did this data come from? Where did the data go?", all of those sorts of things, that's a challenge today. I'm getting a lot of questions about that these days, you know, how do people go about doing that? It's a problem that's being solved, and it's been solved in part over the years, but now people really have to start paying attention to it and making sure it's part of their overall data management program.
Tobias Macey
Well, thank you very much for taking the time today and sharing your experience and expertise on data vault modeling. It's definitely a very interesting and useful approach to handling data, particularly in the current day and age of having so many different data sources that we have to deal with. So thank you for all of your time and effort on that and I hope you enjoy the rest of your day.
Kent Graziano
All right, thank you very much. Thanks for having me.
Tobias Macey
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.