Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations or massive up-front design to answer valuable questions. In this episode Kent Graziano shares his journey with Data Vault, describes how it enables an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow-moving projects, or challenges integrating new data sources, then listen in on this conversation and give Data Vault a try for yourself.
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape
- How did you get involved in the area of data management?
- Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?
- What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?
- How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?
- What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or outright project failure?
- What are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses?
- How has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?
- Is there any utility in data vault modeling in a data lake context (S3, Hadoop, etc.)?
- What are the steps for establishing and evolving a data vault model in an organization?
- How does that approach scale from one to many data sources and their varying lifecycles of schema changes and data loading?
- What are some of the changes in query structure that consumers of the model will need to plan for?
- Are there any performance or complexity impacts imposed by the data vault approach?
- Can you talk through the overall lifecycle of data in a data vault modeled warehouse?
- How does that compare to approaches such as audit/history tables in transaction databases or slowly changing dimensions in a star or snowflake model?
- What are some cases where a data vault approach doesn’t fit the needs of an organization or application?
- For listeners who want to learn more, what are some references or exercises that you recommend?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Data Vault Modeling
- Data Warrior Blog
- OLTP == On-Line Transaction Processing
- Data Warehouse
- Bill Inmon
- Claudia Imhoff
- Oracle DB
- Third Normal Form
- Star Schema
- Snowflake Schema
- Relational Theory
- Sixth Normal Form
- Pivot Table
- Dan Linstedt
- Ralph Kimball
- Agile Manifesto
- Schema On Read
- Data Lake
- Data Vault Conference
- ODS (Operational Data Store) Model
- Supercharge Your Data Warehouse (affiliate link)
- Building A Scalable Data Warehouse With Data Vault 2.0 (affiliate link)
- Data Model Resource Book (affiliate link)
- Data Warehouse Toolkit (affiliate link)
- Building The Data Warehouse (affiliate link)
- Dan Linstedt Blog
- Performance G2
- Scale Free European Classes
- Certus Australian Classes
- Data Vault Builder
- Varigence BimlFlex