Data Vault Modeling

Data modeling for the agile data warehouse
The requirements for a modern data warehouse are quickly stated: data should be integrated quickly and correctly, and it should be possible to react flexibly even to major changes in content. The Data Vault method created by Dan Linstedt promises to deliver exactly these benefits.
In the past, the debate was mainly between proponents of dimensional modeling and those of normalized modeling. In recent years, however, the Data Vault method has proven better suited to meeting the requirements of a modern DWH.
A look back
Many large data warehouse implementations have traditionally favored modeling in the third normal form, in line with the ideas of Bill Inmon, one of the "forefathers" of the data warehouse concept. The goal of this normalized approach is a comprehensive, company-wide data model for the core data warehouse. Besides the time-consuming initial set-up, however, it is above all the further development of the data model that requires great effort and leads to long project runtimes.
In contrast to Inmon, Ralph Kimball has for more than 20 years advocated a dimensional data model for the core data warehouse. Star or snowflake schemas organize the data into dimension and fact tables. Because this structure is close to the way business users think about their data, both the initial set-up and iterative development can be carried out relatively quickly.
The main criticism of dimensional modeling for the core data warehouse, however, is its lack of robustness against changes in the source systems. Changes in business logic are also problematic, because they usually require extensive and complex modifications to the data model. For these reasons, dimensional modeling is today used primarily at the data mart level.
The Data Vault concept promises to overcome the disadvantages of both traditional modeling methods. Its basic idea is to split the information in the data warehouse in such a way that the data can be integrated and historized easily and the model can be extended without migrating existing tables.
Elements of a Data Vault
In contrast to approaches that store data in the third normal form, Data Vault modeling divides all information relating to a business concept, for example a customer or a product, into three categories and stores it in three corresponding types of database tables.
Hubs
The hub category contains all information that uniquely identifies a business concept, i.e. gives it its identity (for example, the customer number of a customer). Hubs therefore represent the core objects of the business logic. They identify business entities and contain, in addition to the business key, an artificial primary key (surrogate key) as well as technical attributes such as the source system and the load date. A hub can thus be described as a list of unique business keys that serves as the integration point for data from different sources.
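As a minimal illustration, a hub for the customer concept could be defined as follows (the table and column names, and the SQL dialect, are assumptions for this sketch, not prescribed by the method):

```sql
-- Hypothetical hub for the "customer" business concept.
-- It stores only the business key plus technical metadata.
CREATE TABLE hub_customer (
    hub_customer_key BIGINT       NOT NULL,  -- surrogate key
    customer_number  VARCHAR(30)  NOT NULL,  -- business key from the source system
    load_date        TIMESTAMP    NOT NULL,  -- when the key was first loaded
    record_source    VARCHAR(50)  NOT NULL,  -- originating source system
    CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_key),
    CONSTRAINT uq_hub_customer_bk UNIQUE (customer_number)
);
```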
Links
All relationships between business concepts, such as the assignment of a product to a manufacturer, are stored in the link category. Links connect two or more hubs and reference them via their surrogate keys. Relationships in a Data Vault are always modeled as n:m relationships and stored in their own relationship entity, so a later change in the cardinality of a relationship does not require any change to the data model.
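As a sketch under the same assumptions (a hub_product table analogous to hub_customer is presumed to exist), a link between customers and products might look like this:

```sql
-- Hypothetical link connecting the customer and product hubs via their
-- surrogate keys; the relationship is always modeled as n:m.
CREATE TABLE link_customer_product (
    link_customer_product_key BIGINT      NOT NULL,  -- surrogate key of the link
    hub_customer_key          BIGINT      NOT NULL,  -- references hub_customer
    hub_product_key           BIGINT      NOT NULL,  -- references hub_product (assumed to exist)
    load_date                 TIMESTAMP   NOT NULL,
    record_source             VARCHAR(50) NOT NULL,
    CONSTRAINT pk_link_customer_product PRIMARY KEY (link_customer_product_key),
    CONSTRAINT uq_link_customer_product UNIQUE (hub_customer_key, hub_product_key)
);
```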
Satellites
The satellite category contains all attributes that describe a business concept or a relationship (for example, the name or age of a customer). A hub or link can have multiple satellites. Satellites also carry information about the data source and the load date. The load date, together with the reference to the hub (or link), forms the primary key of the table, which makes a complete historization of the records possible.
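Continuing the sketch with the assumed names from above, a satellite with descriptive customer attributes could look as follows; the composite primary key of hub reference and load date is what enables the historization just described:

```sql
-- Hypothetical satellite with descriptive attributes of a customer.
-- Every change arrives as a new row; the hub reference plus the load date
-- forms the primary key, giving a full history of the record.
CREATE TABLE sat_customer_details (
    hub_customer_key BIGINT       NOT NULL,  -- references hub_customer
    load_date        TIMESTAMP    NOT NULL,  -- load date of this version
    record_source    VARCHAR(50)  NOT NULL,  -- originating source system
    customer_name    VARCHAR(100),
    date_of_birth    DATE,
    CONSTRAINT pk_sat_customer_details PRIMARY KEY (hub_customer_key, load_date)
);
```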
Separation between hubs, links, and satellites
Individual entities are always connected via links that point to hubs. Within Data Vault, a direct connection between hubs is not allowed, and a satellite must not contain foreign keys that point to any hub other than its own parent.
The strict separation of hubs, links and satellites and compliance with these rules are essential for the flexible integration of data from multiple source systems and for the extensibility of the data model.
When new data is added to an existing Data Vault model, the existing tables remain unchanged. New attributes for an existing hub are stored in new satellites, and completely new business objects are added as new hubs and connected to the existing model via links. The advantage is that existing loading processes do not have to be retested, as they are not affected by the changes.
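To illustrate such a non-invasive extension with the hypothetical tables from above: attributes delivered by a newly connected source simply go into an additional satellite, while hub_customer and its existing satellites remain untouched.

```sql
-- Hypothetical new satellite for attributes from a newly connected source
-- (e.g. a loyalty program); no existing hub, link or satellite is modified.
CREATE TABLE sat_customer_loyalty (
    hub_customer_key BIGINT      NOT NULL,  -- references hub_customer
    load_date        TIMESTAMP   NOT NULL,
    record_source    VARCHAR(50) NOT NULL,
    loyalty_tier     VARCHAR(20),
    points_balance   INTEGER,
    CONSTRAINT pk_sat_customer_loyalty PRIMARY KEY (hub_customer_key, load_date)
);
```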
ETL Processes for Data Vault
The strict separation between hubs, links, and satellites is essential for the integration of data from multiple sources and the expandability of the data model.
The data loading process itself follows uniform and simple patterns. The corresponding ETL processes consist of key lookups, INSERT statements on hubs, links and satellites, and the associated delta calculations. Tables of the same type are loaded independently of each other, so that execution can be parallelized; a sketch of such a hub load follows the list below.
The entire loading process consists of three steps:
- All hubs are loaded in parallel
- All links and hub satellites are loaded in parallel
- All link satellites are loaded in parallel
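As a rough sketch of the pattern mentioned above, a hub load reduces to a delta check against the existing business keys followed by an insert of the new keys only (PostgreSQL-style syntax; the staging table stg_customer and the sequence seq_hub_customer are assumptions of this example):

```sql
-- Hypothetical hub load: insert only those business keys from staging
-- that are not yet present in the hub. Each hub can be loaded this way
-- independently, which is what allows the parallel execution in step 1.
INSERT INTO hub_customer (hub_customer_key, customer_number, load_date, record_source)
SELECT nextval('seq_hub_customer'),
       stg.customer_number,
       CURRENT_TIMESTAMP,
       'crm_system'
FROM (SELECT DISTINCT customer_number FROM stg_customer) AS stg
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_customer h
    WHERE h.customer_number = stg.customer_number
);
```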
By contrast, extracting data into dimensional data marts is more complex than loading the Data Vault itself, because several hubs, links and satellites often have to be joined in order to populate a single dimension or fact table. Loading type 2 slowly changing dimensions poses an additional challenge: for hubs with multiple satellites, all validity intervals that result from combining the validity periods of the individual satellites must be determined first.
These validity intervals are stored in an additional satellite table, in which each load date is linked to its associated validity period. Once this step is in place, the versions of the individual satellites valid at any point in time can be determined and the dimension tables can be filled correctly.
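One possible way to derive these intervals, sketched here with the hypothetical satellites from above (PostgreSQL-style window function), is to collect the load dates of all satellites of a hub into one timeline; each load date opens an interval that ends where the next one begins:

```sql
-- Combine the change points of all satellites of the customer hub and
-- derive valid_from / valid_to intervals; the last interval stays open
-- (represented here by a far-future date).
WITH change_points AS (
    SELECT hub_customer_key, load_date FROM sat_customer_details
    UNION
    SELECT hub_customer_key, load_date FROM sat_customer_loyalty
)
SELECT hub_customer_key,
       load_date AS valid_from,
       LEAD(load_date, 1, TIMESTAMP '9999-12-31')
           OVER (PARTITION BY hub_customer_key ORDER BY load_date) AS valid_to
FROM change_points;
```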
The ETL processes for the individual structural elements (hubs, links and satellites) otherwise follow the same pattern, so the loading logic of a Data Vault is highly repetitive and can be implemented with suitable generators. This noticeably simplifies adjustments when large data warehouses undergo structural changes. Thanks to the uniform rules for modeling and loading, tables and ETL processes are, in practice, comparatively easy to generate.
Benefits of Data Vault in Practice
Efficient handling of the growing flood of data is becoming a key success factor for companies. The requirements of the business departments, and the underlying data, change within very short periods of time. Large amounts of data from a wide variety of sources, some of them unstructured, must therefore be collected and turned into decision-relevant information for the business departments and management.
It is therefore becoming increasingly difficult for IT departments to react quickly and flexibly to this data growth and to integrate new data sources into existing data warehouses. Data models must adapt to the requirements of big data. As soon as data from social media, customer loyalty cards or smart metering systems is to be integrated, for example, this is only feasible with agile modeling methods.
The Data Vault concept aims to address these developments and to reduce complexity. The method is therefore particularly suitable for data warehouses that frequently require structural extensions and for projects that follow an agile approach from the outset. The goal of modeling with Data Vault is a data warehouse that can be extended efficiently.
Our recent project experience with Data Vault has been largely positive, especially because the approach improves the data modeling process itself. Companies can benefit in the following ways:
- Reducing complexity
- Automate processes
- Faster loading processes
- Historization simplifies traceability
- Scalability enables flexible growth
As a result, Data Vault is particularly recommended for three use cases: for companies that
- need to load very large volumes of data within a short period of time, or
- strive for agile development, or
- want to implement an upstream core data warehouse as part of an existing silo architecture.



