Data Vault Modeling

Data modeling for the agile data warehouse
The requirements for a modern data warehouse are quickly stated: data should be integrated quickly and correctly, and it should be possible to react flexibly even to major changes in content. The Data Vault method created by Dan Linstedt promises to deliver exactly these benefits.
In the past, the debate was mainly between proponents of dimensional modeling and those of normalized modeling. In recent years, however, the Data Vault method has proven better suited to meeting the requirements of a modern DWH.
A look back
Many large data warehouse implementations have traditionally favored modeling in the third normal form, in line with the ideas of Bill Inmon, one of the "forefathers" of the data warehouse concept. The goal of this normalized approach is a comprehensive, company-wide data model for the core data warehouse. Besides the time-consuming initial set-up, however, it is above all the further development of the data model that requires great effort and leads to long project runtimes.
In contrast to Inmon, Ralph Kimball has for more than 20 years advocated a dimensional data model for the core data warehouse. Star or snowflake schemas organize the data into dimension and fact tables. Because this structure is close to the way business users think about their data, both the initial set-up and iterative development can be carried out relatively quickly.
The main criticism of dimensional modeling for the core data warehouse, however, is its lack of robustness against changes in the source systems. Changes in business logic are also problematic, because they usually require extensive and complex modifications to the data model. For these reasons, dimensional modeling is today used primarily at the data mart level.
The Data Vault concept promises to overcome the disadvantages of both traditional modeling methods. Its basic idea is to split the information in the data warehouse in such a way that the data can be integrated and historized easily and the model can be extended without migrating existing tables.
Elements of a Data Vault
In contrast to approaches that store data in the third normal form, Data Vault modeling divides all information relating to a business concept, for example a customer or a product, into three categories and stores it in three corresponding types of database tables.
Hubs
The hub category contains all information that uniquely identifies a business concept, i.e. gives it its identity (for example, the customer number of a customer). Hubs therefore represent the core objects of the business logic. They identify business entities and contain, in addition to the business key, an artificial primary key (surrogate key) as well as technical attributes such as the source system and the load date. A hub can thus be described as a list of unique business keys that serves as the integration point for data from different sources.
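As a minimal illustration, a hub for the customer concept could be defined as follows (the table and column names, and the SQL dialect, are assumptions for this sketch, not prescribed by the method):

```sql
-- Hypothetical hub for the "customer" business concept.
-- It stores only the business key plus technical metadata.
CREATE TABLE hub_customer (
    hub_customer_key BIGINT       NOT NULL,  -- surrogate key
    customer_number  VARCHAR(30)  NOT NULL,  -- business key from the source system
    load_date        TIMESTAMP    NOT NULL,  -- when the key was first loaded
    record_source    VARCHAR(50)  NOT NULL,  -- originating source system
    CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_key),
    CONSTRAINT uq_hub_customer_bk UNIQUE (customer_number)
);
```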
Links
All relationships between business concepts, such as the assignment of a product to a manufacturer, are stored in the link category. Links connect two or more hubs and reference them via their surrogate keys. Relationships in a Data Vault are always modeled as n:m relationships and stored in their own relationship entity, so a later change in the cardinality of a relationship does not require any change to the data model.
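As a sketch under the same assumptions (a hub_product table analogous to hub_customer is presumed to exist), a link between customers and products might look like this:

```sql
-- Hypothetical link connecting the customer and product hubs via their
-- surrogate keys; the relationship is always modeled as n:m.
CREATE TABLE link_customer_product (
    link_customer_product_key BIGINT      NOT NULL,  -- surrogate key of the link
    hub_customer_key          BIGINT      NOT NULL,  -- references hub_customer
    hub_product_key           BIGINT      NOT NULL,  -- references hub_product (assumed to exist)
    load_date                 TIMESTAMP   NOT NULL,
    record_source             VARCHAR(50) NOT NULL,
    CONSTRAINT pk_link_customer_product PRIMARY KEY (link_customer_product_key),
    CONSTRAINT uq_link_customer_product UNIQUE (hub_customer_key, hub_product_key)
);
```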
Satellites
The satellite category contains all attributes that describe a business concept or a relationship (for example, the name or age of a customer). A hub or link can have multiple satellites. Satellites also carry information about the data source and the load date. The load date, together with the reference to the hub (or link), forms the primary key of the table, which makes a complete historization of the records possible.
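Continuing the sketch with the assumed names from above, a satellite with descriptive customer attributes could look as follows; the composite primary key of hub reference and load date is what enables the historization just described:

```sql
-- Hypothetical satellite with descriptive attributes of a customer.
-- Every change arrives as a new row; the hub reference plus the load date
-- forms the primary key, giving a full history of the record.
CREATE TABLE sat_customer_details (
    hub_customer_key BIGINT       NOT NULL,  -- references hub_customer
    load_date        TIMESTAMP    NOT NULL,  -- load date of this version
    record_source    VARCHAR(50)  NOT NULL,  -- originating source system
    customer_name    VARCHAR(100),
    date_of_birth    DATE,
    CONSTRAINT pk_sat_customer_details PRIMARY KEY (hub_customer_key, load_date)
);
```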
Separation between hubs, links, and satellites
Individual entities are always connected via links that point to hubs. Within Data Vault, a direct connection between hubs is not allowed, and a satellite must not contain foreign keys that point to any hub other than its own parent.
The strict separation of hubs, links and satellites and compliance with these rules are essential for the flexible integration of data from multiple source systems and for the extensibility of the data model.
When new data is added to an existing Data Vault model, the existing tables remain unchanged. New attributes for an existing hub are stored in new satellites, and completely new business objects are added as new hubs and connected to the existing model via links. The advantage is that existing loading processes do not have to be retested, as they are not affected by the changes.
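To illustrate such a non-invasive extension with the hypothetical tables from above: attributes delivered by a newly connected source simply go into an additional satellite, while hub_customer and its existing satellites remain untouched.

```sql
-- Hypothetical new satellite for attributes from a newly connected source
-- (e.g. a loyalty program); no existing hub, link or satellite is modified.
CREATE TABLE sat_customer_loyalty (
    hub_customer_key BIGINT      NOT NULL,  -- references hub_customer
    load_date        TIMESTAMP   NOT NULL,
    record_source    VARCHAR(50) NOT NULL,
    loyalty_tier     VARCHAR(20),
    points_balance   INTEGER,
    CONSTRAINT pk_sat_customer_loyalty PRIMARY KEY (hub_customer_key, load_date)
);
```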
ETL Processes for Data Vault
The strict separation between hubs, links, and satellites is essential for the integration of data from multiple sources and the expandability of the data model.
The data loading process itself follows uniform and simple patterns. The corresponding ETL processes consist of key lookups, INSERT statements on hubs, links and satellites, and the associated delta calculations. Tables of the same type are loaded independently of each other, so that execution can be parallelized; a sketch of such a hub load follows the list below.
The entire loading process consists of three steps:
- All hubs are loaded in parallel
- All links and hub satellites are loaded in parallel
- All link satellites are loaded in parallel
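As a rough sketch of the pattern mentioned above, a hub load reduces to a delta check against the existing business keys followed by an insert of the new keys only (PostgreSQL-style syntax; the staging table stg_customer and the sequence seq_hub_customer are assumptions of this example):

```sql
-- Hypothetical hub load: insert only those business keys from staging
-- that are not yet present in the hub. Each hub can be loaded this way
-- independently, which is what allows the parallel execution in step 1.
INSERT INTO hub_customer (hub_customer_key, customer_number, load_date, record_source)
SELECT nextval('seq_hub_customer'),
       stg.customer_number,
       CURRENT_TIMESTAMP,
       'crm_system'
FROM (SELECT DISTINCT customer_number FROM stg_customer) AS stg
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_customer h
    WHERE h.customer_number = stg.customer_number
);
```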
By contrast, extracting data into dimensional data marts is more complex than loading the Data Vault itself, because several hubs, links and satellites often have to be joined in order to populate a single dimension or fact table. Loading type 2 slowly changing dimensions poses an additional challenge: for hubs with multiple satellites, all validity intervals that result from combining the validity periods of the individual satellites must be determined first.
These validity intervals are stored in an additional satellite table, in which each load date is linked to its associated validity period. Once this step is in place, the versions of the individual satellites valid at any point in time can be determined and the dimension tables can be filled correctly.
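One possible way to derive these intervals, sketched here with the hypothetical satellites from above (PostgreSQL-style window function), is to collect the load dates of all satellites of a hub into one timeline; each load date opens an interval that ends where the next one begins:

```sql
-- Combine the change points of all satellites of the customer hub and
-- derive valid_from / valid_to intervals; the last interval stays open
-- (represented here by a far-future date).
WITH change_points AS (
    SELECT hub_customer_key, load_date FROM sat_customer_details
    UNION
    SELECT hub_customer_key, load_date FROM sat_customer_loyalty
)
SELECT hub_customer_key,
       load_date AS valid_from,
       LEAD(load_date, 1, TIMESTAMP '9999-12-31')
           OVER (PARTITION BY hub_customer_key ORDER BY load_date) AS valid_to
FROM change_points;
```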
The ETL processes for the individual structural elements (hubs, links and satellites) otherwise follow the same pattern, so the loading logic of a Data Vault is highly repetitive and can be implemented with suitable generators. This noticeably simplifies adjustments when large data warehouses undergo structural changes. Thanks to the uniform rules for modeling and loading, tables and ETL processes are, in practice, comparatively easy to generate.
Benefits of Data Vault in Practice
Efficient handling of the growing flood of data is becoming a key success factor for companies. The requirements of the business departments, and the underlying data, change within very short periods of time. Large amounts of data from a wide variety of sources, some of them unstructured, must therefore be collected and turned into decision-relevant information for the business departments and management.
It is therefore becoming increasingly difficult for IT departments to react quickly and flexibly to this data growth and to integrate new data sources into existing data warehouses. Data models must adapt to the requirements of big data. As soon as data from social media, customer loyalty cards or smart metering systems is to be integrated, for example, this is only feasible with agile modeling methods.
The Data Vault concept aims to address these developments and to reduce complexity. The method is therefore particularly suitable for data warehouses that frequently require structural extensions and for projects that follow an agile approach from the outset. The goal of modeling with Data Vault is a data warehouse that can be extended efficiently.
Our recent project experience with Data Vault has been largely positive, especially because the approach improves the data modeling process itself. Companies can benefit in the following ways:
- Reducing complexity
- Automate processes
- Faster loading processes
- Historization simplifies traceability
- Scalability enables flexible growth
As a result, Data Vault is particularly recommended for three use cases: for companies that
- need to load very large volumes of data within a short period of time, or
- strive for agile development, or
- want to implement an upstream core data warehouse as part of an existing silo architecture.



