Data storage options

Data Warehouse vs. Data Lake vs. Data Mart
If you want to store large amounts of data, you need to consider not only the location (on-premise or off-premise) but also the form: data warehouse, data lake and data mart are the three most common data repositories for combining a variety of sources on one platform. In conversation with our data expert Frederic Bauerfeind, we discuss when and why each storage variant is the right choice and how it is integrated into the modern data stack.
A data warehouse is a relational database system for analytical queries. It brings together several, mostly heterogeneous, sources. All the data is stored here in a structured way and can be retrieved for further processing at any time. A data warehouse can collect and combine data on a very large scale. While they used to be hosted on premise, data warehouses are now mostly based on cloud technologies. The volume of data required continues to grow, and cloud-based modern data warehouses can scale storage up or down as needed without depending on a company's own servers.
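To make the idea of "several heterogeneous sources brought together for analytical queries" concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a real warehouse, and the two sources (a CRM export and a web-shop order feed), the table names and all values are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical extracts from two different source systems.
crm_customers = [(1, "Acme GmbH", "DE"), (2, "Globex Ltd", "UK")]
shop_orders = [(101, 1, 250.0), (102, 1, 100.0), (103, 2, 75.5)]


def build_warehouse():
    """Load both sources into one structured, relational store."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
    conn.execute(
        "CREATE TABLE customers ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", crm_customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", shop_orders)
    conn.commit()
    return conn


def revenue_by_country(conn):
    """Analytical query that only works once both sources are combined."""
    rows = conn.execute(
        "SELECT c.country, SUM(o.amount) "
        "FROM orders AS o "
        "JOIN customers AS c ON c.customer_id = o.customer_id "
        "GROUP BY c.country "
        "ORDER BY c.country"
    ).fetchall()
    return dict(rows)


warehouse = build_warehouse()
print(revenue_by_country(warehouse))
```

A production warehouse would add scheduled loading, access rights and far larger volumes, but the principle of joining structured tables from different systems stays the same.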
taod: Frederic, what else distinguishes Modern Data Warehouses?
Frederic Bauerfeind: They are incredibly flexible and can be managed by any member of the data team thanks to no-code components. Important for analysts: business intelligence platforms are very easy to integrate, which gives direct access to the data for creating reports and dashboards. And they are ideal for managing users and rights. Data governance cannot really be managed properly without a data warehouse.
“If the worm is in here, all analyses become rotten.”
The data warehouse is already part of the standard equipment of many companies. How do they know that their chosen solution is working?
I know three typical pain points of companies that are dissatisfied with their data warehouse. First: too little computing power. Data volumes and analysis requirements have changed rapidly in recent years. Technologies that were designed with a simple Excel spreadsheet in mind can no longer cope with the requirements of the new analysis tools. Second: complexity. The more data and source systems are added, the more confusing the warehouse can become. Third: poor data quality. The data sources are not properly integrated, and the entire process chain has become so complex that data quality can no longer be checked holistically. If any of these problems arise, companies should ask themselves whether their current warehouse is still up to date.
Without a modern data warehouse, no modern data stack?
That's right. The data warehouse is the source of truth, the single point of truth for all analysts. If the worm is in here, all analyses become rotten. All associated processes lose their validity and are called into question.
How can companies modernize their warehouse?
Off to the cloud. And then, as always, it depends on the company-specific requirements as to which warehouse is suitable. However, this is easy to find out and test.
Is the data lake for people who don't like cleaning up?
Yes, that too. But first and foremost it is a very practical and fast way to collect and store particularly large volumes of data. Analysts sometimes have much better evaluation options with this raw data than with prestructured data in a warehouse, because they can choose and combine freely.
What is the so-called data swamp all about?
The masses of data can quickly become so large and confusing that the lake mutates into a swamp and users sink into it. That's the data swamp. That is why a data lake is always a good interim solution, especially for huge amounts of data, but a data warehouse should definitely be connected for further structuring and transformation.
That clearly defines the role of the data lake within the modern data stack.
Anyone who collects an enormous amount of data needs a data lake. Collecting is fine to begin with, and there are a few application scenarios for this raw data. But I have yet to meet a company that wouldn't also need structured data and whose analysts don't prefer working with BI tools. The interplay of data lake and data warehouse is therefore beyond question in a modern technology stack.
Then there is also the term data lakehouse. What exactly is that?
Counter-question: I will now give you five animal names. Which of them has a technology behind it: Elk, Ant, Python, Impala, GNU?
As far as I know, all of them?
That's right. There is a technology behind each of these animal names. Many names and terms are simply marketing. And back to our data lakehouse: it's old wine in new bottles. New software makes it possible to aggregate data directly from the data lake without having to copy the data into a warehouse. In some scenarios, this can make sense. However, the basic principle of the data lake remains the same.
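The idea of aggregating directly from the lake without first copying the data into a warehouse can be illustrated with a small sketch. It assumes the "lake" is simply a directory of raw JSON Lines files; the file names, the fields `channel` and `clicks`, and the values are all hypothetical. Real lakehouse engines work on a much larger scale and with specialized file formats, so this only shows the principle.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_sample_lake(root: Path) -> None:
    """Write two hypothetical raw event files into the 'lake' directory."""
    (root / "events_2024_01.jsonl").write_text(
        '{"channel": "web", "clicks": 3}\n{"channel": "app", "clicks": 5}\n'
    )
    (root / "events_2024_02.jsonl").write_text(
        '{"channel": "web", "clicks": 4}\n'
    )


def clicks_by_channel(root: Path) -> dict:
    """Aggregate over the raw files in place, with no database load step."""
    totals = defaultdict(int)
    for path in sorted(root.glob("*.jsonl")):
        with path.open() as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    record = json.loads(line)
                    totals[record["channel"]] += record["clicks"]
    return dict(totals)


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    write_sample_lake(lake)
    result = clicks_by_channel(lake)

print(result)
```

The files stay where they are and stay raw; only the query result is structured, which is why the basic principle of the data lake is unchanged.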
Where are data marts within the Modern Data Stack?
Data marts are modeled and deployed within the data warehouse. Architecturally, the data mart sits directly upstream of the business intelligence tools.
Must-have?
Definitely. Data marts allow data to be ideally clustered and documented for specific user groups, with each mart covering a specific subject area. You could even link and enrich them with data from other source systems, which makes them hybrid data marts. So the design options are really extensive.
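One common way to model a data mart inside a warehouse is as a view that exposes only what one user group needs. The following minimal sketch assumes this view-based approach and again uses sqlite3 as a stand-in; the `sales` table, the view name `mart_sales_revenue` and the figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse

# Warehouse table holding all details, including internal cost figures.
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, amount REAL, cost REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "Widget", 120.0, 80.0),
        ("North", "Gadget", 200.0, 150.0),
        ("South", "Widget", 90.0, 60.0),
    ],
)

# Data mart for a hypothetical sales team: one subject area (revenue per
# region), with the cost column deliberately left out.
conn.execute(
    "CREATE VIEW mart_sales_revenue AS "
    "SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region"
)

# A BI tool would query the mart rather than the underlying table.
mart_rows = conn.execute(
    "SELECT region, revenue FROM mart_sales_revenue ORDER BY region"
).fetchall()
print(mart_rows)
```

Because the mart is defined on top of the warehouse table, it stays consistent with the single source of truth while showing each user group only its own subject area.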
Data warehouse, data lake and data mart are essential for the modern data stack. How do companies configure these stack elements?
These three elements are the linchpin of the modern data stack; I say that very clearly and without pathos. Anyone who already has an existing infrastructure and wants or needs to modernize can go about it just like someone starting without a solid foundation: with a detailed inventory of what exists, the selection and evaluation of possible tools, openness to cloud technologies, and motivation.
Rome wasn't built in a day either?
Not Rome. But certainly the modern data stack.




