History tracking in warehouses is a controversial discipline. In this post, I will begin to unravel some of the apparent complexities by taking the history tracking problem apart, piece by piece.
Observation: History tracking in an EDW is not the same as archiving the source.
I already blogged about using Hadoop or “dumb databases” for archival of source system data. The purpose of these archives is to avoid the discussion with the business user about losing data. We can always bring back the source if we do it right. I also voiced the opinion that you should not bother with modeling the source, and that it is trivial to automate this archival process in such a way that the source can be restored to any point in time.
Observation: history tracking is part of the business requirements for a warehouse.
We will only track history on data that the end user needs history about. We will not model history just for the sake of modeling it.
Observation: Fact table history is very different from dimension table history.
I hope it is clear that tables containing fact data typically don’t act like dimension/reference tables. I will assert that OLTP designers (the people we get our source data from) have known and agreed upon this simple observation for years: tables behave differently and fall into certain patterns. For example: most OLTP systems have tables that are “add only”, which means we don’t need to track history on them (they are already historized). Another example: by following the well laid out arguments in Kimball’s books on how to model inventory systems, it is also clear that we need new modeling tricks, for example “snapshots”, which are quite distinct from the way we model dimensions.
In other words: just because a fact table has from/to dates does not mean we should use a type 2 dimension pattern to model it.
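To make the distinction concrete, here is a minimal sketch of the “snapshot” fact pattern mentioned above (the data, table shape, and names are invented for illustration): each period gets a full row per product with the running stock level, instead of expiring and splitting rows the way a type 2 dimension does.

```python
from collections import defaultdict

# Hypothetical inventory movements: (day, product, quantity_delta).
movements = [
    (1, "widget", +100),
    (2, "widget", -30),
    (2, "gadget", +50),
    (3, "widget", -20),
]

def periodic_snapshot(movements, last_day):
    """Build one snapshot row per (day, product): the running stock level.

    This is the snapshot fact pattern: history is captured by emitting a
    full row every period, not by maintaining from/to validity intervals.
    """
    by_day = defaultdict(list)
    for day, product, delta in movements:
        by_day[day].append((product, delta))
    products = {p for _, p, _ in movements}
    level = defaultdict(int)
    rows = []
    for day in range(1, last_day + 1):
        for product, delta in by_day.get(day, []):
            level[product] += delta
        for product in sorted(products):
            rows.append((day, product, level[product]))
    return rows

snapshot = periodic_snapshot(movements, last_day=3)
```

Note that the snapshot is derivable entirely from the add-only movement data; no row is ever updated.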
Dimensions, on the other hand, have a property that facts rarely have: the history we model on these tables depends largely on what we are trying to achieve for the business user, and their behavior is not dictated by the behavior of the source.
Kimball has written extensively about modeling fact data, and I mostly agree with his observations and techniques. I am not even sure the following counts as disagreement, but I would like to point out that all the talk about “choosing the grain” generally doesn’t make sense to me: just pick the lowest grain you can get from the source.
After this initial tap-dance, let me turn my attention to the subject I really want to talk about: modeling dimensions and the problem with surrogate keys.
Let me start from a set of commonly perceived problems with Kimball-style warehouses. Imagine you have a model like the one below:
Now extend your mental image to include a situation where we change our mind about the history tracking in the customer dimension. We obtain knowledge (maybe through data cleansing) about the education of our customers. There are two ways to handle this:
1) In the simplest model, we may simply add this information “going forward”. Our old, historical records remain unchanged, but we can analyze future data by education. This is not the problem I am concerned about, and I think this modeling trick has limited use.
2) In the more complex model, we may wish to change history “backwards in time”. This means that the data model may change like this:
It is the complex model that seems to worry the opponents of dimensional modeling for Enterprise Warehousing, especially the transformation indicated by the orange arrow above.
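To illustrate the transformation from Customer to Customer’, here is a sketch (in Python, with invented names and data) of what rebuilding a type 2 dimension looks like when a newly historized attribute is added “backwards in time”: existing rows are split on the change points of the new attribute, and new surrogate keys are minted.

```python
# Hypothetical type 2 customer dimension: one row per validity interval.
customer = [
    # (surrogate_key, customer_id, name, valid_from, valid_to)
    (1, "C1", "Alice", "2010-01-01", "9999-12-31"),
]

# Newly obtained education history for C1 (assumed, for illustration):
education_history = [
    ("C1", "BSc", "2010-01-01", "2014-12-31"),
    ("C1", "MSc", "2015-01-01", "9999-12-31"),
]

def rebuild_with_education(customer, education_history):
    """Split each customer row on education change points, minting new keys."""
    new_rows, next_key = [], 1
    for _, cid, name, vfrom, vto in customer:
        for ecid, edu, efrom, eto in education_history:
            if ecid != cid:
                continue
            # Intersect the two validity intervals (ISO date strings
            # compare correctly as plain strings).
            start, end = max(vfrom, efrom), min(vto, eto)
            if start <= end:
                new_rows.append((next_key, cid, name, edu, start, end))
                next_key += 1
    return new_rows

customer_prime = rebuild_with_education(customer, education_history)
```

The point to notice is that the old surrogate key 1 no longer exists in Customer’; every fact row that referenced it now points at nothing, which is exactly what the objections below are about.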
Here is the crux of the argument against dimensional modelers, as I have understood it from my many discussions with warehouse specialists:
Problem 1: changing your mind about history means that ALL fact tables that reference the Customer dimension must change to reference the new keys in Customer’. For some reason, this is perceived as a large dependency problem that is insurmountable at scale.
Answer 1: if you don’t have a meta model (or naming convention) that allows you to quickly (literally in minutes) analyse the impact of such a change on your database, you are doing something very wrong with your dimensional model. This problem has been solved several times by builders of dimensional enterprise warehouses – I am sure you can solve it too.
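As a sketch of how little machinery such a meta model needs (table and column names below are invented), a simple mapping from fact tables to the dimension keys they carry is enough to answer the impact question instantly; in practice you would derive this mapping from your naming convention or the database catalog:

```python
# A minimal meta model: which fact tables carry which dimension keys.
meta_model = {
    "fact_sales":     ["customer_key", "product_key", "date_key"],
    "fact_returns":   ["customer_key", "date_key"],
    "fact_inventory": ["product_key", "date_key"],
}

def impact_of(dimension_key, meta_model):
    """List every fact table that must be remapped if this key changes."""
    return sorted(t for t, cols in meta_model.items()
                  if dimension_key in cols)

affected = impact_of("customer_key", meta_model)
```

With a consistent naming convention, the `meta_model` dictionary itself can be generated by one query against the catalog, which is why I claim the analysis takes minutes, not weeks.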
Problem 2: changing the history tracking of the dimension requires large update statements to all facts that reference it.
Answer 2: this issue has some merit, but it depends on the implementation of the database you work on. It is true that in a multi-TB system such an update can be painful. But even on an SMP system, I would temper this statement with the comment: “but not as painful as you may think”. This will be the subject of a future post.
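To show the shape of the update in question (a sketch only, with invented keys and data): each fact row’s old surrogate key is joined back to the natural key, then forward to whichever new key is valid on the fact’s date. In SQL this is one large UPDATE with two joins; the logic is:

```python
# Old dimension: old surrogate key -> natural key.
old_dim = {10: "C1"}

# Rebuilt dimension: (new surrogate key, natural key, valid_from, valid_to).
new_dim = [
    (101, "C1", "2010-01-01", "2014-12-31"),
    (102, "C1", "2015-01-01", "9999-12-31"),
]

def remap(fact_rows, old_dim, new_dim):
    """Rewrite each fact's customer key against the rebuilt dimension."""
    out = []
    for old_key, fact_date, amount in fact_rows:
        natural = old_dim[old_key]
        new_key = next(k for k, n, f, t in new_dim
                       if n == natural and f <= fact_date <= t)
        out.append((new_key, fact_date, amount))
    return out

facts = [(10, "2012-06-01", 99.0), (10, "2016-03-01", 42.0)]
remapped = remap(facts, old_dim, new_dim)
```

The expensive part in a real warehouse is not the logic but rewriting the fact rows, which is the implementation question I will return to.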
We should also observe that the impact on fact tables means that the surrogate keys we use in Kimball models are not good keys (as per my previous definition). The reason they are not good keys is that Kimball surrogate keys track two things:
- The uniqueness of the dimension keys
- The history of the entity
When one of these two (the history) is volatile, the entire model becomes volatile. This is not a desirable trait of a large warehouse.
Problem 3: not every business user wants the same version of history tracking, so a single type 2 dimension table is unlikely to do the job.
Answer 3: this is also true for some scenarios, and it hints at a model with multiple versions of each dimension and a new way to link facts and dimensions together. We observe that several warehouses are built where this never happens, so we should be careful not to over-generalize the modeling technique to always take an edge case like this into account.
Here we observe that this additional requirement does not mean that we need to abandon the notion of type 1 columns – we just have an additional requirement which we could call “multiple views of type 2 columns”.
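A sketch of what “multiple views of type 2 columns” could mean in practice (attribute and interval data invented for illustration): keep one fine-grained history, and let each user group derive its own view by collapsing adjacent intervals that agree on the attributes that group cares about.

```python
# One fine-grained history: (durable_key, valid_from, valid_to, city, education).
history = [
    (1, "2010-01-01", "2012-12-31", "Oslo",   "BSc"),
    (1, "2013-01-01", "2014-12-31", "Bergen", "BSc"),
    (1, "2015-01-01", "9999-12-31", "Bergen", "MSc"),
]

def collapse(history, keep):
    """Merge adjacent intervals that agree on the kept attributes."""
    out = []
    for dk, vfrom, vto, city, edu in history:
        attrs = tuple(v for k, v in (("city", city), ("education", edu))
                      if k in keep)
        if out and out[-1][0] == dk and out[-1][3] == attrs:
            dk0, f0, _, a0 = out[-1]
            out[-1] = (dk0, f0, vto, a0)  # extend the previous interval
        else:
            out.append((dk, vfrom, vto, attrs))
    return out

# Two different "type 2 views" over the same base history:
edu_view = collapse(history, keep={"education"})
city_view = collapse(history, keep={"city"})
```

Each view is just a derivation over the base table, so adding a new user community does not require restating the underlying history.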
The three above problems (of which only the last two have any merit) are quoted, in different forms, by the persons I have heard argue that dimensional models are unsuitable for an “Enterprise” Data Warehouse.
The solution, and this baffles me, seems to be: use a 3NF data model instead of Kimball’s model. I think this 3NF approach is a fallacy and I have witnessed the horrors that arise from it. But as I have admitted, some of these arguments DO have merit and should be addressed in the model.
- Good keys, what are they like?
- Data Warehouse reference architecture – Staging and Archive (external link – great related post by Marcel Franke)