Very Flexible

Modeling Dimensions with History Tracked, Generic Attributes

Sometimes, you need to model a database in such a way that you can dynamically extend the model without altering any tables. Perhaps the attributes changes faster than you can add new columns or the data you store has a “ragged” structure that does not lend itself well to being described as a traditional table structure.

The typical approach taken in these cases is to “pivot” the required “flexible columns” of a table into rows instead and dynamically reconstruct the schema at runtime. When you model like this, adding new attributes to the model is simply a question of inserting rows into the database.

As I sure you are aware, there are issues with this approach. Tuning columns into rows  can cause interesting issues for database engines. The flexibility requirement and performance loss to be balanced carefully. In this blog, I will walk you through an example of how to handle this generic case in a star schema.

To make this concrete, let me use an anonymised, real customer example I ran into this week.

Source model

Simplified, the customer has this OLTP model of employees and transactions:

image

Transactions are all linked up to Employees. Because the City and a few other columns of the employee are important to keep track of historically, Employee_History models this concept using the ValidFrom/To column. In addition to City and other static columns, employees can also have a general “bag” of zero or more attributes associated with them that vary over time – this relation is tracked in Employee_Has_Attribute which also needs the ValidFrom/To columns to keep track of the temporal aspect of this relation. For example, an employee could have the attribute “Part Time” associated with him for a year, and another employee have both the “Part Time” attribute and the “Speaks Chinese” set in another time period. Later, the administrator of system may decide that a new attribute “Speaks German” is now allowed and they can simply add that to employees by inserting a row in the Attribute table and adding the historical information in Employee_Has_Attribute.

Whenever you find yourself in a situation where the schema needs this form a generic attribute behaviour – the question you have to ask yourself is: HOW generic? Taking the fully generic route for every single attribute will often lead to poor performance, so it is something to be done with caution. Don’t make the schema generic just because you can. But sometimes, there is no way around it.

In the case, my customer had done their homework and carefully balanced their need for generic behaviour and fixed schema structures. For example making they made a good choice of tracking City as a column and only moving the truly generic attribute into the separate table.

The big question now is: How do you turn this design into a star schema?

Modeling the Star Schema – Tracking the City

My customer wanted to turn the source schema into a fast query star schema Kimball style. Let us walk through the modeling example for this interesting case.

The first principle I apply is always: “Only model what you need”. In this case, what was needed was the history tracking of attributes and the City column. While history may be tracked in other columns in Employee_History too, we will collapse those for now.

The customer had already done the cleaning up of EmployeeKey in the source, using a structure very similar to the maps I recommend. They are nice integers and follow all my rules for being “good keys”.

Step one of the modeling process after these things have been cleared is thus to get rid of the nasty BETWEEN join that is needed to get from Transactions to Employee_History. We use a standard Type2 dimension.

The first step at the model will thus look like this:

image

The ValidFrom/To dates in Dim.Employee_T2 will have to be loaded based in changes tracked in Employee_History. Note that if Employee_History tracks MORE than the City columns, the ValidFrom/To dates may NOT be the same as in the Employee_History source.

We translate the TransactionDate to a surrogate integer keys and link it up to a date dimension (Dim.Date) – standard Kimball practice so far.

Generic Attributes – Potential Designs?

How do we proceed from the T2 dimension to the generic Attribute table? We notice that because an employee can have many attributes, we have to model an  M-M relationships. In Kimball terms: We need a bridge table.

Let’s walk through a few failed attempts to weight the trade-offs.

Disclaimer for the Data Vault modelers out there: I do apologise is advance for the discomfort any chills down your spine may cause you if you see the obvious parallels between these example and how you model a warehouse.

Attempt 1: Use Employee_Has_Attribute directly?
This results in this model:

image

That wont work very well! Why? first of all, we have to do a nasty BETWEEN join, while carrying the TransactionDate over to the final join table. Also notice that the Employee_Has_Attribute table does not have key relationship to Dim.Employee_T2 – this can cause all sorts of interesting plan estimates and poor query plans which I have blogged about previously.

Attempt 2: Bridge directly from Fact to Attribute using TransactionDate
In this model, we de-normalise the EmployeeKey into the fact (recall that we have used a surrogate key to model the City history tracking).

We get this candidate model:

image

This is better than attempt 1, but still suffers from the BETWEEN join. We have also made the fact table wider and wasted space.

Attempt 3 Bridge on DISTINCT TransactionDate, EmployeeKey
We could sore the distinct join result of  Employee_Has_Attribute and their matching dates Fact.Transactions. This results in a model like this:

image

This at least gets rid of the BETWEEN join. Unfortunately, this model also results in the bridge table having a size proportional the cardinality of the PRODUCT of EmployeeKey and TransactionDate – in other words: It grows with the size of the fact table. If we proceed with this model, we will have to do big/big table joins between the bridge and the fact. And we don’t like big/big joins – they are generally not scalable.

It seems we are out of options, there is no good place to build the bridge table without either creating a large bridge or making the join criteria very bad… None of these options look promising.

Unless…

Generic Attributes – Suggested Design

I asked the customer: Does an employee’s attributes change fast or slow? It turns out they change pretty slow, faster than City yes, but still slow. There is another solution in this case.

Based on the data in Employee_Has_Attribute and Employee_History, we can find the DISTINCT UNION of the dates where EITHER the attribute changed or the City changed for a given employee. In a sense, we will count the change in the Attribute relationship as a change in the Employee. If we take the result of this distinct union of all EmployeeKey and FromDate in both the relation and the Employee table, this can be used as the basis for bounding the Dim.Employee_T2 dimension. If an attribute changed, but not the city, this would mean we would have two nearly identical rows in Dim.Employee_T2 (with only the surrogate key being different) – but this new surrogate key could be used to join to a bridge table between Dim.Employee_T2 and Attribute. The final model will look like this:

image

No nasty BETWEEN joins here. In addition, the bridge table size is now proportional with the number of times an employee changes attributes, instead of proportional to the fact size.

Summary

In this blog entry, I have shown you a way to model star schemas on top of sources that have a semi-loose schema.

Sometimes, you need the flexibility that comes with flexible schemas and there is no way around handling it. Just remember, flexibility often comes at the price of performance, especially when you have history tracking turned on. Don’t model flexibly just because you can, if something truly is static or change only infrequently, make use of that information in the model.

I have also shown you an example of a habit that I recommend building: to mentally walk through the join paths a data model force you to take through the data. When you do, you often gain insights into optimisation and alternative ways to structure tables.

  5Comments

  1. manowar77   •  

    Yes, splitting the ranges at a year-level is also exactly what we do (ad DWH and Cube level) in order to be able to partition by year (I’m also evaluating to bring it to the month level for a customer that has billion of rows and where I’m applying this technique)

  2. Davide Mauri   •  

    Hi Thomas! The solution you described is really interesting and is somehow related the “Temporal Snapshot Fact Table” technique I described in some of the speech I delivered in the last year. The latest was delivered as SQL Bits X in London. Since the “temporal problems” are quite common and quite complex to solve, if anyone is interested, the downloadable slide and demo regarding “Temporal Snapshot Fact Table” topic are available here http://sdrv.ms/MINQY0

    • Thomas Kejser   •  

      I saw your session announced at a conference (I think it was the Italian one?) and I wanted to see it but could not. Glad that you could share a copy of the slides.

      It is quite interesting that you have created this solution for the inventory snapshots. About one month ago, I proposed the exact same solution to a customer that store viewing data (“at this point in time, how many people watched this program?”). The structure of that problem is almost exactly the same and the solution with the “all possible data ranges in the date * date matrix) in the M-M join is also almost similar for the SSAS cube. I actually added another twist to it, namely that I split out the year column from the from/to date so that an interval never crosses a year boundary. That way, the M-M join only contains “all interval combinations in a single year” and you can make the join even faster.

      Another way I would like to try and solve this is to model the data in the relational store as a From/To and then have a view on top of the fact tables that joins to the interval table. The cube would have to be in ROLAP mode to avoid the “explosion” of data. That way, I could avoid the M-M join and let the relational source handle the runtime expansion of the interval.

  3. Andrej Kuklin   •  

    >>Sometimes, you need to model a database in such a way that you can dynamically extend the model with altering any tables.

    It should probably be “without altering any tables”

    • Thomas Kejser   •  

      Well spotted Andrej. I have correct. Thank you very much

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">