Zombie vs Sheep

Transforming Source Keys to Real Keys – Part 1: Introducing Map tables

I have ranted enough about the key quality, or lack thereof, in source systems. It is time to look at how we will practically go about transforming the dirty source to the final DW model. What I shall propose is a de-coupling of the source entities from the data warehouse entities. As will be shown, this leads to an interesting ETL pattern that can either address, or properly encapsulate, a variety of source problems.

Because this post is about key generation, I shall limit myself to talk about dimensions or “reference data”. This data typically resides in the small tables in the warehouse – going no higher than 1 Billion rows (for VERY large customer dimensions or IP traffic monitoring).  More commonly these tables range in the 10**3 to 10**7 rows. In fact, I will claim that small dimension/reference tables are the 99% case. I will assume it is reasonably clear to the reader that tables of such miniscule size can be comfortably be extracted and fully loaded into a temporary table in the warehouse staging area. If this should be unclear, or if you object to this assumption – please respond in the comments.

I will also assume that we agree that source systems typically suffer from one or more pathologies as described earlier. And even if they don’t, we must at least assume that sources changes as new companies get acquired, big projects change IT infrastructure completely (SAP anyone?) or other similar events outside the control of the warehouse. We are thus faced with the task of keeping the data warehouse model relatively stable under these changing conditions.

First, I propose that the warehouse has one table for each entity with at least this structure:

For now, please defer thinking about the exact nature of an entity, suffice to say that it is a concept that maps to some interesting object instance that can be uniquely identified. Also note that the table above has ONE row per entity, not one per historical version of it (i.e. it does not behave like a type 2 dimension table)

For each source system that is can be a source of an entity, we shall create a map table with this structure:

The map table is used to map between the row in the source system and the entity in the data warehouse. The entity we map to in the DW is not guaranteed to have a 1-1 relationship with the rows in the source, the map is the structure that encapsulates this discrepancy.

By smartly populating the map table, we can handle source system errors during ETL.

Example Scenario

At this point, I think it is best to take the argument forward with an example. Let us assume we are designing the ETL system for a product table in the warehouse.

The company we work for has two source systems for products: inventory and sales. The sales system uses the SKU as the key for the product, and the SKU is generated by the sales system because they run a big German made system that likes to be in control of these things. The inventory system uses the EAN of the product instead, since they use barcode scanners to keep track of products. For our scenario, we shall assume that the sales system holds the name of the product, the marketing team like to come up with cool sounding brands. The inventory system holds the weight of the product, because the guys down in storage have competitions about who can bench press the largest boxes.

Example Schema

Extracting the sales product table, we end up something like this in the staging area:

And for inventory:


In the data warehouse, we would like to hold the final, conformed products in a table like this:

EK_Product is our key, we don’t trust the sources with that. Notice that EAN and SKU are just columns and not keys, we don’t assume they have values (NULL allowed) or even that they are unique.

For book keeping purposes, I added the CreatedDate, to keep track of when the entity was first observed in the source.

I don’t want to have NULL values in the Name column, so I will use the default of the empty string instead. I like to assign empty strings to text columns that are rendered in the client tools, I find that both users and client tools get easily confused by relational NULL semantics.

For the WeightKg column, we have taken the liberty of adding the unit to the column name. This is generally good practice when this is not clear from the name of the column.

For the inventory source system, I create the map table like this:

And similarly for the sales source:

The data model created now looks like this:


Some things to notice about the data model so far:

  • The EK_Product is not unique in the map tables, this will come in handy when we handle multi keys
  • The source key (EAN or SKU) is not unique in the map either. We will need this to handle zombie and dolly keys
  • We are using a secondary key constraint to enforce the always true statement: “a source key cannot have TWO different meanings at the same point in time”. If it could, it would not be a key, and we have a new problem that cannot be solved by the mapping system.

In part 2 of this blog series, I will look at how we will represent key errors in the maps. But for now, let us have a look at how we get started with populating these tables.

Initial Population of the Maps and Entities – A Foray into Master Data

The map tables will evolve as we learn more about our sources, but to get started with a prototype, we need an initial population of them. Our first question to the business users is this:

“Given that I have two different sources of product keys, which one is the master source?”

If you are in typical large organization, the answer will be “I don’t know”. You have a quest to go on, but don’t spend too much time on it, as you will see it wont matter much. At this point, you make an arbitrary assumption that one of the systems is the master source.

Let us say you chose sales to be the master source (it is after all a German ERP system). You can now do an initial population of the Dim.Products like this:

This generates the values for EK_Product that you need. You can now initialize Maps.Products_Sales like this:

Of course, none of this made the problem go away, you still don’t know how to map SKU to EAN. If your quest for master data was successful, about as likely as finding the holy grail, then you should have no problem supplying your warehouse with a table like this (which you will temporarily create for the prototype):

If you did not find the holy grail of master data, someone is in for some long hours in Excel, creating a table like the above. You may be lucky that the German ERP system contains both the EAN and SKU column, in which case a quick SELECT DISTINCT can do the initial population of this table. In either case, you will most likely find that the below query is very handy to find missing mappings during prototyping:

For now, don’t worry about silly key errors in the sources, we shall correct them later. You don’t need all the data mapped in the Staging.EAN_SKU table, just get a decent start of it. You can now populate the Maps.Products_Inventory like this:

Notice that the above process can be repeated for multiple source systems. For every new system you add, you only need to translate the keys to ANY one of the systems you already know. You are conquering the source systems one at a time, starting with one map table and gradually working your way towards a more complete picture of the world.

Things you may have noticed now:

  • There might be keys in inventory that don’t exist in sales. They may even be valid entities, but we have not yet added them to the Entities.Products table
  • We still haven’t done anything about key errors. So far, we are just preparing some data for an initial, prototype load (and incidentally, building the requirement spec for a master data system in the process).

Summary – Part 1

In this blog, I have introduced the map and entity tables as fundamental part of a good data model for data warehousing. The entity tables are there to provide key stability in the large tables and be the conformed source of reference data (dimensions if you will).

The map tables described  will be used to correct poor source keys and turn them into good entity keys.

I have also described a method for initial population of the map and entity tables. This initial populating is used to drive new questions to the business users and to explore the quality of sources.

In a near-perfect world, the map tables can be prepopulated by a master data management system. In the completely perfect world, sources would all agree on the same keys and the keys would have no pathologies. But we do no live in the ideal world, not even close. So, in the next part of this blog post I shall explore how we can use the maps to correct data errors.