I previously made the point that source systems cannot be trusted to generate good keys. In this post, I will explore the pollution you may encounter as you dip your feet into the oil-slicked ocean of data quality.
Let us look at what can go wrong with keys:
Composite keys are perhaps the most common pathology. They occur when a programmer believes it is better to use a composite key than a single-column key. This is rarely motivated by thoughts of sharding or by worries that the 64-bit integer space may run out. Normally, it is driven by nothing but laziness.
Composite keys have several problems:
- They are difficult to use for junior SQL coders, especially in join conditions and IN / NOT IN clauses in filters. Wrong usage may lead to wrong results that go unnoticed.
- They typically perform much worse than single column keys. We care about this in a data warehouse.
- It is difficult to create good statistics on composite keys to serve queries well. This can lead to poor query plans and unpredictable performance.
There ARE cases where composite keys make sense in a source system, but given the trouble they cause, the burden of proof rests on the designer of the composite key to show why it is superior to a single column key. As the ETL developer, you typically just have to accept the status quo.
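To make the point about wrong results that go unnoticed concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` and `shipped` tables and their `(region, order_no)` composite key are hypothetical, invented for illustration. The naive query filters each key column on its own and silently returns nothing; the correlated `NOT EXISTS` compares the whole key.

```python
import sqlite3

# Hypothetical tables with a composite key (region, order_no).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders  (region TEXT, order_no INTEGER, PRIMARY KEY (region, order_no));
    CREATE TABLE shipped (region TEXT, order_no INTEGER, PRIMARY KEY (region, order_no));
    INSERT INTO orders  VALUES ('EU', 1), ('EU', 2), ('US', 1), ('US', 2);
    INSERT INTO shipped VALUES ('EU', 1), ('US', 2);
""")

# Wrong: each key column is tested separately. Every region and every
# order_no appears somewhere in shipped, so no row passes the filter.
naive = conn.execute("""
    SELECT region, order_no FROM orders
    WHERE region   NOT IN (SELECT region   FROM shipped)
       OR order_no NOT IN (SELECT order_no FROM shipped)
    ORDER BY region, order_no
""").fetchall()

# Right: compare the composite key as a unit.
correct = conn.execute("""
    SELECT o.region, o.order_no FROM orders AS o
    WHERE NOT EXISTS (
        SELECT 1 FROM shipped AS s
        WHERE s.region = o.region AND s.order_no = o.order_no
    )
    ORDER BY o.region, o.order_no
""").fetchall()

print("naive:  ", naive)
print("correct:", correct)
```

The naive query runs without error and returns an empty result, which is exactly why this kind of mistake is easy to miss. With a single-column key, a plain `NOT IN` would have been enough.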
Keys that are not unique are mentioned for completeness. Even though good algorithms (both distributed and single machine) for ensuring key uniqueness have existed since the early 1970s, there are still source systems that have “keys” that are not real keys. If your key is not unique, it is not a key!
However, these “keys” are more common than you might think, especially when the source system delivers extracts that are the result of joins (some programmers may throw a DISTINCT in there for good measure).
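A quick way to check whether an extract really has a key is to count rows per key value. The sketch below assumes the extract has been loaded as a list of dictionaries; the column names and sample rows are hypothetical. It also shows why a DISTINCT does not rescue a joined extract: the two rows for customer 1 differ in the joined-in phone column, so they are both distinct and both carry the same "key".

```python
from collections import Counter

def find_duplicate_keys(rows, key_column):
    """Return {key value: row count} for every key value seen more than once."""
    counts = Counter(row[key_column] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical extract produced by joining customers to phone numbers.
extract = [
    {"customer_id": 1, "name": "Ann", "phone": "555-0100"},
    {"customer_id": 1, "name": "Ann", "phone": "555-0101"},
    {"customer_id": 2, "name": "Bob", "phone": "555-0102"},
]

duplicates = find_duplicate_keys(extract, "customer_id")
print(duplicates)
```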
Zombie keys die and then come back to life. Assume an inventory system that uses EAN as a key. When a product is in stock, the product table of the inventory system contains the EAN key. If a product goes out of stock, a cleanup job may run that removes all unused EAN keys. From the perspective of the ETL developer, it looks like the key just died, and we are left with the question of what to do about it. Later, the product comes back in stock and the EAN number is inserted into the inventory product table again: the key has risen from the grave.
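One way to spot keys that behave like this is to compare the key sets of successive extracts. The following is a minimal sketch, assuming you keep the set of keys from each extract in chronological order; the function name and the EAN values are made up for illustration.

```python
def find_reappearing_keys(extracts):
    """extracts: list of key sets, oldest first.

    Returns the keys that were present, then absent from at least one
    extract, then present again.
    """
    seen = set()        # every key observed so far
    missing = set()     # keys seen earlier but absent from the latest extract
    reappeared = set()
    for keys in extracts:
        reappeared |= keys & missing   # was gone, is back now
        seen |= keys
        missing = seen - keys
    return reappeared

# Hypothetical daily extracts of the inventory product table.
daily_extracts = [
    {"5700000000011", "5700000000028"},
    {"5700000000011"},                   # ...028 went out of stock
    {"5700000000011", "5700000000028"},  # ...028 is back
    {"5700000000011", "5700000000035"},
]

print(find_reappearing_keys(daily_extracts))
```

Note that this only tells you a key came back. It says nothing about whether the key still names the same product.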
Unfortunately, zombie keys can be hard to tell apart from…
Dolly keys are named after the cloned sheep, because sheep are stupid and clones are hard to tell apart, even though they are two different entities. Dolly keys happen when a source deletes a key and later re-uses that key for a different purpose and to name a different entity.
Take Social Security numbers as an example. In Denmark, where I have a social security number, this key is in the format YYMMDD-XXXX, with the first six digits being my birthday and the last four digits making the key unique (this, by the way, includes a checksum on the last digit). The quick reader may notice that this leaves only 10,000 keys for people born on a given day (ignoring the shift of centuries). This is fast becoming a problem in Denmark, where we are now considering the use of letters in the keys, or re-using keys. This amounts to a lot of trouble for systems storing social security numbers. When it comes to keys, just use a value that is big enough to hold all instances: the 32 or 64-bit integer space should suffice for all but the most extreme values. 32-bit values were good enough for running the Internet, they are good enough for you!
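Telling a re-used key from a returning one generally requires looking at more than the key itself. As a rough heuristic sketch (not a definitive test), you can compare descriptive attributes of the last row seen before the key disappeared with the row that came back. The choice of `identity_columns` is an assumption you have to make per source, and attributes can also change for legitimate reasons, so a mismatch is only a hint.

```python
def classify_reappearance(old_row, new_row, identity_columns):
    """Compare the attributes that should be stable for one entity.

    Returns ("same entity likely", []) when they all match, otherwise
    ("possible re-use", [columns that differ]).
    """
    changed = [c for c in identity_columns if old_row.get(c) != new_row.get(c)]
    if changed:
        return ("possible re-use", changed)
    return ("same entity likely", [])

# Hypothetical rows: the same key before deletion and after reappearing.
before = {"key": "5700000000028", "name": "Espresso beans 1 kg", "brand": "Acme"}
after_same = {"key": "5700000000028", "name": "Espresso beans 1 kg", "brand": "Acme"}
after_other = {"key": "5700000000028", "name": "Garden hose 20 m", "brand": "Hydro"}

print(classify_reappearance(before, after_same, ["name", "brand"]))
print(classify_reappearance(before, after_other, ["name", "brand"]))
```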
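The arithmetic behind this is short enough to check directly. The sketch below treats the four-digit suffix as giving at most 10,000 values per birthday (an upper bound, since the checksum mentioned above takes some of them away) and assumes roughly 36,525 days in the century that a two-digit year can address.

```python
# Upper bound on keys available per birthday: XXXX runs from 0000 to 9999.
suffixes_per_birthday = 10 ** 4

# Approximate number of distinct birthdays a YYMMDD prefix can express.
days_per_century = 36_525

# Upper bound on the whole YYMMDD-XXXX key space.
natural_key_capacity = suffixes_per_birthday * days_per_century

# Number of distinct values in a 32-bit and a 64-bit integer.
int32_capacity = 2 ** 32
int64_capacity = 2 ** 64

print(f"per birthday:      {suffixes_per_birthday:,}")
print(f"YYMMDD-XXXX total: {natural_key_capacity:,}")
print(f"32-bit integer:    {int32_capacity:,}")
print(f"64-bit integer:    {int64_capacity:,}")
```

The point is not that the total key space is small, but that encoding the birthday into the key caps how many keys any single day can have, while a plain integer has no such per-day limit.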
It happens that a source system has more than one key for the same entity. This can be the result of historical corrections and may represent a valid scenario in the source.
For example, a source may have two rows for the same customer, because the duplicate entry was not discovered at the point of data entry. When the error is discovered, both keys are set to point to the same entity in the source. When we extract the data we are faced with correcting this error in the customer dimension.
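To illustrate the shape of the problem (this is a sketch under my own assumptions, not a recommended ETL design), imagine a map from source keys to a single warehouse-side entity key. The `KeyMap` class, its methods, and the customer keys are all hypothetical. When the source declares two keys to be the same customer, every source key that pointed at the duplicate is repointed to the surviving entity.

```python
class KeyMap:
    """Maps many source keys to one entity key (hypothetical helper)."""

    def __init__(self):
        self._entity_of = {}     # source key -> entity key
        self._next_entity = 1

    def entity_for(self, source_key):
        """Return the entity key for a source key, assigning one if new."""
        if source_key not in self._entity_of:
            self._entity_of[source_key] = self._next_entity
            self._next_entity += 1
        return self._entity_of[source_key]

    def merge(self, keep_key, duplicate_key):
        """Point everything mapped to duplicate_key's entity at keep_key's entity."""
        target = self.entity_for(keep_key)
        old = self.entity_for(duplicate_key)
        for source_key in list(self._entity_of):
            if self._entity_of[source_key] == old:
                self._entity_of[source_key] = target
        return target


key_map = KeyMap()
first = key_map.entity_for("C-1001")    # customer entered once...
second = key_map.entity_for("C-2077")   # ...and again under a second key
key_map.merge("C-1001", "C-2077")       # source later declares them the same

print(first, second, key_map.entity_for("C-2077"))
```

After the merge, both source keys resolve to the same entity, which is what the customer dimension needs. Facts already loaded against the old entity key would still have to be dealt with separately.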
In this post I have provided an overview of typical source system key pathologies. It now rests on my shoulders to describe how to handle these pathologies in the ETL code.