Why Integer Keys are the Right Choice!

This subject has come up a lot during my years of consulting. Source systems, traditionally composed of shoddy code and poor data modeling, will often use “natural keys”. Such keys can be of any data type: strings (CHAR/VARCHAR), DECIMAL, or other types that are inefficient to compare. It is my claim that keys in a warehouse should always be integers and that it is not that difficult to implement proper keys. In this post, I will show you the price of using non-integer keys and let you be the judge of how much hardware you are willing to waste on lazy key choices.

There are several reasons quoted for using non-integer keys in source systems:

One is the use of “intelligent keys”, keys containing some form of information about the entity they represent. I will give this choice the proper beating it deserves in a future post.

Another reason I hear is that character keys are “good enough”. A convenient choice has been made that makes the key easy to render. For example, the key may be stored as a zero-padded string of digits instead of using a proper integer type. Putting rendering choices into the database has its own problems, of course – but if you read data off most ERP systems, chances are you are reading zero-padded integers turned into strings.
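As a concrete illustration (hypothetical, not taken from any particular ERP system), this is the kind of rendering I mean – an integer key turned into a zero-padded string:

```sql
-- Rendering an integer key as a zero-padded string (SQL Server syntax).
-- The width of 8 is an arbitrary illustrative choice.
SELECT RIGHT('00000000' + CAST(42 AS varchar(8)), 8) AS padded_key;  -- '00000042'
```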

Yet another reason quoted is that the keys arrive in the warehouse from a master data management system. Why a master data system would ever spit out a non-integer key is a mystery to me – for reasons that I hope become obvious in my next post. But I have seen it happen: master data systems generating character keys.

Believe it or not, sometimes the reason quoted for using non-integer keys is that the key is generated as a string concatenation of several columns to make it unique.

From a best practice perspective, Kimball recommends using integer surrogate keys for many of the same reasons I will use. But this guidance is often ignored, and source system keys are loaded directly into fact tables or other warehouse models to “save time” on development.

The basic fact, which I will now proceed to prove to you, is that non-integer keys are a waste of CPU cycles! CPUs deal better with integers than string types. But just how bad is it? Let me illustrate with an example. Let us consider three competing design choices for a fact and dimension table:

  • Using a 4-byte integer key as per Kimball/old style data modeling traditions
  • Using a character key with some basic case-insensitive intelligence built in. To make a fair comparison, we shall use a CHAR(4), which takes up the same space as the integer key.
  • Using a character key as above, but where we are a bit more cunning about the collation choice. To improve join speed, we will use a binary collation (which makes string comparisons faster).
The following code generates 64K-row dimension tables representing these three choices. It also generates three matching 65M-row fact tables:
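The original script does not appear here, so the following is a sketch of what such a setup might look like on SQL Server. The table names, the Latin1_General collations, the hex-based key rendering, and the population logic are all my illustrative assumptions, not the author's code (FORMAT requires SQL Server 2012 or later):

```sql
-- Sketch only (SQL Server syntax); names and collations are illustrative.

-- Choice 1: 4-byte integer key
CREATE TABLE DimInt  (SK int NOT NULL PRIMARY KEY, Attribute int NOT NULL);
CREATE TABLE FactInt (SK int NOT NULL, Measure int NOT NULL);

-- Choice 2: CHAR(4) key with a case-insensitive, accent-sensitive collation
CREATE TABLE DimChar  (SK char(4) COLLATE Latin1_General_CI_AS NOT NULL PRIMARY KEY,
                       Attribute int NOT NULL);
CREATE TABLE FactChar (SK char(4) COLLATE Latin1_General_CI_AS NOT NULL,
                       Measure int NOT NULL);

-- Choice 3: CHAR(4) key with a binary collation
CREATE TABLE DimCharBin  (SK char(4) COLLATE Latin1_General_BIN NOT NULL PRIMARY KEY,
                          Attribute int NOT NULL);
CREATE TABLE FactCharBin (SK char(4) COLLATE Latin1_General_BIN NOT NULL,
                          Measure int NOT NULL);

-- 64K dimension rows; the four-character hex rendering of 0..65535
-- keeps the CHAR(4) keys unique
;WITH N AS (
  SELECT TOP (65536) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
  FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO DimInt (SK, Attribute) SELECT n, n % 100 FROM N;

INSERT INTO DimChar    (SK, Attribute) SELECT FORMAT(SK, 'X4'), Attribute FROM DimInt;
INSERT INTO DimCharBin (SK, Attribute) SELECT FORMAT(SK, 'X4'), Attribute FROM DimInt;

-- ~65M fact rows: every dimension key repeated 1000 times
;WITH R AS (SELECT TOP (1000) 1 AS x FROM sys.all_objects)
INSERT INTO FactInt     (SK, Measure) SELECT d.SK, 1 FROM DimInt d     CROSS JOIN R;
INSERT INTO FactChar    (SK, Measure) SELECT d.SK, 1 FROM DimChar d    CROSS JOIN R;
INSERT INTO FactCharBin (SK, Measure) SELECT d.SK, 1 FROM DimCharBin d CROSS JOIN R;
```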

Now, we can run our typical data warehouse query: a join of a dimension with the fact, grouping on an attribute:
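A sketch of the kind of query meant, assuming hypothetical tables FactInt and DimInt with a surrogate key SK, an attribute column Attribute, and a measure column Measure (all names are illustrative):

```sql
-- Typical star-join: aggregate the fact table, grouped on a dimension attribute.
-- FactInt, DimInt, SK, Attribute and Measure are hypothetical names.
SELECT d.Attribute, SUM(f.Measure) AS Total
FROM FactInt AS f
JOIN DimInt  AS d ON d.SK = f.SK
GROUP BY d.Attribute;
```

The same query shape is then run once per key choice, swapping in the CHAR(4) tables.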

Running on my laptop with a warm cache (taking I/O out of the equation with zero physical reads) gives me this result:

Key Choice                                            CPU Time   Clock Time   Logical Reads (Fact / Dim)
4-byte integer                                        ~7 sec     ~7 sec       13770 / 140
CHAR(4), case-insensitive, accent-sensitive collation ~12 sec    ~12 sec      13700 / 140
CHAR(4), binary collation                             ~8.5 sec   ~8.5 sec     13770 / 140

This should make it pretty clear that it is worth using an integer key to save yourself some money on hardware and to spare your users the frustration of waiting for joins to return. Also note that the above effect will become even more pronounced as you add more dimensions to the fact table.

If you are stuck with character keys, the results also show that you will most likely benefit from using a binary collation (if you can live with the semantics implied by that choice).