
Why “Date BETWEEN FromDate AND ToDate” is a dangerous join criteria

I have been meaning to write this blog post for some time and the discussion about Data Vault finally prompted me to do it.

Sometimes, you find yourself in situations where you have to join a table that has a structure like this:
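For illustration, think of a temporal tracker table along these lines (the table and column names here are placeholders; the important part is the FromDate/ToDate validity interval per BusinessKey):

    CREATE TABLE TemporalTracker (
        BusinessKey   INT      NOT NULL
      , FromDate      DATETIME NOT NULL   -- start of the validity interval
      , ToDate        DATETIME NOT NULL   -- end of the validity interval
      , SomeAttribute INT      NOT NULL   -- whatever is being tracked over time
    )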

The join criteria is expressed:
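Typically along these lines, with T being the temporal table and F the table carrying the date being looked up (again, placeholder names):

    ON  F.BusinessKey = T.BusinessKey
    AND F.SomeDate BETWEEN T.FromDate AND T.ToDate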

Or more commonly, this variant with a semi open interval:
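That is, using the same placeholder names:

    ON  F.BusinessKey = T.BusinessKey
    AND F.SomeDate >= T.FromDate
    AND F.SomeDate <  T.ToDate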

Data models that promote these types of joins are very dangerous to relational optimizers and you have to step carefully when executing queries with many of these joins. Let us have a look at why this is so.

Temporal Join

To illustrate the issue with this query pattern, let me create a very simple test setup that you can experiment with. Use this script to generate the two tables:
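The exact script is not what matters; a sketch along these lines will do. The sizes, data types and the fn_nums signature are assumptions here: dbo.fn_nums(@n) is taken to return a single column n with the integers 1 through @n.

    -- Sketch of a setup script; sizes, types and the fn_nums signature are assumptions.

    -- Temporal tracker: 260 business keys x 10 temporal records each = 2600 rows
    CREATE TABLE SmallTable (
        SK          INT      NOT NULL
      , BusinessKey INT      NOT NULL
      , FromDate    DATETIME NOT NULL
      , ToDate      DATETIME NOT NULL
    )

    INSERT INTO SmallTable (SK, BusinessKey, FromDate, ToDate)
    SELECT
        (k.n - 1) * 10 + v.n                     -- surrogate key, one per temporal record
      , k.n                                      -- business key
      , DATEADD(YEAR, v.n - 1, '20000101')       -- interval start
      , DATEADD(YEAR, v.n,     '20000101')       -- interval end (semi open)
    FROM dbo.fn_nums(260) k
    CROSS JOIN dbo.fn_nums(10) v

    -- Fact style table (several hundred MB), no temporal tracking, just a TranDate
    CREATE TABLE BigTable (
        BusinessKey INT      NOT NULL
      , TranDate    DATETIME NOT NULL
      , Measure     INT      NOT NULL
    )

    INSERT INTO BigTable (BusinessKey, TranDate, Measure)
    SELECT
        b.n % 260 + 1                            -- spread rows over the business keys
      , DATEADD(DAY, b.n % 3650, '20000101')     -- lands inside exactly one interval
      , 1
    FROM dbo.fn_nums(10000000) b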

 

(see my utility functions for fn_nums)

SmallTable above is a temporal tracker table with 10 temporal records per BusinessKey. BigTable is about 400 MB in size and does not use temporal tracking, but it has a TranDate column that determines which temporal records in SmallTable match it.

Let us now try to execute a reporting query where we ask for an aggregate over BigTable joining it up to its matching temporal keys in SmallTable.
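Using the placeholder names from the setup sketch above, such a query might look like this (grouping on the temporal surrogate key is an assumption; it yields one aggregate row per temporal record):

    SELECT
        S.SK
      , SUM(B.Measure) AS TotalMeasure
    FROM BigTable B
    INNER JOIN SmallTable S
        ON  B.BusinessKey = S.BusinessKey
        AND B.TranDate >= S.FromDate
        AND B.TranDate <  S.ToDate
    GROUP BY S.SK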

Some quick statistics about this query execution on my laptop:

  • CPU time: 25547 ms
  • Logical I/O operations: 50762  (no physical)
  • Memory Grant: 370 MB
  • Rows Returned: 2600

Nothing overly suspicious yet (though the memory grant seems rather high). Let’s just have a quick look at the query plan:

[Image: query plan for the temporal join, showing estimated vs. actual row counts]

That is a pretty big misestimate, isn’t it? And that is the crux of the issue: it is immensely hard for a query optimizer to accurately predict that the join on the temporal table will lead to one and only one row (unless you have a temporally aware database, of course).

But misestimates are not the full story; there is also a higher CPU cost involved in doing this join. At the CPU instruction level, more work is needed to find the records matching an interval than to do a straight comparison of two values.

Now, you can imagine what happens if you have a data model with very long chains of these joins. As you probably know, query misestimates (and the risk of bad query plans) typically grow exponentially with the number of tables being joined. Having a data model that almost guarantees poor estimates even in a two-join setup can quickly lead to interesting tuning challenges.

Going Kimball Again

There is a very good reason Kimball recommends integer keys for Type 2 dimensions instead of the temporal join you just saw.

Let us change BigTable into a Kimball representation instead by pre-joining like this:
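A sketch of that pre-join, resolving the temporal lookup once at load time and materialising the surrogate key on the big table (the table and column names are placeholders):

    -- Resolve the interval lookup once and store the resulting surrogate key
    SELECT
        S.SK AS SK_Small
      , B.TranDate
      , B.Measure
    INTO BigTableKimball
    FROM BigTable B
    INNER JOIN SmallTable S
        ON  B.BusinessKey = S.BusinessKey
        AND B.TranDate >= S.FromDate
        AND B.TranDate <  S.ToDate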

We can now rewrite the previous aggregate query to this:
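With the surrogate key in place, the aggregate becomes a plain equality join on an integer key (same placeholder names as above):

    SELECT
        S.SK
      , SUM(B.Measure) AS TotalMeasure
    FROM BigTableKimball B
    INNER JOIN SmallTable S
        ON B.SK_Small = S.SK          -- straight compare of two integers
    GROUP BY S.SK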

Let us first have a look at the query plan for the Kimball style join:

[Image: query plan for the Kimball style join, with estimates matching actuals]

Same query plan shape – but look at the difference in estimates vs. actuals! We are spot on here.

Comparing the Kimball style Type 2 join with the temporal join, we get:

Measurement     Temporal Join   Kimball Type 2
CPU Time        25547 ms        10844 ms
Logical I/O     50762           50762
Memory Grant    370 MB          315 MB
Rows Returned   2600            2600
Misestimate     3x              None

Summary

In this blog entry, I have shown you why temporal style joins can be dangerous to query optimizers. While it is not always possible to avoid them, extreme care should be taken if you choose to make them the only way to access your data model.