Deep Join

How Vertical Partitioning and Deep Joins Kill Parallelism

Query plans with deep join trees are often the result of high levels of normalisation in the data model. There are large advantages to normalising data, as it minimises the amount of data that must be written when a change happens. In traditional OLTP systems, this can be a boon.

However, normalisation is not without its costs – especially not in read intensive workloads like data warehouses.

It is time we turn to the issue of true vertical partitioning and deep join trees. In a previous post I described how column stores achieve very high brute force scans and filters and how they make use of the underlying sort order of data to achieve high compression.

Let us have a look at how normalised data and narrow tables fare with regards to parallelism.

Test Case

For this test, I will use the TPC-H dataset. This old warehouse benchmark is characterised by a normalised structure representing sales transactions. In an OLTP system, this model might be a good idea (it is crazy for data warehouses), so let us treat it as OLTP data for this purpose. This test case also represents a “best case” scenario for normalisation.

The query we will look at is a SELECT statement that fetches a single order from the database:
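The original statement is not preserved in this copy; a representative sketch against the TPC-H schema (the column list and the literal order key are my own choices) would look something like this:

```sql
-- Sketch: fetch a single order with its line items, customer and parts.
-- The order key 42 and the selected columns are placeholders.
SELECT O.O_ORDERKEY
     , C.C_NAME
     , P.P_NAME
     , L.L_QUANTITY
     , L.L_EXTENDEDPRICE
FROM LINEITEM AS L
JOIN ORDERS   AS O ON O.O_ORDERKEY = L.L_ORDERKEY
JOIN CUSTOMER AS C ON C.C_CUSTKEY  = O.O_CUSTKEY
JOIN PART     AS P ON P.P_PARTKEY  = L.L_PARTKEY
WHERE L.L_ORDERKEY = 42;
```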

A quick look at the query plan for this reveals the expected loop join tree:


This is actually a relatively small join tree, but it is enough to illustrate the point. Let us reason a bit about this query shape.

Initial Observations

Notice something interesting about the join tree? Why is it a “bushy” tree? Why are we not getting this “right only” plan shape instead:


To answer this, we must consider the ordering that has to happen to execute the query:

1) Find the relevant row in LINEITEM with index seek
2) Find the matching row in ORDERS with index seek using L_ORDERKEY
3) Find the matching row in CUSTOMER with index seek using O_CUSTKEY
4) Find the matching row in PART with index seek using L_PARTKEY

1 and 2 can happen in parallel, since LINEITEM and ORDERS share the same key, which is given by the query. But, and this is important: we cannot execute step 3 until we have found the relevant O_CUSTKEY in ORDERS. Similarly, we cannot execute step 4 until we have found the relevant L_PARTKEY in LINEITEM.

A visualisation may explain this better. This is the execution order enforced by the normalisation of the data.


Notice that LINEITEM and ORDERS can ONLY be queried in parallel because they share the key we are filtering on. If we made the query even a little more complex, this parallelism would go away too. The query optimiser does the best it can under these conditions and makes a bushy tree, as we saw.

Yet no matter how you put it, enforcing ordering of work is bad news for parallelism. Let us see how this expresses itself when we execute the query.

Recall that we have to crawl a B-tree from the top to the leaf to find a single row when we loop join.

Side Note: You may also want to consider that when we use temporal modelling with many From/To dates, loop joins are often the only way to ensure a good plan search space (see comments on my blog about the Dangers of BETWEEN joins).

Crawling B-trees is expensive from a CPU perspective. Every time we fetch a non-leaf node in the tree, we have to parse the tree specific data there. Not until we have decoded the non-leaf node can we fetch its child node. In my blog about Level 2 cache effects, I showed you how this random fetch will cost you at least 50 CPU cycles (more on a NUMA system).

It used to be the case that this memory crawling would be drowned out in the time it takes to read the disk when you hit the leaf node. But with modern NAND devices, the time to seek the tree is becoming significant.

Quantify the Tree Climbing

We will use SQL Server for this case study, but the insight about decreased parallelism applies equally to all databases.

Unfortunately, SQL Server does not allow us to catch each tree crawl event on a page-by-page basis. But we can infer the tree crawl time by measuring the time between lock events. Lock events are exposed in XEvents as the lock_acquired and lock_released events. If we execute the test case SELECT statement under the SERIALIZABLE isolation level, we can see when each leaf-level lock (which is required to enforce this isolation) is acquired.
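A minimal XEvents session along these lines (the session name, file name and the session_id filter are my own; the filter must match your test connection) could look like this:

```sql
-- Sketch: trace lock events for one session to an event file.
-- Session name, file name and session_id (53) are assumptions.
CREATE EVENT SESSION LockTrace ON SERVER
ADD EVENT sqlserver.lock_acquired
    (WHERE (sqlserver.session_id = 53)),
ADD EVENT sqlserver.lock_released
    (WHERE (sqlserver.session_id = 53))
ADD TARGET package0.event_file (SET filename = N'LockTrace.xel');
GO
ALTER EVENT SESSION LockTrace ON SERVER STATE = START;
GO
-- On the traced session, run the test SELECT under SERIALIZABLE:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
```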

At this speed, measuring wall-clock time is not granular enough to quantify the time taken. Instead, we rely on CPU clock ticks, which are exposed by the XEvent trace framework.

Here is the output of executing the SQL Statement:


To decode the exact table names, you will have to use the undocumented %%lockres%% column. This allows you to translate the hash values for the lock (the second-to-last column in the trace) back into key values. We can get the hash codes for each locked KEY entry by running this statement:
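The statement itself is lost from this copy, but the general %%lockres%% pattern (the table and the filter key are placeholders) is:

```sql
-- Sketch: expose the lock resource hash for each key in LINEITEM.
-- These hash values line up with the resource column in the XEvent trace.
SELECT %%lockres%% AS lock_resource
     , L_ORDERKEY
     , L_LINENUMBER
FROM LINEITEM WITH (NOLOCK)
WHERE L_ORDERKEY = 42;   -- placeholder key
```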


Using the XEvent trace and the lock decoding, we can now create this table:


As you can see, a significant number of CPU cycles is spent between each request in the loop join path. Of course, some of this is work required to parse the rows and acquire the lock. But a lot of it is also wait time for memory fetches in the tree.

Now, imagine this same situation in a scale-out system where the rows you need are no longer in local memory but have to be ferried across a network, adding even MORE latency to the fetch. It should be clear that dependencies like these are not good for you. In contrast, you can get away with horizontally partitioning the data (as the loops now become independent of each other) – but vertical partitioning really kills your concurrency.


In this blog, I hope I have made it clearer why large join trees and vertical partitioning build dependencies into the data model which can significantly reduce parallelism. Normalisation has many merits when it comes to minimising the amount of writes required to keep data up to date. But if you are after high speed reading (as you are in a data warehouse), the price of reconstructing the row from many joins becomes prohibitive.

Also, note that in an OLTP system that needs very low latency on reads, having deep join trees sets an upper bound on how fast you can go.

As with all high scale designs, the key to concurrency is to eliminate shared resources and to remove dependencies between threads in the execution path. High normalisation is NOT the way forward here because it CREATES dependencies instead of eliminating them. There is no way around good old Amdahl’s Law.
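To put a number on that law: if a fraction s of the work is inherently serial, N threads can never speed you up by more than

```
Speedup(N) ≤ 1 / (s + (1 − s) / N)
```

As a worked example of my own (the 20% figure is illustrative, not measured): if the dependency chain in the join tree serialises just 20% of the work (s = 0.2), then 32 cores buy you at most 1 / (0.2 + 0.8/32) ≈ 4.4x, no matter how fast each individual seek is.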