In this blog, I will provide you with some basic information about column stores. Nothing I am writing here is vendor specific IP, but merely taken from the papers published throughout history. One of the best papers that serves as an introduction is by Stonebraker:
- Stonebraker et a, Proceedings of 31st VLDB Conference, 2005: C-Store: “C-Store: A Column-oriented DBMS”
- The idea is older than that though, with the first papers published in the 1970’ies.
Shamefully standing on the shoulders of giants, I will walk you through a simple example which illustrate one of the key principles of column stores: Run Length Encoding (RLE).
The Row Format
Let us first assume we have an imaginary bar, which sells alcohol to a few clients: Thomas, Christian and Alexei. The transactions for their night out look like this in row based format:
This is our “baseline” input data for the example.
A few things to notice already: We COULD have optimized this table “Kimball Style” by replacing the columns with corresponding integer keys. But that would have made the example harder to follow, so for now, just assume that we have chosen a binary representation that is optimal for the keys and which COULD (if we so wished) join out to smaller lookup tables to reconstruct the string values.
Turning rows directly into columns – a bad idea
There is another way to represent the table in row based format. What if we gave each row a unique number and vertically partitioned the table, starting like this:
And proceeding until we got this:
Obviously, this is a very poor implementation of the data.
Side Note: Once we are at this point we COULD proceed towards a 6th Normal form model where each unique entry in each of the vertically partitioned tables is directly linked (through a m-n relation) to the corresponding rows in the Date column (date would then collapse into a single row). This would essentially be what is known as anchor modeling. We could also proceed towards a Data Vault style model and normalise Customer and Product into separate hub/sat tables and link the data to them.
However, if we did this, reconstructing the original row would require a lot of joins. And we don’t like paying join premiums on large tables.
However, there is a way to proceed from here that is not only compressed, but also very CPU/cost efficient.
Turning columns into Run Lengths
Consider again the product column. There is a more compressed way to store it, namely this:
Notice something interesting here. We have NOT normalised the data. For example, there are still 2 x Vodka entries in there. The representation in Product’ is what we call a Run Length Encoding
What we have done is to store how many repetitions exist in the data for a given sort order of the data (namely the order in which we numbered the rows).
Already now dear reader, you can derive your first easy column store optimisation from the above: Don’t store the from/to row numbers, just store the number of repetitions. If we do that, we could potentially store each run length (for example “1-2, Beer”) as only two integers: one for the key of beer (with the character value “beer” put into a dimension table) and another for the number of repetitions.
If we proceed with the un-optimised example, we can collapse the table into this structure:
There are STILL repetitions in the data, the data is NOT being taken to a new normal form here. However, we only store the repetitions that are required for efficient processing of the data when we query it, as we shall see later.
Because the typical case for large tables is that they DO have a lot of repetitions, column stores can often achieve significant compression benefits. The quoted compression rates from different vendors (as compared with the raw format) is typically in the range of 5-50x – depending on what data you store of course.
Restoring rows from Columns
How can we then restore the columns into rows again? We proceed by a form of “pseudo merge join”
Notice that because of the way data is stored (with duplicates added in the right places), this “join” can proceed in a linear, O ( n ), highly parallelised manner, no matter which column we select. This data structure also has the property of making it easy to sequentially access memory. I have described why this is important here: The effect of CPU Caches and Memory Access Patterns.
Side Note: Loops joins are notoriously CPU inefficient (because they crawl memory in a random manner, paying CPU cycles to wait for memory every time a new level of a B-tree is accessed).
The storage formats furthermore allows quick filtering of data by simply jumping over things that don’t have the right values.
A lot of other interesting things can be done to optimise column stores even more – but that is the secret sauce of individual database vendors and an area for fierce competition these days.
I hope this blog has helped you better understand what a column store is and how this storage format assists with fast reconstruction of rows through linear scans of the data.
I hope it is also clear why such a storage format lends itself particularly well to very wide tables where only a subset of the columns are requested for each query (fact tables for example). With the above picture of what a run length encoding is, you should be able to reason more clearly about what is happening inside a column store database.
Category: DW and Big Data