As hinted at in my post about how column stores work, the compression you achieve will depend on how many repetitions (or “runs”) exist in the data for the sort order that is used when the index is built. In this post, I will provide you more background on the effect of sort order on compression and give you some heuristics you can use to save space and increase speed of column stores. This is a theory that I am still only developing, but the early results are promising.
To approach my theory, I am afraid we are going to have to dig into some concepts from information theory.
Compression works by eliminating superfluous information in a stream of data. Unfortunately, it is rather hard to grasp exactly what information is superfluous – as this depends not only on the representation of the data, but also on how much context is shared between the sender and receiver.
The best way I have found to describe this problem is the story about the author who, after having sent his new hit novel to the publisher for publication, went on vacation. After a week, the author sends the following postcard to the publisher from his vacation resort:
The publisher, without hesitation replies:
We all know what this information exchange conveys, because we know the context of the publisher’s and author’s conversation. Very little, if any, information is lost even though the amount of data transmitted is as low as it gets (a single byte).
Unfortunately, we are rarely in a situation where the shared information between sender and receiver is as large as this. We have to find a different way to quantify how much data can be thrown away while preserving the information content.
Side note: The above story is not my own but borrowed from somewhere. Unfortunately, I fail to recall where I first read about it (it may have been Steven Pinker). If you know, I would appreciate if you could let me know the source so I can give proper credit.
Low Entropy ≈ High Compression
Formalising the information content of a data transmission turns out to be a hard concept to wrap your head around. Fortunately, this has not prevented both computer scientists, mathematicians and statisticians from making progress in this area.
Claude Shannon, working on signal compression first came up with the notion of Information Entropy. You can read more in articles I will link throughout this blog. They describe the theory more elaborately than I can.
Roughly speaking the Entropy H(X) of a random variable, for our purposes the content of a database column, is the amount of uncertainty associated with given value of that column. Entropy is typically measured in bits. If the entropy of a column is low, we would expect that there are either very few values in the column or that they are unevenly distributed somehow. The maximum entropy of a column is achieved on keys and it goes down after after that.
Entropy of a column, X, is calculated as:
Where p(xi) is the probability of each value the column takes. We can calculate all these probabilities with SQL like this:
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) pX
GROUP BY X
And we can get H(X) like this:
SELECT -1 * SUM(pX * LOG(pX, 2)) AS H
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) pX
GROUP BY X
) AS p
Think of H(X) as an upper boundary for how much we can theoretically compress the column. This should also tell you why you cannot tell in advanced how well a compression algorithm will work until you have actually seen or fully described the data – anyone who claims otherwise are bending the truth.
In summary we can use the entropy as a rough estimate of how many bits it would take to fully describe the column and it also roughly correlates to the number of repetitions in the column (which is what we are after in column stores). So far so good…
Mutual Information I(X;Y) and Variation of Information d(X,Y)
With entropy, we can get a rough ranking of columns by their compressibility. This is generally useful, even for page/dictionary compression. In other words, entropy allows us to come up with good candidates for columns to lead a clustered index on a fact table.
However, in column stores, we want to take this one step further. We are interested in how columns relate to each other, so we can group similar columns together during the index build phase and achieve even fewer run length and higher compression. This eventually translates to more query performance and space savings.
Fortunately, information theory provides us with an estimate of this, namely: mutual information. For two columns, it is defined as:
p(x,y) is the empirical measured frequency of each column combination, divided by the total number of combinations of X and Y. We can extend the SQL statement above to this:
SELECT X, Y
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) pXY
GROUP BY X, Y
And by joining up the calculated probabilities from the Entropy calculation we can find I(X;Y). We are almost there, bear with me a little longer. The value I(X;Y) tells us something about how much information (again, measured in bits) the column share. Combined with H(X) we can now calculate the “information distance” between two columns. This is know as the Variation of Information and is denoted by d(X,Y):
At this point, it should be clear that it is possible calculate a matrix like this:
Think of this matrix as the information distance between each column, a map of the entropic landscape if you wish.
Side note: Unfortunately, this matrix is rather computer intensive to build. It takes O(n * c2) space and time to calculate I(X,Y) for all column combinations, with c being the column count in the table and n being the row count. It also requires some interesting, high speed data structures (may be subject of another blog) that I had to build myself because .NET 4 “concurrent” dictionary was not up to the task. I am currently looking for a way to estimate I(X,Y) instead of brute forcing the exact number. Ideas on how to approach this will be most appreciated.
Why Optimal is Hard
With our d(X,Y) matrix, we may be able to find a path through the information landscape that minimises the amount of shared information of columns. Intuitively (?) this should increase column store compression by minimising the number of run lengths.
At this point, I have to make a few disclaimers. What I am about to propose is NOT an exact solution, in fact, it may even be worse than guessing in some cases. There are a couple of reasons for this
First of all, just because I know the mutual information I(A;B) and I(B;C) it does NOT follow that I know the total mutual information of the combination represented by I(A;B;C). For example, there could be an underlying structure in the data that is not detected by comparing column pairs. Such an underlying structure could send the following algorithm off into a walk goose chase.
Second, it may not even be true that lexicographical ordering of the column values leads to optimal compression. There are other ways to sort columns, for example in Gray Cycles.
Last, and most important: finding the “shortest path through the information landscape” turns out to be an NP-hard problem. This has been proven here: “Reordering Columns for Small Indexes”. The problem has even been baptised: RUNLENGTH.
What does it mean that a problem is NP-hard? Consider for a moment our reordering of columns problem. There are c! ways to order the c columns in a table. This is a huge number ( c! = c * (c-1) * … * 1). We cannot explore them all in reasonable time, so we need an algorithm that finds the best one without looking at every combination. When a problem is NP-hard, it means that such an algorithm is known to either NOT exist, or if you can find it, you will win the Nobel price (because you would then have proven the great mystery of computer science, namely the P = NP problem).
Can We Make a Good Guess?
When a problem is NP-hard, our best bet is to come up with good heuristics. With the d(X,Y) matrix available, one particular heuristic lends itself well to easy implementation:
- Let S be an ordered set of columns
- Pick a starting column c1 (the one with lowest entropy H(X) is often good)
- Add c1 to S
- From the columns not yet in S, find the one that has the smallest distance to any column in S. Add this to S
- Repeat above until all columns are explored
- Intuitively, we are greedily traversing the “landscape of information”, always adding the column that is closest to the part of the landscape we have already explored.
Early Test Results
With the help of readers on this blog, I have managed to collect sample datasets from four retailers. Running my algorithm above, I can now quantify how well it does on these datasets. The following table lists the results:
A bit more detail about the algorithms here
Cardinality sorts the columns either by lowest cardinality first (ASC) or highest cardinality first. Notice that this heuristic actually makes compression WORSE than random in some cases. Time complexity is (n * c) and interestingly, the space complexity can be done in O(c) if using HyperLogLog to estimate the cardinality.
H(X) Entropy sorts the columns by either increasing or decreasing entropy. On the tested datasets, this algorithm fare somewhat better. It also had the advantage of being very fast to calculate: O(n * c)
d(X,Y) Greedy is the algorithm described above (which is very slow to calculate: O(n * c2)
Random is random ordering of the rows.
All data sizes have been measured using the SQL Server 2012 column store compression
To compute the test results above in reasonable runtime, I have written a little utility in .NET 4 called TableStat.Exe.
This utility can run on a sample and assuming there are not too many columns, crunch through tens of thousands of rows every second to calculate all the algorithms described above. When it is done running, it will output the SQL statements required to run the experiments to measure compression. Optionally, it will even run the experiment for you.
I would really like to collect more data on my candidate heuristics. If you have some real data that you would like to test on, I can get you a copy of TableStat.Exe. I don’t even need to see your data (the test can be run fully on your site), but I would ask that you share the compression results with me.
Send me an email, or comment on this blog, if this has your interest.