SQL Server Logo

Database Index Page Row Order

A few weeks ago I was asked a great question by Sam, who follows my blog. With Sam’s permission, I am reprinting has question about indexes here.

Sam asks (my highlights) :

“We are told that we can have only one Clustered Index since we can actually sort the data in only one way. However, a clustered index orders data in the leaf pages in sequential order, but does not order the data within that page itself to be in sequential order. Since the data on the page itself is not sorted sequentially (and thus implying more than one way of possible order), I am confused by the “Since we can sort data only in one way we can’t have more than 1 clustered index” reasoning.”

Does Physically ordering rows inside database pages make sense?

 

For example, let us say there is a clustered index on the names of World Cities and let us say that names starting with A and B fall on the first page; names starting with C and D fall on second page. And let us say that on the first page, we have the cities Atlanta, Bombay and Boston and on the second page we have the cities Cleveland, Delhi and Detroit.

In the above scenario, the data could have been stored in the following manner in the Clustered Index:

Page 1 (A to B):
Boston

Atlanta

Bombay
Page 2 (C to D):
Detroit

Cleveland

Delhi

Since, the data on the page itself is not sorted sequentially, it is also possible to store data  in another order as follows:

Page 1 (A to B):
Atlanta

Bombay

Boston
Page 2 (C to D):
Delhi

Detroit

Cleveland

So, while it is logically possible, in reality we cannot create another clustered index, where the order of data across pages is still maintained, but the order within the page can be different.

[…]

Can you give your reasons/opinions here?

Another aspect of this question is, is there any reason SQL Server does not order the data within an individual page of a clustered index? Is it because it  may be more expensive to re-arrange data within the page for every insert but cheaper to insert within the page in whatever order but rely on Row Offset information?

These are excellent questions that touch on some great areas of computer science. I am always happy to give my opinion.

Illustrating the Issue

Before I move on to the opinion bit, I would like to show you a repro that makes illustrates Sam’s point that rows are not physically ordered within the page.

Consider this simple table:

USE tempdb
GO
CREATE TABLE Foo
(
K CHAR(10) NOT NULL PRIMARY KEY
, Payload CHAR(23) NOT NULL DEFAULT(REPLICATE(‘_’, 23))
)
TRUNCATE TABLE Foo

DECLARE @Row INT = 0
DECLARE @Val INT = 42
SET NOCOUNT ON
WHILE @Row < 1000 BEGIN
/* Create semi random numbers, correcting for horrible SQL Server hash function  */
SET @Val = ABS(BINARY_CHECKSUM(@Row, @Val, ‘A’)) % POWER(2,30)
PRINT @Val
PRINT (sys.fn_varbintohexstr(CONVERT(BINARY(4), @Val)))
INSERT INTO Foo (K) VALUES (sys.fn_varbintohexstr(CONVERT(BINARY(4), @Val)))
SET @Row = @Row + 1
END

This will create a table with 100 rows inserted in a random order Now, let us have a quick look at the pages in the table:

DBCC TRACEON(3604)

DBCC IND (2, ‘Foo’, 1)

Pick any PagePID with PageType = 1 and inspect the page with:

DBCC PAGE (2,1,,2)

Here is what I get on my machine from the DATA dump:

0000000071CAA064: 0x000002e1__________
0000000071CAA078: _____________…..%.
0000000071CAA08C: 0x00002f51__________
0000000071CAA0A0: _____________…..%.
0000000071CAA0B4: 0x0002f751__________
0000000071CAA0C8: _____________…..%.
0000000071CAA0DC: 0x002f7651__________
0000000071CAA0F0: _____________…..%.
0000000071CAA104: 0x02f76151__________
0000000071CAA118: _____________…..%.
0000000071CAA12C: 0x032e2c70__________
0000000071CAA140: _____________…..%.
0000000071CAA154: 0x03210390__________
0000000071CAA168: _____________…..%.
0000000071CAA17C: 0x021ffa70__________
0000000071CAA190: _____________…..%.
0000000071CAA1A4: 0x00444590__________

As you can see, the data in the page is not physically ordered on the page. However, let us just see how the slots map on the page:

DBCC PAGE (2,1,,1)

From my machine:

Slot 19, Offset 0xd8, Length 40, DumpStyle BYTE

0000000000000000:  ..%.0x002f7651______

And:

Slot 80, Offset 0x128, Length 40, DumpStyle BYTE

0000000000000000:  ..%.0x032e2c70______

As you can see, the slots on the page are not in the same order as the physical layout of the data. In fact, the slots are ordered as per the index.

Physical vs. Logical Order

By using an indirect mapping of row slots to physical locations on the page, we can use a simply binary search to quickly locate a specific row within the 8K page. Unlike on disk, there is no such thing as “defragmenting” a page to make the physical order match the mapping order.

Would it be worth doing this and paying the cost of keeping rows physically ordered (maybe even in different ways, depending on access pattern).

First, consider why we worry about logical and physical order in the first place. We do this to get more sequential instead of random I/O  on spinning media. This optimisation drives down cost when running on hard drives – but is largely irrelevant on SSD. However, once a page is in memory, the issue of sequential I/O is no longer a problem we need to worry about, we can fall back on old fashioned binary searching. It would therefore seem that it makes no sense to pay the extra cost of keeping the rows physically sorted on the page, because once they are in memory, its fast to find them.

Second, consider the additional cost of logging physical reordering of the pages during inserts and update operations. We would have to log not only the changes in the map, but also which rows were shifted around in the data when a new row is inserted. This makes inserts more expensive and would make page splits, already expensive, more complex.

So far, everything seems to indicate that SQL Server has chosen the right balance with regards to the page map logical order vs. the data physically ordered on the page.

Random and Sequential Memory

The cost of locating a row within 8K of data using a binary search might seem trivial compared to the cost of bringing data from disk to memory in the first place. However, once you consider modern systems that largely run in memory or have very fast disk – the cost you pay to find stuff, even if you know it is memory resident, becomes interesting.

It turns out that it matters quite a bit how you access memory. It takes around 100 CPU cycles to get 64B or 128B of memory from DRAM to the CPU. This quickly adds up and the CPU reports itself as busy while waiting for that memory (there is no such thing as a “CPU waiting for memory” wait stat). I have blogged about sequential memory access here:

Joe Chang also made the interesting observation that decoding exactly where the data you need to look at next is locate is a very CPU intensive operation:

All this indicates that there might be a better way to store data inside pages in SQL Server. For example, if you use a column store, physically reordering the rows, as suggested by Sam, inside the pages can lead to better compression. I have blogged about this here:

Sam’s question hints at another interesting option, namely to not bother maintaining a sorted row map at all. Because visiting memory sequentially is quite a bit faster than visiting it randomly, it may simply be better to visit all the rows in a page instead of trying to be smart about finding the one you are looking for. Such a a memory efficient access pattern has been suggested (and proven) to work in this article:

If the rows were stored sorted, it might even be possible to write a “forward only” search that

Summary

In this blog, inspired by Sam’s question, I have been musing over the trade-offs you have to make when storing data inside the pages of a database. There is quite a bit of performance to be found here by optimising for different workloads and CPU architectures. It is the subject of a lot of research and by following the links above, you can start to dig deeper.

  3Comments

  1. Endrju   •  

    In case of in-memory databases, more and more low-level things starting to become important for performance. Complementing what you described, and referred to, I’d like to link to another vivid explanation of performance difference of processing sorted vs random data at CPU level: http://stackoverflow.com/a/11227902/986161 (don’t care about “Java” in title). I found it useful, and I think other readers of this blog will, too. Thanks.

  2. Chris Adkin   •  

    Thomas,

    With regard to the sorting of data and column stores, for a simple query that aggregates date on a date key column, or any other key column for that matter, I have found that creating a clustered index on the data ( sorted on the column(s) being grouped on and then creating a column store index on this makes a tremendous difference to performance. I suspect that is due to the fact that as you scan down the column store, when the hash aggregate is being performed access to the pages in the hash table is more sequential in nature. However, I’ve been unable to achieve 100 % CPU saturation via this, my theory is that this is down to two factors:

    – The SQL Server 2014 large object cache is not numa aware.

    – Because most of the columns I’m summing are floats, I believe I might be saturating the port at the CPU back end associated with floating point addition.

    SIMD is one solution to this, I’ve also been lead to believe that with the Haswell based micro architecture an execution unit on port 0 ( as well as the one on port 1 – only one that can be used on Sandybridge ) can be used for floating point addition, this is a FMA – fused multiply add execution unit. Another point of interest is that Intel are making noises about unifying the AVX instructions across its future Xeons and Phi co-processors.

    Chris

    • Thomas Kejser   •     Author

      Thanks for the insights Chris. Did you ever blog about your experiments? I would like to link it from here.

Leave a Reply

Your email address will not be published. Required fields are marked *