
Running SQL Server on a Big Box – Configuration

In this next instalment of the big box blog series I will provide some guidance on how you configure a big box for best performance.

I have previously blogged about my standard guidance for SQL Server configuration. But for a big box, you need more than this.

Think for Yourself!

From the introductory blog, you may recall that I mentioned big boxes are rare. I estimate that 95% of all boxes out there are not big. If you bought a big box, most of the “best practices” and standard guidance you find do not apply to you. You will have to deviate from the standards, sometimes in ways that there is no “official” guidance for. In other words, you can no longer rely on Microsoft to think for you; you have to think for yourself. The corollary is that you can’t rely on me to think for you either. You should take everything I say here with a pinch of salt – it may not apply to your specific case.

This doesn’t mean you should start guessing random configuration changes. I am afraid you just need to bite the bullet and do the hard work of thinking and testing your theories. A solid grounding in scientific methodology will come in handy.

Generic Configuration

Irrespective of the workload you run, there are some general things you need to configure.

Tempdb: One file per core, no exceptions.

Trace Flag 8048: Documented here, this trace flag enables an important fix.

Trace Flag 1236: Documented here, as above.

Latest Windows Version and service pack: Your IT department may not have “validated” this. Refer to the “think for yourself” principle. The Windows kernel is constantly evolving, and among the areas seeing improvement are networking, NUMA awareness, I/O and memory management. Because you are running a big box, those improvements are there for you.

Latest SQL Server Service Pack: Same argument as above. You are running a big box; learn to live dangerously.

Automate collection of Spin Lock and Latch Stats: If you are not already collecting the contents of sys.dm_os_spinlock_stats (could someone please ask Microsoft to document this one?) and sys.dm_os_latch_stats, you should be. These DMVs become almost as important as sys.dm_os_wait_stats.
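A minimal sketch of such a collection, assuming hypothetical history tables (dbo.SpinlockHistory and dbo.LatchHistory) that you create yourself and populate on a schedule, for example from a SQL Agent job:

-- Create the history tables once (same shape as the DMVs plus a timestamp):
SELECT SYSUTCDATETIME() AS capture_time, * INTO dbo.SpinlockHistory FROM sys.dm_os_spinlock_stats;
SELECT SYSUTCDATETIME() AS capture_time, * INTO dbo.LatchHistory FROM sys.dm_os_latch_stats;

-- Then append a fresh snapshot on every run, so you can diff the cumulative counters over time:
INSERT INTO dbo.SpinlockHistory SELECT SYSUTCDATETIME(), * FROM sys.dm_os_spinlock_stats;
INSERT INTO dbo.LatchHistory SELECT SYSUTCDATETIME(), * FROM sys.dm_os_latch_stats;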

Networking

You need the best network drivers you can find. With good drivers, the CPU load of the network can be spread evenly over all cores. However, it is often a good idea to affinitise the network (and I/O) to specific cores and set up SQL Server’s affinity mask to avoid those cores. Getting that configuration optimal is trial and error, but XPERF/WPA helps a lot here, as it allows you to pinpoint where the kernel time goes. I have generally had the best experience with Intel NICs and prefer them because I am familiar with the driver stack; I hear that Broadcom NICs are now very good too. You may need to turn off TCP/IP Chimney if you see sporadic disconnect errors from SQL Server (the infamous “Communication Link Failure”).
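If you do end up affinitising, a minimal sketch of the SQL Server side looks like this; the core numbers are purely illustrative, and the NIC side is typically handled through the driver’s RSS/interrupt affinity settings:

-- Reserve cores 0-1 for network/storage interrupt handling and bind SQL Server
-- to the remaining cores (example numbers for a 48-core box).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 2 TO 47;

-- To revert to the default (let SQL Server use all schedulers):
-- ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;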

You should also make yourself familiar with this paper and the method it describes.

What you are going to experience from here is likely going to depend on the type of workload you run.

OLTP

This workload is characterised by a high amount of user concurrency with a lot of short running queries. Typically, you will be doing tens (or hundreds) of thousands of round trips per second to the server.

The great news is that OLTP workloads are the most likely to benefit from a big box, at least if well written. The main issue you will have with an OLTP system on a big box is thread coordination. If your main wait stat is locking (LCK), moving to a bigger box is unlikely to help your workload. In fact, it will most likely slow it down, because locks take longer to acquire on a big box. A great example of this effect can be seen in TPC-C on the WAREHOUSE table. This table is highly contended with update locks. Moving a small TPC-C workload from a 2-socket to a 4-socket machine will typically slow down the workload.

Once you have eliminated LCK waits, the next thing you have to worry about is latches and spinlocks. Of these, the latches tend to be the most common.

PAGELATCH

The PAGELATCH wait tends to appear in three situations:

Too few data files: This typically results in waits for PAGELATCH_UP. They are easy to diagnose, and the solution is to add more data files to the filegroups in the database (see the sketch after this list). However, even when your number of files equals the number of cores, you may still see a little of this wait. Here is an OLTP workload at 48 cores with a varying number of data files:

[Figure: PAGELATCH waits on a 48-core OLTP workload with a varying number of data files]

Last page INSERT: Typically shows itself as PAGELATCH_EX. This happens when you use IDENTITY or SEQUENCE to generate keys. I have written elsewhere about this scale problem. The solution is to use GUIDs or reversed SEQUENCEs for keys instead. Fixing this can be a major change in the database design.

Small tables with contested UPDATE: Typically shows itself as PAGELATCH_EX and PAGELATCH_SH. This happens when a small table with narrow rows has a lot of UPDATE statements running against it. The solution is to “pad” the rows so each row fills a full page. Adding a CHAR(4000) NOT NULL column with a default will typically fix this (see the sketch after this list). On SQL 2014, a default value is not enough; you have to actually populate the padding column with an UPDATE statement.
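To make the first and third fixes above concrete, here is a minimal sketch; the database, filegroup, table and path names (BigBoxDB, DATA, dbo.Counters and so on) are hypothetical and the sizes are illustrative only:

-- Too few data files: add files to the contended filegroup.
ALTER DATABASE BigBoxDB
ADD FILE (NAME = 'BigBoxDB_Data02', FILENAME = 'D:\Data\BigBoxDB_Data02.ndf', SIZE = 50GB)
TO FILEGROUP [DATA];

-- Contested small table: pad the rows so each row fills a page.
ALTER TABLE dbo.Counters
ADD PadCol CHAR(4000) NOT NULL DEFAULT (REPLICATE('x', 4000));

-- On versions where the default is only stored in metadata (see the comment thread below),
-- explicitly write the value so the padding is actually materialised on the pages:
UPDATE dbo.Counters SET PadCol = REPLICATE('x', 4000);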

WRITELOG

The transaction log is the final wall for DML heavy workloads like OLTP. Your best bet is to place the log on 2-4 dedicated 15K spindles with a good write caching controller. Don’t put anything else (like data files) on the log drive if you do this. Alternatively, place the log on PCIe based SSD with low latency. You should aim to have the log writes take less than 100 microseconds (0.1ms).
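To check whether you are anywhere near that target, the virtual file stats are enough; note that this shows the average since the counters were last cleared, not an instantaneous value:

-- Average write latency per transaction log file, in milliseconds.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS log_file,
       vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG';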

Unfortunately, even if you throw the best disk system at the log, you can still write transactions faster on a 2-socket than on a 4-socket machine. This happens because the log has a central spinlock protecting it (it has to). This spinlock can be locked and unlocked faster on a small system than on a big system. If you see the LOGCACHE_ACCESS spinlock grow very fast as you add more work to the system, you know you are bumping into this bottleneck. If this is your bottleneck, you are likely better off on a 2-socket machine.
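A simple way to watch for this is to sample the spinlock DMV twice under load and look at the growth; a minimal sketch:

-- Measure how fast LOGCACHE_ACCESS spins and backoffs grow over one minute.
SELECT name, spins, backoffs
INTO #before
FROM sys.dm_os_spinlock_stats
WHERE name = 'LOGCACHE_ACCESS';

WAITFOR DELAY '00:01:00';

SELECT s.name,
       s.spins - b.spins       AS spins_per_minute,
       s.backoffs - b.backoffs AS backoffs_per_minute
FROM sys.dm_os_spinlock_stats AS s
JOIN #before AS b ON b.name = s.name;

DROP TABLE #before;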

Spinlocks

Once you really push an OLTP system, you may see spinlocks increase. It is difficult to say something general about what to do about these. You simply have to diagnose each as you see it and come up with a good guess on how to work around it, or ask PSS for a patch.

Some examples of spinlock bottlenecks I have come across:

MUTEX: I ran into this one while doing dynamic SQL with sp_executesql. The solution was to do more work inside the sp_executesql call than in the loop surrounding it.

SOS_OBJECT_STORE: I saw this spin in two cases. One of them I blogged about here. The other was related to spins protecting the security data structures. It was fixed by making all users sa – not a nice workaround (There was a SQL Bits session about this one). I think this has now been fixed.

LOCK_HASH: This spin used to be quite common on OLTP workloads but has now largely been patched. From the name, you can guess that it is related to locking. However, you have to drill deeper to understand what lock is causing it. In my particular case, I saw it when a workload used custom schemas (i.e. not dbo). The workload was improved by moving all objects to dbo! I believe this has now been patched, as I have not been able to reproduce it on the latest service packs.

Some spinlocks of the same name are used in several areas of the engine, so even if my MUTEX spin had one solution, it does not mean yours has the same one. You just have to diagnose it using the guidance in the paper above.

The only trace flag I am aware of that you need to set to enable the latest fixes is T9024, which is relevant for AlwaysOn systems. If you know of any more, please comment.
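For reference, trace flags like this can be turned on at runtime or, preferably, as a startup parameter; verify against the relevant KB article and your own build before enabling anything:

-- Enable globally at runtime (example flag from the text above):
DBCC TRACEON (9024, -1);

-- See which trace flags are currently active:
DBCC TRACESTATUS (-1);

-- For a permanent setting, add -T9024 to the SQL Server startup parameters instead.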

Data Warehouse

Also known as DSS, Analytics, BI, OLAP, Big Data and whatever other names expensive consulting agencies come up with to sound like they are inventing things instead of just renaming them. This workload is characterised by a low amount of query concurrency, but a lot of long running queries that use parallel execution.

First of all, very few parallel queries scale well beyond DOP 16 in SQL Server; index builds being the notable exception. Joe Chang has done several experiments illustrating this and so has Chris Adkin – their experience matches mine. Because of this limitation, it is tempting to configure max degree of parallelism to 16. However, it is worth making yourself familiar with Adam Machanic’s “Magic CROSS APPLY” and his make_parallel procedure.
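If you do decide to cap parallelism at the instance level, the change itself is simple; the value 16 is only an example and should be validated against your own queries:

-- Cap instance-wide max degree of parallelism (example value only).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 16;
RECONFIGURE;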

In fact, if you are planning to run a data warehouse on a big box, you need to read all of Adam’s articles before you do anything else. When you have read them, ask yourself: “Why do I need to do all this to make it work?”. Then, think carefully about what features you REALLY want in the next version (hint: Single table restore or XML indexes are probably no longer that important).

ACCESS_METHODS_DATASET_PARENT

This latch is generally an indication that something is bottlenecking in the query plan. One trick that often works is to hash partition the tables involved. Sometimes, you have to dig into the specific operator in the query plan by looking at the resource_description of the CXPACKET wait.
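To find that operator, something along these lines works while the query is running:

-- Which plan node are the parallel threads waiting on right now?
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type = 'CXPACKET'
ORDER BY wt.wait_duration_ms DESC;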

Network

On a data warehouse, there are two settings you nearly always want to run:

Network Packet Size: Set it to 32767. It improves throughput a lot (see the sketch after this list).

Jumbo Frames: If possible on your network, enable it.
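A minimal sketch of the packet size change (jumbo frames, by contrast, are configured on the NICs and switches, not inside SQL Server):

-- Raise the server-side network packet size to its maximum of 32767 bytes.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'network packet size', 32767;
RECONFIGURE;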

You are likely going to need at least one or two 10Gbit NICs on a big box. Alternatively, get several 1Gbit NICs.

SELECT INTO and INSERT…SELECT

Until SQL Server 2014, the insert part of these common data warehouse queries runs single threaded. You are likely to need some manhandling of the ETL flow to speed up inserts on a big box. Refer to this whitepaper.

If you are not able to parallelise these types of queries, you are again likely to see a regression as you move to a big box. The reason is that small boxes have much lower latency on the transaction log and can run a single thread faster than a big box (because smaller boxes have higher clock rates and lower memory access latency).
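One common workaround is to split the load into non-overlapping key ranges yourself and run the inserts from several sessions at once; a minimal sketch with hypothetical table names and ranges (how far you can push this depends on locking, logging and your recovery model):

-- Session 1: load the first slice of the staging table.
INSERT INTO dbo.FactSales
SELECT *
FROM dbo.FactSales_Stage
WHERE SalesKey >= 0 AND SalesKey < 10000000;

-- Session 2 (run concurrently from another connection): the next slice.
-- INSERT INTO dbo.FactSales SELECT * FROM dbo.FactSales_Stage
-- WHERE SalesKey >= 10000000 AND SalesKey < 20000000;
-- ...and so on for the remaining ranges.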

I/O and Filegroups

Unless your data warehouse largely fits in DRAM, you are going to need a monster I/O system to feed the database engine. If anyone tells you that SQL Server or Windows cannot do I/O, don’t believe them.

SQL Server is a greedy beast: it can eat tens of GB/sec of bandwidth, and Windows will happily deliver. Likely, you are going to need some serious SSD (prefer bandwidth over latency for a data warehouse) or a Fast Track design. Test the design carefully before deploying it. When using traditional, spinning media, don’t be afraid to leave a lot of space empty in order to get the speed you need.

If you are tuning for sequential I/O on a data warehouse with spinning disks, it is wise to follow the Fast Track guidance. Even though you may not own a Fast Track reference configuration, the ideas still apply to you. This means having 1-2 files per 2-4 drives and splitting those drives into a lot of LUNs. This can be a real nightmare to manage (you MUST script it; see the sketch below), but the performance is often worth it.
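A minimal sketch of what such a script generates, with hypothetical database, path and size values:

-- One data file per LUN mount point; repeat for every LUN, ideally generated from a script.
ALTER DATABASE BigBoxDB ADD FILEGROUP FactData;

ALTER DATABASE BigBoxDB ADD FILE
    (NAME = 'FactData01', FILENAME = 'E:\LUN01\FactData01.ndf', SIZE = 100GB, FILEGROWTH = 0),
    (NAME = 'FactData02', FILENAME = 'E:\LUN02\FactData02.ndf', SIZE = 100GB, FILEGROWTH = 0)
TO FILEGROUP FactData;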

NAND flash is now getting very cheap, in the 1 USD/GB range. This means that you can now buy a good “enterprise” flash drive like the Intel 3600 for less than what you pay for SAN storage. The cheaper drives can’t take as many write cycles as the expensive ones, but you would be surprised at just how many write cycles they CAN take. They are likely to outlive the server. Additionally, if you are running a data warehouse, chances are you will be reading a lot more than you are writing. Consider whether it is worth the trouble to tune for sequential I/O on spindles, or if you just want to bite the bullet and get a pile of cheap SSDs in some big SATA disk shelves.

Backup

On a big box data warehouse, you are likely to need a dedicated network just to handle backups. Again, SQL Server is capable of spamming the backup system with tens of GB/sec, and this kind of bandwidth is likely to overwhelm your existing backup infrastructure. Consider using a dedicated backup system for the data warehouse.

If the database is larger than a few TB, you should be thinking about a filegroup based backup strategy. Remember to back up your PRIMARY filegroup every time (and don’t place any tables or indexes in it).
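A minimal sketch of such a backup, with hypothetical database, filegroup and path names:

-- Back up PRIMARY every time (it stays small if you keep user objects out of it),
-- and rotate through the large filegroups on their own schedule.
BACKUP DATABASE BigBoxDB
    FILEGROUP = 'PRIMARY',
    FILEGROUP = 'FactData'
TO DISK = 'B:\Backup\BigBoxDB_FactData_full.bak'
WITH COMPRESSION, CHECKSUM;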

Summary

In this blog, I have described some tricks you need to apply if you want scale on big boxes. I will be updating this with more details when I have time to write again. However, I wanted you to have early access to this guidance.

 

  15 Comments

  1. Chris Wood   •  

    You mention monitoring on spinlocks and latches and that sys.dm_os_spinlock_stats is not described by Microsoft. I see that the spinlocks DMV has columns for collisions, spins, backoffs, spins_per_collision and sleep_time. What would be your priority on these columns? I would like to incorporate these stats in my monthly reporting on Production servers.

    Thanks

    Chris

    • Thomas Kejser   •     Author

      The column I would most worry about is the backoffs. If that is high, a lot of spins give up and turn into waits.

    • Thomas Kejser   •     Author

      That is great news. Much more detail than they ever had before!

  2. Mike West   •  

    Thomas,

    On a SAN using shared drives on the back end with one shelf, would you consider putting the data and log files on the same LUN?

    Mike

    • Thomas Kejser   •     Author

      If you are sharing drives between the data and log file – no reason to split things up. So yes, if you are on SAN storage and the storage admin cannot give you a dedicated drive for the log – might as well share.

      • Thomas Kejser   •     Author

        Of course, if you do find yourself in the situation where a SAN admin cannot give you a dedicated drive for your log – it is probably a good career move to find another SAN admin or work for an organisation that treats DBAs with respect.

  3. Bruno Martinez   •  

    “On SQL 2014, a default value is not enough, you have to actually change the padding column with an UPDATE statement.” where is this improvement explained?

    • Thomas Kejser   •     Author

      Hi Bruno

      I was not able to find any documentation on it. It was purely by accident I ran into it. It is easy to validate the behaviour for yourself. Simply do this:

      1) Create a table Foo with some data
      2) Run sp_spaceused on the table
      3) Do ALTER TABLE Foo ADD Bar CHAR(4000) NOT NULL DEFAULT (REPLICATE('x', 4000))
      4) Run sp_spaceused again. Observe that the table is the same size as before (unlike in SQL Server 2012)

      • Andrej Kuklin   •  

        Potentially this is the feature of adding a NOT NULL column as an online operation, which was introduced in SQL Server 2012 (Enterprise Edition).

        http://msdn.microsoft.com/en-us/library/ms190273.aspx

        Starting with SQL Server 2012 Enterprise Edition, adding a NOT NULL column with a default value is an online operation when the default value is a runtime constant. This means that the operation is completed almost instantaneously regardless of the number of rows in the table. This is because the existing rows in the table are not updated during the operation; instead, the default value is stored only in the metadata of the table and the value is looked up as needed in queries that access these rows. This behavior is automatic; no additional syntax is required to implement the online operation beyond the ADD COLUMN syntax. A runtime constant is an expression that produces the same value at runtime for each row in the table regardless of its determinism. For example, the constant expression “My temporary data”, or the system function GETUTCDATETIME() are runtime constants. In contrast, the functions NEWID() or NEWSEQUENTIALID() are not runtime constants because a unique value is produced for each row in the table. Adding a NOT NULL column with a default value that is not a runtime constant is always performed offline and an exclusive (SCH-M) lock is acquired for the duration of the operation.

        While the existing rows reference the value stored in metadata, the default value is stored on the row for any new rows that are inserted and do not specify another value for the column. The default value stored in metadata is moved to an existing row when the row is updated (even if the actual column is not specified in the UPDATE statement), or if the table or clustered index is rebuilt.

        Columns of type varchar(max), nvarchar(max), varbinary(max), xml, text, ntext, image, hierarchyid, geometry, geography, or CLR UDTS, cannot be added in an online operation. A column cannot be added online if doing so causes the maximum possible row size to exceed the 8,060 byte limit. The column is added as an offline operation in this case.

  4. Thomas Kejser   •  

    Tobi, if you think about what it means for a transaction to generically be ACID compliant, you will realise that every event in the database must happen in SOME sequence that must be determined centrally. Without a sequencing of events, you can’t implement proper isolation levels.

    It is the dishing out of that sequence number, not the log write itself, that eventually sets the final bottleneck. Every ACID compliant database has this issue. Oracle allows multiple transaction logs, but it only gives you more scale if those logs handle independent tables.

    Of course, if you had a part of the database that never participated in transactions with anything but its own little isolated world, this part could be serialised without invoking a central log manager. This would be the equivalent of putting that part of the database into its own database in SQL Server.

    Another simplifying assumption that maintains ACID would be to require that statements only ever operate on one table or some restricted subset of the rows in a set of tables (for example, a shard). This again allows you to remove the bottleneck – something NoSQL systems take advantage of.

    • tobi   •  

      Valid points. I’m not convinced, though, that handing out IDs requires a very strict notion of global coordination. For example, each CPU core could “check out” 100 IDs in one operation every millisecond. That would assign IDs in almost clock order. Almost (for some definition of this word) is enough when dealing with clocks and time. Once the IDs are assigned, their numeric value determines the time ordering.

      I’m under the impression that enqueueing log records is the actual bottleneck in the current product. Given that the contents of a record are determined (using scalable means) it should be possible to enqueue it in a scalable way as well. Hekaton seems to be able to pull it off in its transaction infrastructure (relying on the TLOG only for storage of huge opaque log records).

      A more abstract statement would be that I believe the SQL team could make it work but for one reason or another they haven’t.

      That’s all I’m going to say on this. I’m relying on quite some assumptions here.

      • Thomas Kejser   •     Author

        Tobi, if you benchmark Hekaton, you will see that it too will run into the LOGCACHE_ACCESS spin at some point (I generally see this bottleneck at speeds around 1GB/sec into the log).

        You are correct that each CPU could take out a series of numbers at a time. However, you are just moving the problem around. The issue you face with a “batch numbering” scheme is what needs to happen if one of those numbers results in a rolled back transaction. This will affect all the CPU cores that have sequences overlapping with that failed tx. While the mess is being sorted out and log entries are being retired, everything stalls. The bottleneck now moves to hardening the log to find out what the latest successful transaction is – and this, again, requires serialisation.

  5. tobi   •  

    The sequential transaction log architecture always struck me as an odd bottleneck. I don’t see why log records must be written in a single stream. There could be multiple streams and readers would dynamically recombine them.

    • JamesNT   •  

      tobi,

      Having to dynamically recombine would then become your new bottleneck. Think about TCP/IP: Packets don’t have to arrive in order and can be re-requested as needed. Sounds awesome until you are watching YouTube and your video begins to skip because some packet that should have arrived at moment 11:30:23 came in at 11:30:55. That is why video sites do so much buffering.

      You are robbing from Peter to pay Paul.

      JamesNT
