[Image: Domino pieces about to fall]

High Availability, at High Speed

In the quest for 100% uptime, a great many hours must be invested in careful design. Even when you think you have eliminated every Single Point Of Failure (SPOF) – something new always shows up. It gets worse: all this effort is multiplied if you want to BOTH achieve high availability AND run the system at high speed.

In this blog, I will share some lessons we learned the hard way while tuning our high speed SQL Server mirror.

Server Design

At the centre of our 1M user system is a SQL Server handling all of our logins and data requests. Every second, we are hammered with tens of thousands of batch requests on this server.

Through careful tuning and good schema design, we are able to handle all user requests with average response times under one millisecond – even when we spike at over 100K requests/sec. This can be verified by querying sys.dm_exec_query_stats on our system.
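A query along these lines (a sketch, not our exact monitoring query) shows the average elapsed time per cached statement:

```sql
-- Average elapsed time (microseconds) per statement in the plan cache
SELECT TOP 20
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_us DESC;
```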

High availability is achieved via a SQL Server mirror on 2008 R2. Because we don’t rely on clustering services or storage replication, both the database and the OS are very easy to manage. The only thing you have to be a bit careful about is making sure all logins on both the primary and the secondary are correctly mapped.
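A quick way to spot mis-mapped logins after a failover is to look for database users whose SID has no matching server login. This is a sketch of such a check, not a full remapping script:

```sql
-- List SQL users in the current database that are orphaned
-- (their SID does not match any login on this server)
SELECT dp.name AS orphaned_user
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
    ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type = 'S'        -- SQL users only
  AND dp.principal_id > 4; -- skip dbo, guest, sys, INFORMATION_SCHEMA
```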

We don’t have a storage administrator and we don’t own a SAN.

The server, although a bit old by today’s standards, has been designed to be both fully balanced on performance and high availability. This is achieved via this configuration in each node on the mirror:

  • Two-socket Intel E5-2680 CPUs
    • 2 sockets minimise NUMA latency.
    • This also gives us the highest possible clock rate
    • Lesson Learned: Today, I would pick a Sandy Bridge with the highest possible clock rate and the largest CPU caches. This would give me more bang for the buck on my core licenses
  • 256GB of DRAM
    • Because memory is the cheapest performance you will get.
    • Lesson Learned: A smaller amount would have given us lower access times, which might have been a worthwhile trade-off
  • 2 x 1Gbit NIC (4 ports on each NIC) for application and mirror networks
    • 2 x NIC teams (2 x 1Gbit in each). One team for mirror traffic, the other for the application network
    • The teams both span the physical NICs so we can lose a card without going offline.
    • Lesson Learned: We actually have more NIC pairs available than we use. NICs are cheap. Since kernel time from the NIC is getting rather high (about 50% of one core, non-stop), it would have been nice to have multiple network listeners for the app servers.
  • 2 x 10Gbit NIC for the backup network
    • Again, teamed for high availability
    • Lesson Learned: Probably better to run each NIC on its own (for reasons that shall soon become clear)
  • 3 x ioDrive2 Duo
    • Presented as 6 LUNs to the server. Default formatting and 4K sector size
    • Striped, using a Windows stripe
      • We already have a copy of all data on the mirror and can restore an additional copy in a few hours. This means we are not so worried about losing a member of the stripe and taking the entire disk down. Since the Fusion-io cards are located inside the server, replacing a failed drive requires downtime anyway.
    • Data files and log files placed on the single 6.5TB mega stripe.
      • IOPS, like memory, is a global resource. Pool it together in the server for maximum throughput and minimum latency
    • These Fusion-io babies run at around 600MB/sec non stop with great latency consistently under 1ms.
    • Lesson Learned: Today, we would probably have gotten away with Intel SSD which are now good enough for this type of configuration and could run a RAID10 at the same price. However, the Fusion-io has served us very well and the latency is simply amazing.

SQL Server Configuration

Noteworthy configuration settings, apart from the ones described in Default Configuration of SQL Server, are:

  • SQL Server is configured to use 245GB (leaving 11GB for the OS etc.)
  • Max Worker Threads set to 8192
  • User Connections: 7500, to avoid the situation where a rogue app server spawns too many connections and then thread-starves essential background tasks.
  • MAXDOP set to 8
    • We don’t really want parallel queries except for a few extracts
    • Lesson Learned: I would have preferred to set this to 1 and have the reporting queries override the setting – but there is too much code to trawl through to fix.
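The settings above map to a handful of sp_configure calls. This is a sketch using the values from this post; adjust them to your own hardware before copying anything:

```sql
-- Instance configuration matching the settings described above
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 250880;   -- 245GB
EXEC sp_configure 'max worker threads', 8192;
EXEC sp_configure 'user connections', 7500;
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;
```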

Databases and Backups

There are 11 databases hosted on the server, 9 of them mission critical. All the mission critical databases run synchronous mirrors. The largest database is 4.5TB.

At peak usage, our Log Bytes Flushed/sec is 80MB/sec.

The databases all have their data in a single filegroup (PRIMARY) and a single file. Lesson Learned: Always remember to allocate things correctly at database creation. Once you have a table that is 3TB large, you are not going to be rebuilding indexes anytime soon (even with online rebuild). PAGELATCH_UP (from PFS contention) is a major wait type on the server, averaging 80ms per wait under peak load.
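You can measure this kind of latch contention yourself from the wait statistics DMV. A sketch of the query we use to keep an eye on it:

```sql
-- Top page-latch waits by total wait time; a high PAGELATCH_UP average
-- usually points at PFS/GAM/SGAM contention on too few data files
SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
ORDER BY wait_time_ms DESC;
```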

A full backup is run twice daily of all databases. Data is moved to another server with a big disk shelf of SATA drives. Transaction log backups are run every minute (but used to be every 15 minutes – see later).

When a new database is introduced, it is first test-driven in SIMPLE recovery mode while we quantify the amount of log traffic it generates and get some idea of the query workload. Once we are happy with it, it is moved to a synchronous mirror and FULL recovery.
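The promotion itself is straightforward; the mirroring endpoint setup is omitted here, and the database and path names are placeholders:

```sql
-- Switch the proven database to FULL recovery; a full backup then
-- starts the log chain required before mirroring can be established
ALTER DATABASE MyNewDb SET RECOVERY FULL;
BACKUP DATABASE MyNewDb TO DISK = 'X:\Backup\MyNewDb.bak';
```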

Tempdb has 16 files, each 4GB and a 64GB log file. Autogrowth is disabled on tempdb.

User databases have autogrowth disabled on the log file and autogrowth set to 4GB on the data file.
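Both policies are one-line changes per file. A sketch, with the database and logical file names as placeholders:

```sql
-- Data file: grow in fixed 4GB steps; log file: no autogrow at all
ALTER DATABASE MyDb MODIFY FILE (NAME = MyDb_data, FILEGROWTH = 4GB);
ALTER DATABASE MyDb MODIFY FILE (NAME = MyDb_log,  FILEGROWTH = 0);
```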

Schema and Tables

One table in the system represents 80% of all storage. 95% of all data resides in the four largest tables and 99% in the largest nine.

The workload is about 50/50 read/write.

Our largest table has a clustered IDENTITY(1,1) column as the primary key. This table is the source of our largest wait type: PAGELATCH_EX (higher than WRITELOG and PAGEIOLATCH combined) – with an average wait time of 8ms on the latch. We are currently in the process of moving this table to a cluster on a hash key. See the problem described here: Clustered Indexes vs. Heaps. Lesson Learned: Clustered IDENTITY(1,1) columns are evil and will come back to bite you when you grow the database.

We never rebuild indexes, though we run the occasional index defrag after we have deleted a lot of rows in a table. Lesson Learned: Big OLTP databases run perfectly fine without index rebuilds. There really isn’t much reason to do rebuilds if the server is spec’ed right.

Schema change control is handled via a Bitbucket repo using a Git Flow branching strategy. Stored procedures are simply overwritten directly from the repo when they are merged to the master branch, which happens via an approved pull request. Table changes are done by first deploying the schema to a test server and then running a SQL Compare and validating the result (to make sure no downtime is introduced).

A Nasty Surprise

With the exception of a few blips on the network, things have been running pretty smoothly for us. Our Linux boxes are much more likely to crash with a kernel panic than these Windows machines (these boxes have never BSOD’ed).

Everything is redundant in the system, so service failure is not an issue for us. Runaway queries and app servers grabbing too many connections present a much larger problem than the database and server. However, there was one thing we didn’t think of…

Recall that this system runs at 80MB/sec on the log file and does full database backups of 4.5TB of data twice per day. With the low-latency Fusion-io for log recovery, this makes our restore time super fast. However, with such frequent backups and a high insert rate, the result is a pretty significant amount of bandwidth on the backup network. Our original configuration was to take transaction log backups every 15 minutes (a historical choice) – around 50GB of log is generated in that time. Combined with the full backups, the data flow of the system looks like this under stress:

[Image: Mirror Speed]

Can you see the problem? What happens if the backup server suddenly slows down… or disappears?

It just so happened that we had a DBA install a SQL Server on our backup server to test a new restore process. While this restore was running, the backup server ran out of IOPS and we could no longer sustain the required 250MB/sec throughput to keep the log free space at a reasonable size. To add insult to injury, this test restore ran on a Friday night, and the alarms went off when our log files (which at that point were set to autogrow) had eaten nearly all of the disk space.

In such a situation, there are really only two ways to recover:

  • Backup the log
  • Truncate the log by backing it up to NUL: (the Windows version of /dev/null)

The first option isn’t that easy to execute when you are out of space and have not yet diagnosed why the backup server isn’t keeping up. Furthermore, even when you get the full speed of the backup link going, you still need to wait for the backup to catch up. Pray that this catch up time does not overlap with the time you take a full database backup.

The second option is not an option at all. If you back up to the NUL device, the backup chain is broken, and once it is, point-in-time recovery is no longer available.

We did the right thing and picked the first option. Thanks to our high disk speed on the Fusion-io drive, we quickly backed up 750GB of log (in less than one hour) and reclaimed the disk space by shrinking the log. The database could now auto grow again and we could return to business as usual.
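The recovery boils down to two statements. A sketch of them, with the database, file, and path names as placeholders:

```sql
-- Back up the log the normal way: this preserves the backup chain
BACKUP LOG MyDb TO DISK = 'B:\LogBackup\MyDb_emergency.trn';

-- Then reclaim the space: shrink the log file back down to 64GB (target in MB)
DBCC SHRINKFILE (MyDb_log, 65536);
```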

Lessons Learned

There are some important lessons to be learned here:

  • Make sure your backup system and the network can handle the traffic the mirror generates.
    • The backup system is a SPOF too.
  • Carefully monitor the log growth and usage to check for anomalies.
    • Sound the alarm before you start eating disk space
  • Reserve bandwidth and disk speed for the log backup running on your database mirror.
    • Make sure other backups (like full database backups) won’t eat from the same speed pool as the log backup.
    • I would have preferred a dedicated NIC and spindles for the log backups in this case.
  • Get yourself some NAND Flash – so you have some real recovery options when the proverbial faeces hits the fan
  • Always cap the log file size so you don’t eat away at disk space you may need for database file growth.
    • Alternatively, don’t auto grow the log at all, pre-allocate it
  • Maximize your transaction log backup frequency.
    • This keeps the backup network consistently busy instead of bursty – which makes better use of IOPS and network link speed.
    • We switched from 15 minute intervals to 1 minute intervals after this event. We get a much smoother backup experience now.
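Capping the log, as suggested above, is a one-line change per database (names and the cap value are placeholders; size it to your own workload):

```sql
-- Hard limit on log growth, leaving headroom for data file growth
ALTER DATABASE MyDb MODIFY FILE (NAME = MyDb_log, MAXSIZE = 256GB);
```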

A Useful Script

I wrote the following script to help our Ops team monitor log and data growth. I hereby share it with you in the hope that it may be useful: