
Running SQL Server on a Big Box – Scalability

I have previously blogged about my best practices for installing SQL Server. This guidance applies to all the boxes I work with.

However, I think that some special attention is required when it comes to running SQL Server on a “Big Box”. This blog is an attempt to provide an overview of that guidance.

What is a Big Box?

Before I can explain the problems we will face, I need you to get up to speed on NUMA – Non Uniform Memory Access. For an Introduction, I recommend reading Thomas Grohser’s excellent deck:

The deck speaks about Itanium processors, but his points are just as relevant to modern Xeon or AMD processors.

It used to be the case that NUMA only became a problem on 4 socket boxes. Today we are seeing NUMA on 2 socket – and in the case of AMD – even on single socket boxes. The moment the core count goes up and the memory latency goes up with it – you are entering the realm of “Big Boxes”. To use a pragmatic definition, let us just say that a Big Box today is anything with more than 16 physical cores and 128GB of DRAM.
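A quick way to see whether you are already in NUMA territory is to ask SQL Server how many memory nodes it detected at startup. This is just a sketch against the `sys.dm_os_nodes` DMV – column availability varies a little between SQL Server versions, so treat it as a starting point:

```sql
-- Sketch: how many NUMA nodes does SQL Server see on this box?
-- More than one row (excluding the DAC node) means NUMA is in play.
SELECT node_id,
       node_state_desc,
       memory_node_id,
       online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> N'ONLINE DAC';
```

You can cross-check the result against the NUMA node messages SQL Server writes to the error log at startup.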

Why are you Buying a Big Box?

Typically, you buy a Big Box because you want more performance or you want higher density in the rack. For the purpose of this blog, I will ignore the case where you use a Big Box to consolidate or run a lot of virtual machines at high density. I will focus on the performance and scale case. There are some classic mistakes people make at this point. For a Big Box to make sense, the bottleneck in the system should be either CPU or I/O.

To confidently say that you have a CPU bottleneck, your workload must be able to drive your CPU cores, all of them, to 100%. Using 70% of your CPU is not being CPU bottlenecked, it is being underutilised. Let me state this in very clear terms: cores do not run “better” when they are not at 100% – they are not a bottleneck until they are fully used.

In the case of I/O – you can buy a Big Box to fit in more DRAM. With more DRAM, you can cache more, and with more cache, you don’t have to do as much I/O. Of course, this only makes sense if you are mostly reading data. More DRAM is not going to make writes faster. Do you know what your read/write ratio is? If not, go measure it (if you don’t know how, ask me in the comments). When buying a Big Box to fit in more DRAM, you should be aware that a modern 2 socket machine will hold 1TB of DRAM. However, you should be aware that DRAM modules come in different sizes – but that there is only a limited number of slots available on the mainboard. Because a 32GB module is less than half the price of a 64GB module, it sometimes makes sense to buy a Big Box to get the cheapest possible DRAM.
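One way to get a first approximation of your read/write ratio is the virtual file stats DMV. A sketch – the counters are cumulative since the last restart, so for a specific workload window you would sample twice and take the difference:

```sql
-- Sketch: read/write byte ratio per database since the instance started.
-- Cumulative counters: sample twice and diff for a specific time window.
SELECT DB_NAME(vfs.database_id) AS database_name,
       SUM(vfs.num_of_bytes_read)    AS bytes_read,
       SUM(vfs.num_of_bytes_written) AS bytes_written,
       CAST(SUM(vfs.num_of_bytes_read) AS FLOAT)
         / NULLIF(SUM(vfs.num_of_bytes_written), 0) AS read_write_ratio
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
GROUP BY vfs.database_id
ORDER BY bytes_read DESC;
```

Remember this only counts physical I/O – reads served from the buffer pool never show up here, which is exactly the point of adding DRAM.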

Be aware that hardware vendors typically incentivise their sales people to sell Big Boxes. They do this because the profit margin on a single 4 socket machine is often much more generous than on two 2 socket machines. Because of this, you should take anything a hardware sales person tells you with an extra grain of salt.

Storage and Networking

If you bought a Big Box to remove I/O and replace it with DRAM – you are probably already worried about storage and have now done something to remedy the situation. However, if you bought the box because you need more CPU, you need to start worrying about storage right now.

Big Boxes have an enormous amount of PCIe bandwidth coming into the CPU cores. So much, in fact, that most storage systems have a hard time keeping up with them. Big Boxes eat SANs for breakfast and SSDs before lunchtime, feast on direct attached storage for lunch and finish the day with a light meal of Fusion-io, Micron or Intel PCIe drives. Even then, they still go to bed hungry.

They also like high speed networks (or just a lot of 1Gbit NICs).

To give you an idea about just how much a Big Box will eat, here are some examples of what I have either done or seen done:

  • A single, 4 socket box running a data warehouse will eat all the bandwidth of two dedicated, fully outfitted HDS SANs… or 4 fully outfitted Symmetrix VMAX. This assumes no-one else is using the SAN. If anyone else is there, the greed of the Big Box will make their life unbearably slow.
  • A single, 4 socket box will eat all the bandwidth of 12 high speed Fusion-io drives.
  • A single, 8 socket box will single-handedly consume all the bandwidth (laid out for perfectly sequential throughput) of 32 Direct Attached SAS disk shelves.
  • A single, 2 socket box will drive the transaction log of SQL Server to 1.1GB/sec and require that you sustain an average latency of under 500us (that is 0.5ms)
  • A single, 4 socket machine I tuned ate 4 x 10Gbit Ethernet all on its own.
  • A single, 2 socket machine ate 4 x 1Gbit Ethernet processing cubes and running ROLAP queries.
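To see how close your own box is to the transaction log numbers above, the same virtual file stats DMV will tell you the average write stall per log file. A sketch, again with cumulative counters:

```sql
-- Sketch: average write latency per transaction log file since restart.
-- For the 2 socket example above, you want this well under 0.5ms.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       CAST(vfs.io_stall_write_ms AS FLOAT)
         / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
WHERE mf.type_desc = N'LOG'
ORDER BY avg_write_latency_ms DESC;
```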

Big Boxes are voracious feeders – you need to design I/O systems just for them. If you don’t have enough DRAM to keep that Big Box happy – you are going to need a serious investment in I/O.

What to Expect

If you expect that you can just move your database over to the Big Box and get more performance, I am afraid you may be very disappointed. It does not work that way.

The majority of systems I have worked with moved their database to a bigger system and LOST performance. The system became SLOWER when they put it on more hardware. They insisted on working with the hardware vendor or their IT department to “fix things”. And often, they refused to change any code – because at the price they paid, surely that box must be better? Well, it isn’t…

Why does this happen?

Big Boxes differ from their smaller family members in some very important ways.

Clock Speed / Core: Typically, a large machine has more cores, but fewer clock cycles per core. It adds up to more CPU instructions per second – but unfortunately this does not always mean more speed. In order for a system to take advantage of a lot of CPU cores – it must be multi-threaded and cleverly written. Most queries people write are not. And SQL Server’s execution engine has many bottlenecks that only show themselves at higher core counts.

Memory Latency / Core: Big Boxes have more aggregated memory bandwidth than smaller boxes. But they have less memory bandwidth per individual core and higher cache latency between those cores. Because latency is higher, it takes longer to lock and unlock semaphores – which is what latches and locks are inside SQL Server. Spinlocks spin longer (burning more CPU) and when one core needs data from another core (for example, when doing aggregation) it takes longer to get it. This is where NUMA comes into the picture, because you have to be careful about what memory you actually touch. Unless code is written specifically to take advantage of the aggregated bandwidth in clever ways – the latency ends up being more expensive than the gain in bandwidth. Just in case you were wondering: no, SQL Server 2014 did not do a very good job at this. My friend and Padawan learner Chris Adkin has done a very interesting blog series about this:
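You can watch the spinning happen yourself. A sketch against the spinlock stats DMV – on a big NUMA machine under load, look for spins and backoffs climbing steeply:

```sql
-- Sketch: the busiest spinlocks on this instance (cumulative since restart).
-- On big NUMA boxes, names like LOGCACHE_ACCESS and SOS_CACHESTORE
-- are classic offenders; rising backoffs mean cores are fighting.
SELECT TOP (10)
       name,
       collisions,
       spins,
       spins_per_collision,
       backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```

As with the file stats, sample before and after a workload run and diff the numbers to attribute the spinning to a specific load.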

Patching and configuration: Bigger boxes are not mass produced the way small boxes are. This means that fewer people are dedicated to engineering them and patches don’t happen as fast. It also means that a lot of things are simply unexplored, because so few people run those boxes. Sadly, the available knowledge about a specific box is typically poorly documented – because people who are smart enough to build big boxes tend to be impatient enough to not bother with documentation. Whenever you get a big box, make sure you patch up the firmware with the latest and greatest if you can find it. Also, get yourself a good connection inside the big box vendor to stay up to date and get advice on how to tune it, along with any documentation they can find. Insist on getting a tuning guide for the box. Even then, get used to living on the bleeding edge with a big box.

Queries will Change and so will your Code: All of the above conspires to make big boxes harder to tune. Things break or slow down when you move stuff to them. If you want to take advantage of a Big Box – you need to change your code to work well on it. Even if you end up running away screaming from the lab – this is a worthy exercise, because it teaches you where your real bottlenecks are and when you will run into them.

What to Do about It

All this sounds like you should just stop right here and never buy a Big Box. Very often, this is good advice. What should you buy instead?

The short version is that you should buy:

  • The fastest CPU clock rate you can find
  • The biggest L2/L3 caches
  • The latest and greatest Intel mainboard and CPU
  • The highest clocked DRAM you can find (NOTE: sometimes getting the max speed means putting a little less memory in the box – less is more)
  • SSDs, preferably the stuff that sits right on the PCIe bus
  • 10Gbit Ethernet with Jumbo frame enabled switches (if you run a data warehouse)
  • A dark room with a big, padlocked door where you can lock away your SAN admin while you put direct attached storage on the box.

But sometimes, the above just isn’t enough (for example, you may need to lock away your network admin too). Big Boxes fill an important gap. They CAN help you scale beyond the limits of a small box – if you do it right. They are also getting cheaper and cheaper and will soon be in a server room near you. In the right hands, they can be magnificent beasts – if you know how to tame them. If you tune a Big Box right – it will typically outrun most scale-out designs you can come up with. This makes scale-out programmers run crying back to their high latency clouds when they discover that the performance of their fancy new NoSQL “eventual consistency” system is off by two orders of magnitude. This experience alone makes it all worth it and the business will think you are a hero.

Incidentally, taming big boxes is what the next instalment of this blog series will be all about. Stay tuned everyone, thanks for listening.

  9 Comments

  1. Pete Ads

    Nicely written Thomas; I recognise a couple of those stories 😉

    • Thomas Kejser (Author)

      Hi Pete

      Thanks. Yes, we have indeed seen a few of these together and even caused a few patches to happen 🙂

  2. Brett Baggott

    Under “What To Expect”, Clock Speed / Core section, you say, “but unfortunately this does always mean more speed” when I’m sure you mean “doesn’t”. No, I’m not trying to be the pedantic police. This is just such an awesome blog post that I don’t want that typo to cause anyone else to be pulled out of the moment while stumbling over that. Great post and I’m so glad you decided to turn it into a series. Looking forward to more.

    • Thomas Kejser (Author)

      Hi Brett

      Really good catch, thank you!

      I have corrected the post. Thanks for taking the time to report this. I hope you enjoy the second part of the series.

  3. Chris Adkin

    Thomas,

    Am I correct in thinking that with NUMA architectures there is a coherency overhead when reading into main memory (unless you are using Data Direct I/O, which came in with Sandy Bridge)? Does this overhead go up as you have more sockets? I’m inclined to think it would, but I’m sure you would know, hence my question.

    Chris

    • Thomas Kejser (Author)

      Hi Chris

      I am not sure I follow you here. What do you mean with reading into main memory? Do you mean from the PCI bus?

  4. Chris Adkin

    Thomas,

    1. The more sockets and cores you have – if you genuinely have the workload to utilise this CPU capacity – the more likely it is that you will put pressure on single points of contention: the log cache spinlock, the SOS_CACHE_STORE spinlock, etc. The use of affinity masking to isolate the log writer to core 0, and, as you say, the fastest clocks and largest caches you can get your hands on, help here. I’m guessing that the more sockets you have, the more likely it is for the cache lines associated with these spinlocks to bounce from socket to socket.

    2. For your Padawan comment, for any of my decks you review / contribute towards you will appear at the back Kejser-Yoda, I just need to master photo-shop enough in order to morph your picture into that of Yoda’s 😉

    Chris

    • Thomas Kejser (Author)

      Padawan learner… 🙂

      1) Correct. Not only will you put more pressure on the contention points – it will also be more expensive for each contender to acquire the lock/latch/spin. It’s a double whammy

      2) I am going bald again – so all you need is to catch me on a Sunday morning (where I have the right skin colour) and photoshop on some big ears.

  5. strut

    Thank you for this post (series). You provide information that is available nowhere else.
