As many of you have noticed, I have been flirting a lot with open source databases lately. I am no longer spending much time going to SQL Server conferences.
About two years ago, I decided that it was time to diversify my knowledge and spend more time with other database products. Back then, this was largely based on a hunch that Open Source was getting to the point where it is worth looking into. I didn’t really have any data to back up my decision to lower my investment in SQL Server – it was mostly intuition, a feeling for the zeitgeist.
Today, I wanted to assess where things are. It is difficult to get information about the popularity of databases that is not heavily biased, so I decided to mash up some data myself.
The Data Source
These days, one of the main sources of information about databases is dba.stackexchange.com (referred to as SE from now on). All the cool SQL Server guys hang out there (and some of the not so cool ones, including myself). Questions about MySQL get answered quickly, and Postgres questions are frequent now. Fortunately, it is possible to download all this SE data in an easy to read XML file from here:
Based on the information in there, we can analyse things like:
- How many questions get asked about each database?
- What activity happens on those questions?
Admittedly, answering these questions is a rather crude way to measure the popularity of databases – so take this with a grain of salt. For example, a lot of SQL Server knowledge flows on forums on the web and never touches SE. Nevertheless, when novices ask questions in a known forum of experts where they can expect an answer, I think the number of questions asked and the activity around them is a pretty reasonable indicator of interest. It is almost certainly less biased than some sponsored report by one of the big analyst agencies (and I use the term “analyst” lightly).
Based on the XML data, you can quickly mock up an analysis. For those of you interested in reproducing or extending the analysis, I will add the script in the comments.
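To give an idea of what such a mock-up could look like, here is a minimal Python sketch. It is not the script mentioned above, only an illustration under a few assumptions: that the dump contains a Posts.xml file whose `row` elements carry `PostTypeId` ("1" for questions), `CreationDate` and `Tags` attributes, and that tags are stored either as `<mysql><innodb>` or as `|mysql|innodb|`. The `FAMILIES` mapping and the function names are hypothetical and only there for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mapping from tag prefix to database family; adjust to taste.
FAMILIES = {
    "sql-server": "SQL Server",
    "ssas": "SQL Server",   # Analysis Services counted under SQL Server
    "mysql": "MySQL",
    "postgresql": "Postgres",
    "oracle": "Oracle",
}

def split_tags(raw):
    """Split a Tags attribute such as '<mysql><innodb>' or '|mysql|innodb|'."""
    return [t for t in re.split(r"[<>|]+", raw or "") if t]

def families_for(tags):
    """Return the set of database families that a question's tags point to."""
    found = set()
    for tag in tags:
        for prefix, family in FAMILIES.items():
            # Match 'mysql' itself and versioned tags such as 'mysql-5.5'.
            if tag == prefix or tag.startswith(prefix + "-"):
                found.add(family)
    return found

def questions_per_month(source):
    """Count questions per (YYYY-MM, family) from a Posts.xml path or file object."""
    counts = Counter()
    for _, row in ET.iterparse(source):
        if row.tag == "row" and row.get("PostTypeId") == "1":
            month = row.get("CreationDate", "")[:7]
            for family in families_for(split_tags(row.get("Tags"))):
                counts[(month, family)] += 1
        row.clear()  # drop attributes we no longer need
    return counts
```

Calling `questions_per_month("Posts.xml")` would then give a counter keyed by month and database, which can be summed for an all-time total or plotted per month. A question tagged with two databases is counted once for each of them.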
Here follows the analysis for the period from January 2011 (when SE took off) until April 2014.
First, I analysed how many questions have been asked about each database to date. This is to get an idea of the built-up knowledge SE stores:
(note: SQL Server questions also include Analysis Services)
Clearly, SQL Server is in the lead here. Historically, more questions have been asked about it than about any other database. We can also see that MySQL is the most popular of the Open Source databases. NoSQL and others barely show up in the data, though I suspect NoSQL users often look in places other than SE for answers, so this may be an unreliable indicator.
So far, it looks like my intuition is wrong. SQL Server is the dominant database out there.
But not so fast. Let us instead look at the number of questions asked per month:
The picture is less clear now. MySQL and Postgres are clearly gaining in popularity, while SQL Server seems stagnant.
Questions are not the only indicator of community engagement. Comments also indicate interest. The SE data dump provides those too:
Trends here are a little less clear. A stacked plot shows it better:
SQL Server is holding its lead position on engagement in the SE comment section. Seeing how much activity there is on still controversial subjects (like heaps vs. clusters and GUID vs. IDENTITY) – this is not surprising.
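For readers who want to reproduce comment counts like these, here is a rough Python sketch. It assumes the dump also ships a Comments.xml file whose `row` elements carry `PostId` and `CreationDate`, and that answers in Posts.xml point at their question through `ParentId`. The `question_label` dictionary (question Id to database name) is a hypothetical input you would build from the question tags; none of the function names come from the dump itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def question_of_post(posts_source):
    """Map every question and answer Id to the Id of the question it belongs to."""
    owner = {}
    for _, row in ET.iterparse(posts_source):
        if row.tag == "row":
            if row.get("PostTypeId") == "1":      # question owns itself
                owner[row.get("Id")] = row.get("Id")
            elif row.get("PostTypeId") == "2":    # answer belongs to its parent
                owner[row.get("Id")] = row.get("ParentId")
        row.clear()
    return owner

def comments_per_month(comments_source, owner, question_label):
    """Count comments per (YYYY-MM, label).

    owner maps post Id -> question Id; question_label maps question Id -> label.
    Comments on posts without a label are skipped.
    """
    counts = Counter()
    for _, row in ET.iterparse(comments_source):
        if row.tag == "row":
            label = question_label.get(owner.get(row.get("PostId")))
            if label is not None:
                counts[(row.get("CreationDate", "")[:7], label)] += 1
        row.clear()
    return counts
```

Comments on answers are attributed to the database of the question they sit under, which seems the most natural choice when the goal is to measure engagement per database.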
Continuing the analysis, we can have a look at how good the community is at providing answers to questions. Here are the answers per month:
The strong SQL Server community is clearly seeing some challengers from MySQL and Postgres. There is a clear drop-off in answer activity recently on SQL Server.
And here is the average time it takes to get an answer:
Looks like those Oracle people haven’t fallen asleep over their Exadata machines just yet…
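A time-to-answer figure of this kind can be approximated from the same dump. The Python sketch below is one possible way to do it, not necessarily how the numbers above were produced. It assumes Posts.xml rows with `Id`, `PostTypeId` ("1" question, "2" answer), `ParentId` and a `CreationDate` formatted like `2011-01-04T10:00:00.000`; the function names are made up for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

def parse_ts(text):
    """Parse a dump timestamp such as '2011-01-04T10:00:00.000'."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")

def hours_to_first_answer(source):
    """Return {question Id: hours until its earliest answer} from Posts.xml."""
    asked, first_answer = {}, {}
    for _, row in ET.iterparse(source):
        if row.tag == "row":
            kind = row.get("PostTypeId")
            if kind == "1":
                asked[row.get("Id")] = parse_ts(row.get("CreationDate"))
            elif kind == "2":
                parent = row.get("ParentId")
                created = parse_ts(row.get("CreationDate"))
                # Keep only the earliest answer per question.
                if parent not in first_answer or created < first_answer[parent]:
                    first_answer[parent] = created
        row.clear()
    return {
        qid: (first_answer[qid] - asked[qid]).total_seconds() / 3600
        for qid in asked
        if qid in first_answer
    }

def average(values):
    """Arithmetic mean, or None when there is nothing to average."""
    values = list(values)
    return sum(values) / len(values) if values else None
```

Unanswered questions are left out of the average here, which flatters every database a little. Splitting the result per database would additionally require reading each question's `Tags` attribute.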
I have previously stated that I think SQL Server is losing mindshare in the database community. From this simple analysis of SE data, it looks like there is indeed some traction in the open source community that can pose a serious threat. MySQL is going strong, and Microsoft would be wise to fear it.
Competition is good for us consumers of databases and I am happy to see that the open source community is finally catching up. I am also happy to have spent more time with open source databases to see it happen from the inside.
Looking at the crystal ball, I can see the market for relational databases change dramatically if the players modify the way they operate. Here is my not so humble advice to the communities and corporations.
Microsoft
Start listening to your high end customers who know what they are doing. They can lead the way to the features the small customers will love. Stop wasting time developing stupid features that no-one who knows what they are doing needs (ex: Everything “Beyond relational”, Tabular models, 10 different ways of writing reports). SQL Server 2014 is finally showing some signs of fixing bugs that are 5-10 years old. Taking that amount of time to fix stuff is simply not acceptable and you are losing your customer base because of it. Stop closing Connect bugs as “by design” when you can’t be bothered to fix them, because it is immensely disrespectful of the people who really DO know better and who think that if this is your “design” – you don’t. Fix whatever is wrong with your development organisation to speed up bug fix rates.
Management Studio does not need to be released in band with the core engine. This application is what makes SQL Server the “iPhone of Databases” and it needs to work flawlessly (it doesn’t).
SQL Server was positioned to completely rule the database market – this eminently defensible position is being squandered and a multi-billion dollar business is going down the drain. Wake up Microsoft, the enemy is at the gates!
Stop innovating, start fixing!
MySQL
You need to come to terms with the fact that your toy database, even though it runs Facebook, still does not hold a candle to Oracle and SQL Server in the Enterprise. Web 2.0 is not going to make big, corporate IT go away. Who do you think funds all those ads and buys all the data that powers Google and Facebook? If you want respect in the corporate market you need to change the tune you play.
Manageability (Thanks Percona), query parallelism that just works “out of the box” and a good optimiser that can do all join types are absolutely crucial. Single threaded index build is simply embarrassing – you should know how to do parallel sort.
Backup/restore is too bloody complex – you need a simpler, unified solution (see SQL Server). You need mirrors, cluster and other HA features to simply work and be part of the core release (again, see SQL Server).
The features and performance optimisations you are building into the core product have been around in SQL Server since 1998. Claiming you can do what it can do simply shows you have not studied SQL Server very well (see previous comment on toy databases and changing your tune). All the forks (Maria, Percona, Oracle, Toku etc.) are not helping you. They make running MySQL and keeping up with releases and fixes a nightmare, and they slow everyone’s development speed down. You are at least 10 years behind SQL Server and Oracle – so you need all the development speed you can get and an army unified under a single goal. Pool your limited development resources and go take the battle to SQL Server properly.
Postgres
Being ANSI compliant and having a beautiful optimiser (great work by the way!) is not the only thing that matters in the database world. You need to get your basic performance act together. Doing kernel calls into the OS all the time and trusting that the Linux/FreeBSD scheduler, memory manager and I/O system will do the right thing “because Linux is Awesome” is not good enough. SQL Server switched to a user space scheduler and careful memory management in version 7.0 (released 1998) – not because Windows sucked (it did, about as much as Linux does today) but because the first thing you realise when coding for high performance is that the OS gets in your way more than it helps out. This is not a contested insight or a matter of opinion – it is the way performance is done. Go through every single malloc, spinlock, mutex and I/O path and take the optimiser’s mindset to it. Build scalable data structures and streamline the code base. Snap out of the academic mindset and take performance seriously. This may mean you have to take bug fixes from a very wide array of people and find new allies. When you can run TPC tests on 2 socket machines with performance at parity with SQL Server and Oracle, you are getting somewhere. Your failure to run TPC well is not due to TPC being an evil test (it is) and Microsoft and Oracle tuning specifically for it (they do) – it is mostly because your data structures came out of a computer science 101 class and are not state of the art. Forks have been done that prove that great performance is possible with Postgres – it can be done and there are some very smart people in the Postgres community.
No matter how hard you try, the optimiser will never be perfect because it heuristically tries to solve an NP-complete problem. You need to come to terms with this fact: You need query hints in the core release. This hint/no-hint discussion has gone on in Oracle and SQL Server for 20 years now – the conclusion is clear: Implement hints because the ones who are smart enough to use them right are the people you want on your side. Yes, people will shoot themselves in the foot with them – that is the price you pay for power.
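To get a feeling for why an optimiser has to rely on heuristics, it helps to count how many join orders exist for a query. The small Python sketch below uses the standard textbook counts: n! left-deep orders and (2n-2)!/(n-1)! ordered bushy trees for n tables. It ignores the choice of join algorithm and any pruning of cross products, so treat it as an illustration of the growth rate rather than a model of any particular optimiser.

```python
from math import factorial

def left_deep_orders(n):
    """Number of left-deep join orders for n tables: n!"""
    return factorial(n)

def bushy_trees(n):
    """Number of ordered bushy join trees for n tables: (2n-2)! / (n-1)!"""
    return factorial(2 * n - 2) // factorial(n - 1)

# Print how the search space grows with the number of joined tables.
for n in (2, 5, 10, 15):
    print(n, left_deep_orders(n), bushy_trees(n))
```

Already at ten tables there are millions of left-deep orders alone, so exhaustive search stops being practical long before queries get large.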