There seems to be a myth perpetuated out there in the database community that UPDATE statements are somehow “bad” and should be avoided in data warehouses.
Let us have a look at the facts for a moment and weigh up if this myth has any merit.
Transaction Logged Data
In traditional, relational, ACID property, rows stored in pages, relational databases we typically distinguish between two types of DML operations from a transaction logging perspective.
- Row Logged
- Allocation Logged
Row logged operations will write a transaction log entry every time a row/tuple is modified. This means that the amount of transaction log traffic generated is proportional to the number of rows touched.
Allocation Logged (called: “Minimal logging” in SQL Server) operations only write the physical allocations to the transaction log, if at all. This means the log traffic (if any) is proportional to the size of the data touched. This typically generates at least an order of magnitude fewer log entries than row logged, and is thus faster… Or is it? Read on…
Typically, ACID databases only allow bulk style loads, index builds and large table/partition truncations and drops to be allocation logged. UPDATE and DELETE statements tend to be fully row logged.
Myth: This difference in allocation structures makes UPDATE “bad” because the transaction log is the bottleneck.
Reality: The picture a quite a bit more nuanced than that.
First of all, INSERT in bulk mode is not always allocation logged. Typically a lot of conditions have to be true for allocation logging to work. For those of you interested in SQL Server, I have written about this extensively here: The Data Loading Performance Guide.
Second, the transaction log bottleneck ”wall” is widely exaggerated. I have personally driven 750MB/sec write log traffic into a single database in SQL Server using a FusionIO card. I have seen colleagues do 120MB/sec with traditional, 15K spindles. True: I have also driven 3GB/sec (around 10TB in an hour) using allocation logged INSERT, which is faster than its row logged sibling. However, you have to ask yourself the question: Do you need to go that fast in a single database? In an MPP system you will also have several log files that work together to provide linear scale of log traffic with no relevant roof.
Third, it is perfectly conceivable that you are running a warehouse database that does NOT need to serialize DML operations or may even write entire blocks directly to the database instead of the transaction log. Such systems simply do not have the above bottleneck. For example, Neteeza uses a “logless” implementation and so does many noSQL database systems and HADOOP/HIVE. MySQL with the MyISAM engine also allows non logged operations.
INSERT/DELETE vs. in-place UPDATE
In an traditional, relational database an update can be implemented either as a transactionally wrapped INSERT of the new data followed by a DELETE of the old data (or the other way around).
Myth: Because UPDATE is an INSERT and a DELETE this can double the amount of data that needs to be written during an UPDATE operation and will make me hit the transaction log wall even faster.
Reality: If you are not modifying the keys in an index or making the column size wider, UPDATE statements can be executed as in-place modifications of the row. This allows the database to only write a special old/new value into the transaction log. This optimization can even be applied on a column by column basis, further reducing the transaction log footprint. This leads us to:
UPDATE vs. INSERT speed
This myth is an amalgam of the above arguments.
Myth: INSERT of row logged data is faster than UPDATE of row logged data
Reality: Let us first settle one thing which I will type on its own line and in red to make it easy to remember:
An in-place, row logged, UPDATE operation on a non compressed page is faster than doing a row logged INSERT of the same data.
Why is this? Because the INSERT operation has to allocate new physical structures in the index, while the UPDATE can simply reuse database pages without having to allocate more space.
And here are the numbers to prove it where I am running INSERT vs. UPDATE of large dataset in SQL Server (smaller is faster):
True: if your UPDATE statement has to do the INSERT/DELETE trick, it will likely be slower. But if you are NOT changing the row size and you get the in-place UPDATE, it might just be FASTER to run an UPDATE than an insert.
Also true: compression can change the game quite significantly depending on the compression algorithm you use for the table structure. This is again implementation dependent.
The myth about UPDATE statements being bad or slow is too simple a way to look at this crucial DML operation. In fact, the myth is outright false in some cases.
Avoiding UPDATE statements should not generally be a major driver for design guidance, or used as the basis for drawing any conclusions about the data modeling techniques you should apply. There picture is quite a bit more nuanced than this.
First of all, the speed of UPDATE statement as compared to bulk inserts will depend on the database engine you run on.
Second, we have seen that even when UPDATE is fully row logged, the transaction log “wall” is very far away on proper hardware and not much of a concern to 99% of all the installations out there. There are of course cases where you will hit the “wall”, but those are largely mitigated in MPP systems or other sharded deployments that have more than one transaction log.
Third, there are cases where UPDATE statements are actually faster than (row logged) INSERT statements. These typically occur when you change columns in such a way that they don’t grow larger than they already are, allowing in-place UPDATE operations. An interesting and highly relevant example of such a case is UPDATE statements that target fact table keys – which (if you follow my guidance) are integers and therefore have constant width.
Bonus Exercise and chance to win (for SQL Server people): Here is an interesting experiment. In theory, it might be possible to create a workload where you UPDATE a very wide table and where the equivalent, minimally logged, super optimized “copy to new table” BULK INSERT or SELECT INTO statement is actually slower than the UPDATE. Where is the crossover point on table width? I will offer a free, 1 hour teaching session. Database subject of your choice, you host me, I bring the coffee in the London City area to the first person who can provide a test script and the data to show this crossover point. Alternatively the price can also be claimed if you can conclusively prove that the crossover point does not exist.
More myth busting on the: DW and Big Data page