When building a data warehouse, you often find yourself needing to move the result of a query into a new table. In SQL Server, there are two ways to do this efficiently:
- SELECT INTO
- INSERT INTO WITH (TABLOCK)
While both techniques allow you to achieve bulk-logged (i.e. fast) inserts, I am going to argue that method 2 is preferable in most situations.
The typical pattern during ETL is this:

DROP TABLE Target
SELECT ... INTO Target FROM SourceQuery
It’s a very convenient way to move data. However, here are the drawbacks:
- SELECT INTO creates a new table, but it does not allow you to fully control the NULL’ability of the columns
- Data types (and to some extent NULL’ability) can be controlled by explicitly casting columns, but this quickly results in messy and hard to read code
- This lack of data type control is particularly problematic when creating tables that need to be partition switched.
- The Target table is always created in the default filegroup.
- This may be good enough for some installations, but as the warehouse grows, you need more control of where things end up.
- While this statement runs, it locks up the row representing Target in the system tables (which ultimately bubbles up to locks on sys.objects).
- Because SSMS (and SMO) does not use NOLOCK hints when reading metadata, the GUI will appear to hang while the statement runs.
- This is going to annoy anyone using the database
- When building a data warehouse, it is quite inconvenient that tables appear and disappear depending on which state the ETL flow is in. It makes it hard to track dependencies
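To make the casting drawback concrete, here is a sketch of what the workaround looks like. The table and column names are made up for illustration; the trick itself is standard: wrapping a column in ISNULL makes SELECT INTO create it as NOT NULL, and an explicit CAST pins the data type.

```sql
-- Hypothetical example: forcing types and nullability in SELECT INTO.
-- ISNULL(...) makes the resulting column NOT NULL;
-- CAST(...) pins the data type. Names are illustrative only.
SELECT
    ISNULL(CAST(s.CustomerKey AS INT), -1)      AS CustomerKey  -- NOT NULL
  , CAST(s.OrderDate AS DATE)                   AS OrderDate    -- stays NULLable
  , ISNULL(CAST(s.Amount AS DECIMAL(18,2)), 0)  AS Amount       -- NOT NULL
INTO Target
FROM SourceQuery AS s;
```

It works, but every single column needs wrapping, which is exactly the messy, hard-to-read code the bullet above complains about.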
There is another way to achieve bulk-logged speeds without resorting to SELECT INTO. Typically, the pattern is this:
TRUNCATE TABLE Target
INSERT INTO Target WITH (TABLOCK)
SELECT ... FROM SourceQuery
If Target is a heap (which it would be if you were using SELECT INTO), this achieves the same thing, but without all of the drawbacks. This even works with clustered indexes under trace flag 610.
Being able to see the target table before inserting into it makes code a lot easier to read, debug and dependency track. It also allows you to create the target table as a partition switch target with all the right constraints and data types.
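Putting the pieces together, a sketch of the full pattern might look like the following. The column definitions, the CHECK constraint, and the filegroup name are illustrative assumptions, not prescriptions; the point is that you, not SELECT INTO, decide all of them up front.

```sql
-- Create the target once, with full control over types, nullability,
-- placement, and constraints (names and filegroup are illustrative).
CREATE TABLE Target
(
    CustomerKey INT           NOT NULL
  , OrderDate   DATE          NULL
  , Amount      DECIMAL(18,2) NOT NULL
  , CONSTRAINT CK_Target_OrderDate   -- makes Target a valid partition switch target
        CHECK (OrderDate >= '20140101' AND OrderDate < '20150101')
) ON [ETL_FG];

-- On each ETL run: empty the table and bulk load it under a table lock.
TRUNCATE TABLE Target;

INSERT INTO Target WITH (TABLOCK)
SELECT CustomerKey, OrderDate, Amount
FROM SourceQuery;
```

Because the table persists between runs, the metadata (and anything that depends on it) is stable, and only the data comes and goes.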
Obviously, this pattern is to be preferred as it gives you much more control over what is going on.
SQL 2014 and Parallelism
It is my understanding that the new (and long overdue) parallel SELECT INTO feature in SQL Server 2014 only works with the SELECT INTO pattern, not with INSERT INTO. Disclaimer: I have not yet had a chance to test this.
If my understanding is correct, and I hope it is not, this is most regrettable and seems like an oversight by the SQL Server team…