Shooting yourself in the foot

SELECT INTO – Moving Data From A to B

When building a data warehouse, you often find yourself needing to move data that is the result of a query into a new table. In SQL Server, there are two ways to do this efficiently:

  1. SELECT INTO
  2. INSERT INTO WITH (TABLOCK)

While both techniques allow you to achieve bulk-logged (i.e. fast) inserts, I am going to argue that method 2 is preferable in most situations.

SELECT INTO

The typical pattern during ETL is this:
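
(A minimal sketch; Source, Target and the column names are illustrative.)

    SELECT s.SomeKey
         , s.SomeValue
    INTO Target    -- Target is created on the fly, as a heap, in the default filegroup
    FROM Source AS s;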

It’s a very convenient way to move data. However, here are the drawbacks:

  • SELECT INTO creates a new table, but it does not give you full control over the NULLability of the columns.
    • Data types (and, to some extent, NULLability) can be controlled by explicitly casting columns, but this quickly results in messy, hard-to-read code (see the sketch after this list).
    • This lack of data type control is particularly problematic when creating tables that need to be partition switched.
  • The Target table is always created in the default filegroup.
    • This may be good enough for some installations, but as the warehouse grows, you need more control of where things end up.
  • While this statement runs, it locks up the row representing Target in the system tables (which ultimately bubbles up to locks on sys.objects).
    • Because SSMS (and SMO) does not use NOLOCK hints when reading metadata, the GUI will appear to hang while the statement runs.
    • This is going to annoy anyone using the database.
  • When building a data warehouse, it is quite inconvenient that tables appear and disappear depending on which state the ETL flow is in. It makes it hard to track dependencies.
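
To see why the casting gets ugly, here is a hypothetical version of the pattern that forces data types and NULLability (wrapping a column in ISNULL is the usual trick for getting a NOT NULL column out of SELECT INTO):

    SELECT CAST(s.CustomerKey AS INT)                    AS CustomerKey  -- stays NULLable
         , ISNULL(CAST(s.OrderDate AS DATE), '19000101') AS OrderDate    -- ISNULL forces NOT NULL
         , ISNULL(CAST(s.Amount AS DECIMAL(19, 4)), 0)   AS Amount       -- ditto
    INTO Target
    FROM Source AS s;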

INSERT INTO

There is another way to achieve bulk-logged speeds without resorting to SELECT INTO. Typically, the pattern is this:
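
(Again a minimal sketch with illustrative names; unlike SELECT INTO, Target must already exist.)

    INSERT INTO Target WITH (TABLOCK)
    SELECT s.SomeKey
         , s.SomeValue
    FROM Source AS s;
    -- Minimally logged when Target is a heap and the database uses the
    -- SIMPLE or BULK_LOGGED recovery model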

If Target is a heap (which it would be if you were doing SELECT INTO), this achieves the same thing, but without all of the drawbacks. It even works with clustered indexes under trace flag 610.

Being able to see the target table before inserting into it makes code a lot easier to read, debug and dependency track. It also allows you to create the target table as a partition switch target with all the right constraints and data types.
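
For example, here is a sketch of a switch target with explicit data types, NULLability, a partition-matching check constraint and an explicit filegroup (all names and boundary values are illustrative):

    CREATE TABLE Target
    (
        CustomerKey INT            NOT NULL
      , OrderDate   DATE           NOT NULL
      , Amount      DECIMAL(19, 4) NOT NULL
      , CONSTRAINT CK_Target_OrderDate
            CHECK (OrderDate >= '20140101' AND OrderDate < '20140201')  -- matches the target partition
    ) ON FG_2014_01;  -- explicit filegroup, not the default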

Obviously, this pattern is to be preferred as it gives you much more control over what is going on.

However…

SQL 2014 and Parallelism

It is my understanding that the new (and long overdue) parallel SELECT INTO feature in SQL Server 2014 only works with the SELECT INTO pattern, not with INSERT INTO. Disclaimer: I have not yet had a chance to test this.

IF my understanding is correct, and I hope it is not, this is most regrettable and seems like an oversight by the SQL Server team…
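
For anyone who wants to test it, here is roughly what I would try (the database name is illustrative, and my understanding is that the feature also requires a database compatibility level of 110 or higher):

    ALTER DATABASE MyWarehouse SET COMPATIBILITY_LEVEL = 120;  -- the SQL Server 2014 level

    SELECT s.SomeKey, s.SomeValue
    INTO Target
    FROM Source AS s;
    -- Inspect the actual execution plan: if the feature kicks in, the
    -- Table Insert operator should run in parallel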

  6 Comments

  1. Phil Grayson

    Correct, parallel inserts only work for SELECT INTO operations and appear to be about 3 times faster with a MAXDOP of less than 8. After that, it's slightly slower than the original single-threaded approach. (Based on testing from 4 months ago.)

    • Thomas Kejser (Author)

      Thanks for confirming, Phil. Bit of an oversight by the SQL team there.

      • Phil Grayson

        Possibly one of the reasons it has been omitted from nearly all of the marketing puff so far and isn't that well known within the Redmond camp either.

        We spoke during your time at Fusion, and since then I've been pestering MS to implement this, but unfortunately high-end servers will actually end up being slower unless configured correctly.

  2. Jason L

    Could be interesting when combined with trace flag 610 if the data is sorted and the destination is clustered, though it's still a single process at a time. I've been moving a lot of data from legacy Oracle systems into SQL and have found that using parallel selects (selects using a range of rownums) and multiple SQLBulkCopy objects with TABLOCK to insert into a heap works well in that situation, providing full utilisation of available resources. My guess is that SQLBulkCopy is using the same INSERT INTO WITH (TABLOCK), so multiple sessions are a way to provide parallelism, if a bit messy.

  3. tobi

    Your observation about 2014 is correct and it cripples the feature to a large extent. Also, I observed very bad scaling when I tried it. Neither disk nor CPU were fully utilized over long periods of time. I had expected that an embarrassingly parallel heap bulk insert would at least saturate the bottleneck resource.

    At least we now have a way to load data and create an index fully parallelized on a single connection: first load to a heap, then create the index. If only it weren't for the scalability problems with SELECT INTO.

    • Thomas Kejser (Author)

      So it looks like the best way is still the old-fashioned "spin up parallelism yourself" approach.
