Shooting yourself in the foot

BULK INSERT with Linux Line Endings

After all these years, Mac, Windows and Linux still cannot agree on what special characters are used to represent a new line in a text file.

Windows generated text files will typically end a line with CR+LF (0x0D0A). Unix/Linux prefers LF alone (0x0A) and old versions of Macs will use CR alone (0x0D). See this Wikipedia entry for details.

When receiving and loading text files into databases, this causes a mess unless you are careful. Unfortunately, this is particularly true for the BULK INSERT command in SQL Server because the documentation is misleading.

To illustrate, consider this example table

And this input file:

For your convenience, I have provided 3 sample files you can use to demo this:

Let’s start with the simple case, the Windows format (CRLF). Conventionally, CR is escaped as /r and LF is /n. Because of this, you would expect this to work:

But it breaks on the CRLF formatted file, as you can quickly validate by running:

However, this magically does the trick:

In other words, \r is not understood  by SQL Server to be CR. But  \n represents both CR and LF. This is very counter intuitive.

What does one then do when receiving a file generated from Linux that uses only LF? You would expect this to work:

Or maybe this:

But neither does the trick. How do you specify “just LF” as the end of line terminator when bulk loading a CSV file?

Books online has an answer in Example C here: BULK INSERT. This is so ugly that I wont even quote it on this blog.

The solution

I took me a few tries to figure out how to do handle LF terminator files properly with BULK INSERT (it is trivial from SSIS). It turns out that the ROWTERMINATOR argument takes ASCII codes as input – if they are properly formatted. As far as I can see from this: Specify Field and Row Terminators  this feature is undocumented and the documentation is quite misleading with regards to escape codes for special characters.

Here is the trick that works without relying on dynamically executed SQL:

And similarly for old Mac formats:

 

  5Comments

  1. João Victor   •  

    Thanks mate you saved me a lot of trouble! 🙂

  2. Vincent   •  

    Thank you Thomas for the insight. I have an embedded CRLF in a free text field that i want to ignore. The bulk insert column separator is “|” . Will this work as the rowterminator, “**|**\n” , instead of using a format file for a bulk insert? How does bulk insert interpret “**|**\n” anyway?

    • Thomas Kejser   •     Author

      Hi Vincent. You could use a more advanced terminator. But the easiest is probably to use quoted strings to hide the CRFL. BULK INSERT should able to do that.

    • Thomas Kejser   •     Author

      Yes, I found it rather epic. The bulk code is pretty old as far as I remember. Saying that, it has held up surprisingly well over the years. That code path can still munge 3-4GB/sec while loading into SQL Server.

Leave a Reply

Your email address will not be published. Required fields are marked *