Windows CSV File Format
This document describes a standard Windows CSV file format for exchanging tabular data on the Windows platform and between Windows applications like Excel, SQL Server and Access. Because Windows typically uses UTF-16 and has poor support for UTF-8, it is often desirable (though less efficient) to exchange data in this format instead of the one specified in: CSV File Format.
Header Row
The first row of the file must contain a header with the name of each column. the header follows the conventions below for strings.
The header row also determines the number of columns for every row in the reminder of the file. Should a row contain fewer or more columns than in the header – this is considered an error in the file.
Column Delimiter
Columns should be delimited by a tab (, – 0x09). This character is chosen because it is the standard choice when Excel saves in the .txt file format (Save option: UTF-16 Unicode Text).
Row Delimiter
Rows must be delimited by the Carriage Return (CR – 0x0D) character followed the Line Feed (LF – 0x0A) character. Note that this is not the default on Windows systems and old Macintosh machines and on Unix and Linux. Files originating from those operating system must typically be converted before importing to Windows. (HELP WANTED: If anyone has written a utility to convert, I would happily link it here)
Data Type Representation
The subsections here describe details on how to represent common data types.
Representing Strings
String values should be quoted using double quotes (” – 0x22) when they contain either the comma, CR+LF or quote character. For example, the following is correct:
1 2 3 4 5 6 7 8 9 |
Year Country Value Description 2010 SE 42 "This sentence has a tab take care and quote it" 2011 UK 43 "It is said that ""there is nowt so queer as folk""" 2010 DK 7 This is just a normal string 2011 DK 7 NoNeedToRemoveSpaceInStrings |
String Escape Sequences
If data contains quotes, CR, LF or other reserved characters in text strings, the data itself should be formatted like this:
Character | ASCII | Description | Escape Sequence |
---|---|---|---|
“ | 0x22 | Double Quotes | “” |
LF | 0x0A | Line Feed | Quote the string that contains it |
CR | 0x0D | Carriage Return | Quote the string that contains it |
TAB | 0x09 | Tab | Quote the string that contains it |
The rules for quoting special characters apply to the header file too.
Representing Numbers
Decimals should be separated with period (. – 0x2E) not with comma. Thousand separators should not be used. Zero padding should not be used either.
Representing Date and Time
Date and time should be represented using the ISO 8601 and RFC 3339 format. For reference, the format is:
Unless specified otherwise, times are assumed to be in UTC. Dates should use the Gregorian Calendar. Year should always include at least four digits.
See:
- ISO 8601 WikiPedia entry on ISO 8601 (Since the ISO standard itself is behind a paywall)
- RFC 3339 Describes time zone and year offsets
- Gregorian Calendar A detailed description of the problems arising with dates before 1582.
Representing Countries
Data about countries should use the ISO-3166-1 alpha-2 country codes instead of the country name.
See:
- ISO-3166-1 WikiPedia entry on ISO 3166 (Since the ISO standard itself is behind a paywall)
- ISO Countries CSV file with all ISO-3166-1 countries, formatted like described on this page.
Representing Currencies
Currencies should be represented using the ISO-4217 currency codes. Bit coins can be represented using the code BTC.
See:
- ISO-4217 on WikiPedia
Encoding
The entire file should be encoded with UTF-16. While this is not the most compact format, it is the format understood by nearly all Windows applications
MIME Type and Extension
The MIME type of CSV files is text/csv. Excel traditionally uses the extension .txt for tab separated, unicode UTF-16 values.
See: