CSV File Format

CSVIconThis document describes a standard CSV file format for exchanging tabular data using text files.

Header Row

The first row of the file must contain a header with the name of each column. the header follows the conventions below and the title of each column is considered a string and hence must be quoted.

The header row also determines the number of columns for every row in the reminder of the file. Should a row contain fewer or more columns than in the header – this is considered an error in the file.

Column Delimiter

Columns should be delimited by a pipe (| – 0x7C). This character is chosen because it is unlikely to occur in plain text and it makes it easy for a human to read the format and distinguish columns from each other.

Row Delimiter

Rows must be delimited by the Line Feed (LF – 0x0A) character. Note that this is not the default on Windows systems and old Macintosh machines. The reason LF is chosen is that this makes the files smaller (as compared with the Windows standard of CR+LF)

Data Type Representation

The subsections here describe details on how to represent common data types.

Representing Strings

String values should be quoted using double quotes (” – 0x22). For example, the following is correct:

“Year”|”Country”|”Value”

2010|”SE”|42

2011|”SE”|43

2010|”DK”|7

2011|”DK”|7

String Escape Sequences
If data contains quotes, LF or other reserved characters in text strings, the data itself should be escaped like this:

Character ASCII Description Escape Sequence
0x22 Double Quotes
| 0x7C Pipe |
LF 0x0A Line Feed n
0x5C Backslash \

Representing Numbers

Decimals should be separated with period (. – 0x2E). Thousand separators should not be used. Zero padding should not be used either.

Representing Date and Time

Date and time should be represented using the ISO 8601 and RFC 3339 format. For reference, the format is:

YYYY-MM-DDThh:mm:ss

Unless specified otherwise, times are assumed to be in UTC. Dates should use the Gregorian Calendar. Year should always include at least four digits.

See:

  • ISO 8601 WikiPedia entry on ISO 8601 (Since the ISO standard itself is behind a paywall)
  • RFC 3339 Describes time zone and year offsets
  • Gregorian Calendar A detailed description of the problems arising with dates before 1582.

Representing Countries

Data about countries should use the ISO-3166-1 alpha-2 country codes instead of the country name.

See:

  • ISO-3166-1 WikiPedia entry on ISO 3166 (Since the ISO standard itself is behind a paywall)
  • ISO Countries CSV file with all ISO-3166-1 countries, formatted like described on this page.

Representing Currencies

Currencies should be represented using the ISO-4217 currency codes. Bit coins can be represented using the code BTC.

See:

Encoding

The entire file should be encoded with UTF-8. This is the most compact format that allows representation of unicode while remaining reasonably compatible with ASCII.

MIME Type

The MIME type of CSV files is text/csv.

See:

  • RFC 2048 – Details of MIME types
  • RFC 4180 – Defines that text/csv is the MIME type of CSV

  5Comments

  1. Dave Poole   •  

    There are a few useful standards
    IATA – Internaltion Air Transport Association (airline and airport codes)
    ICAO – International Civil Aviation Organisation (airline and airport codes)
    ACRISS Association of Car Rental Industry Systems Standards
    ISO5218 – Gender (0 = unknown, 1 = Male, 2 = Female, 9 = Not appliable, in other words a company which is a legal person)
    ISO639 – Languages
    SIC codes, NAIC codes.

    The problem I’ve hit trying to produce a unified set of DW dimensions is that a commonly used term has a lot of similar but not quite the same meanings and the reference sets don’t really have a simple mapping.

    Take cars as an example. Connecting the following is a nightmare.
    1. Cars for the purpose of renting a car
    2. Cars for the purpose of insuring a car
    3. Cars for the purpose of valuing a car.

    • Thomas Kejser   •     Author

      Hi Dave

      Thanks for those pointers. I will have a look at them. You are indeed right that it is often hard to consolidate customer data with standard. However, in the case of currencies, countries and languages – I think we are pretty safe 🙂

      The airport codes might come in quite handy too.

  2. Niels Brinch   •  

    I would love one standard for CSV files so thats a great idea. If we couldnhave one standard then it might as well be this one. However, of the last 30 CSV formats I’ve had to work with, none of them used pipe as delimiter. Most used semicolon.

    • Thomas Kejser   •     Author

      Hi Niels

      You are indeed right that there are several different variants of what a “CSV file” is. The reason I picked the pipe delimiter is:

      – This is the format that the dbgen utilities for TPC spits out
      – It’s unlikely to find a pipe in a text string, so even if you “forget” to quote the string, things should work fine
      – Pipe separated is visually very easy to read by a human too if the columns are of similar width

      I suppose the counter argument is that excel uses semi-colon as its default export format.

Leave a Reply

Your email address will not be published. Required fields are marked *