CSV requirements
In general, we recommend the usage of PARQUET files, as these are compressed in size as well as contain properly
typed data types. But data can be certainly also provided in CSV format, either uncompressed or compressed as .gz,
given they adhere to the following rules. They must be encoded in UTF-8, use commas (,), semicolons (;) or tab (\t) as
column separators, and start with a single header line, containing the column names.
1. Header row
- The first row must contain the column names.
- Each column name in a table must be unique.
2. Rows
- Each row in the file must contain the same number of cells.
3. Alphanumeric entries (text, categories, strings)
- Entries containing line breaks, and spaces at the beginning or end, must be quoted with double-quotes.
“this is, one column”
“this is \n two lines”
“ space at the beginning and end “- double quotes in entries must be escaped with double quotes itself
“this does contain “”quoted text”””4. Datetime values
- must be encoded in one of the formats below
- missing values must appear as empty strings
| Format | Example | |
|---|---|---|
| Date | yyyy-MM-dd | 2020-02-08 |
| Datetime with hours | yyyy-MM-dd HHyyyy-MM-ddTHHyyyy-MM-ddTHHZ | 2020-02-08 092020-02-08T092020-02-08T09Z |
| Datetime with minutes | yyyy-MM-dd HH:mmyyyy-MM-ddTHH:mmyyyy-MM-ddTHH:mmZ | 2020-02-08 09:302020-02-08T09:302020-02-08T09:30Z |
| Datetime with seconds | yyyy-MM-dd HH:mm:ssyyyy-MM-ddTHH:mm:ssyyyy-MM-ddTHH:mm:ssZ | 2020-02-08 09:30:262020-02-08T09:30:262020-02-08T09:30:26Z |
| Datetime with milliseconds | yyyy-MM-dd HH:mm:ss.SSSyyyy-MM-ddTHH:mm:ss.SSSyyyy-MM-ddTHH:mm:ss.SSSZ | 2020-02-08 09:30:26.1232020-02-08T09:30:26.1232020-02-08T09:30:26.123Z |
5. Numerical values
- must have a
.as decimal separator - must not have a thousands separator
- must have missing values encoded as empty strings