How to Prepare Your Files
- Prepare one or more .csv or .tsv files
- If you have multiple files, create a ZIP archive containing them
- First row of each file should contain column names
- Upload the .csv/.tsv file or ZIP archive below
CSV (Comma-Separated Values) and other delimited text files are simple, widely used formats for tabular data.
What You Can Upload
- .csv files (comma-separated)
- .tsv files (tab-separated)
- Other delimited text files (semicolon, pipe, etc.)
- ZIP archives with multiple files
What You Get Out
DataMeans extracts your data into multiple modern formats:
| Output | Description |
|---|---|
| csv/{TableName}.csv | One CSV file per table with all row data |
| xlsx/{TableName}.xlsx | Excel workbook per table |
| xls/{TableName}.xls | Legacy Excel format per table |
| json/{TableName}.json | JSON array of records per table |
| json/{TableName}.jsonl | Newline-delimited JSON (streaming-friendly) |
| postgres.sql | PostgreSQL CREATE TABLE + INSERT statements |
| schema/schema-graph.json | Relationship graph for visualization |
| schema/er-model.json | ER model for diagram tools |
| report.json | Structured extraction report |
| report.md | Human-readable extraction summary |
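To make the difference between the two JSON outputs concrete, the following is a minimal sketch of the two shapes using Python's standard library. It only illustrates the formats, not how DataMeans produces them; the sample data and variable names are assumptions, and values stay as strings because no type inference is applied here.

```python
import csv
import io
import json

# Hypothetical two-row table used only for illustration.
csv_text = "id,name\n1,Ana\n2,Ben\n"
records = list(csv.DictReader(io.StringIO(csv_text)))

# .json shape: a single JSON array holding every record.
as_json = json.dumps(records)

# .jsonl shape: one JSON object per line, so a consumer can read row by row.
as_jsonl = "\n".join(json.dumps(record) for record in records)
```

The array form must be parsed as a whole, while the line-delimited form can be processed one line at a time, which is why it suits streaming.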
How to Export / Obtain Files
Most applications support CSV export:
- Excel: File > Save As > CSV; Google Sheets: File > Download > Comma-separated values (.csv)
- Databases: Export to CSV option
- Applications: Look for "Export Data" feature
Supported Features
- Multiple delimiter detection (comma, tab, semicolon, pipe)
- Character encoding detection (UTF-8, Latin1, Windows-1252)
- Automatic header row detection
- Type inference (numbers, dates, text)
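DataMeans' own detection logic is not described on this page. As a rough illustration of what encoding, delimiter, and header detection involve, the sketch below uses `csv.Sniffer` from Python's standard library. The function name `sniff_csv`, the order of candidate encodings, and the sample data are assumptions made for the example.

```python
import csv

def sniff_csv(raw: bytes, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Guess encoding, delimiter, and header presence for a delimited text sample."""
    # Try each candidate encoding in order; the first that decodes cleanly wins.
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the sample")

    sniffer = csv.Sniffer()
    # Restrict the guess to the four delimiters listed above.
    dialect = sniffer.sniff(text, delimiters=",\t;|")
    # has_header() is a heuristic that compares the first row with later rows.
    return enc, dialect.delimiter, sniffer.has_header(text)

sample = "id;name;joined\n1;Ana;2023-12-25\n2;José;2024-01-05\n".encode("utf-8")
encoding, delimiter, has_header = sniff_csv(sample)
```

Both the delimiter guess and the header guess are heuristics, so results on small or irregular samples are worth checking by hand.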
Known Limitations
- No built-in type system (types inferred from data)
- No nested/hierarchical data support
- Quote escaping varies between implementations
- Date format detection may need verification
Best Practices
- Always include a header row
- Use UTF-8 encoding when possible
- Quote fields containing delimiters or newlines
- Use consistent date formats within columns
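The practices above can be sketched with Python's standard `csv` module. This is one possible way to produce a conforming file, not a required workflow; the column names and sample rows are invented for the example.

```python
import csv
import io
from datetime import date

# Hypothetical rows; the second and third need quoting.
rows = [
    {"id": 1, "note": "plain text", "joined": date(2023, 12, 25)},
    {"id": 2, "note": 'contains, comma and "quotes"', "joined": date(2024, 1, 5)},
    {"id": 3, "note": "two\nlines", "joined": date(2024, 2, 29)},
]

def write_csv(rows, stream):
    """Write a header row, then data rows with ISO 8601 dates."""
    writer = csv.DictWriter(
        stream,
        fieldnames=["id", "note", "joined"],
        quoting=csv.QUOTE_MINIMAL,  # quote only the fields that need it
        lineterminator="\r\n",      # CRLF record breaks, as in RFC 4180
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "joined": row["joined"].isoformat()})

buffer = io.StringIO(newline="")
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", encoding="utf-8", newline="")`; passing `newline=""` keeps Python from altering the line endings the writer emits.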
Last updated: January 2026
Overview
CSV (Comma-Separated Values) is a simple text format for tabular data exchange, using delimiters to separate fields and line breaks to separate records. Documented by RFC 4180, which also registers the text/csv MIME type, CSV files are widely used for data import/export between applications due to their simplicity and universal support. While not a database format per se, CSV serves as a common interchange format for relational data, with variations in delimiters, quoting, and escaping across implementations.
History and Background
- 1972: IBM's Fortran (level H extended) compiler under OS/360 supports list-directed input/output with commas between values.
- 1978: ANSI approves FORTRAN 77, which formally standardises list-directed I/O using * as the format specifier; commas or spaces separate values, and quoted character strings may contain commas.
- 1983: The manual for the Osborne Executive computer documents the CSV quoting convention of the bundled SuperCalc spreadsheet.
- 2005: RFC 4180 published through the Internet Engineering Task Force (IETF), documenting the common format and registering the text/csv MIME type.
- 2013: W3C charters the CSV on the Web Working Group on 10 December; the group closes on 29 February 2016 after delivering its recommendations.
- 2014: RFC 7111 defines URI fragment identifiers (row, column, cell) for the text/csv media type.
- 2015: W3C publishes the CSV on the Web recommendations, including the Model for Tabular Data and Metadata on the Web, in December.
- Present: Ubiquitous format supported by virtually all major applications and programming languages.
File Format Specifications
CSV files are plain text with specific structural rules.
Basic Structure:
- Records: Lines separated by line breaks (CRLF per RFC 4180; LF widely accepted in practice)
- Final record: The last record in the file may or may not have an ending line break (RFC 4180)
- Fields: Values within records separated by delimiters (comma by default)
- Headers: Optional first record with column names
- Encoding: UTF-8 common today (RFC 4180 cites US-ASCII as common usage); a BOM is sometimes used to signal Unicode
Delimiters and Quoting:
- Field delimiter: Comma (,), semicolon (;), tab (\t), pipe (|)
- Quote character: Double quote (") for fields containing delimiters, quotes, or line breaks
Field Encapsulation:
- Fields containing commas, quotes, or line breaks must be enclosed in double quotes
- Double quotes within fields are escaped by doubling ("")
- Leading/trailing whitespace preserved unless trimmed by parser
- The W3C tabular data model strips leading and trailing whitespace when parsing cells whose datatype is not a string type
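A short sketch of how these encapsulation rules look to a parser, using Python's standard `csv` reader with its default settings. The sample text is made up for the example.

```python
import csv
import io

# Three records: a header, a row with an embedded comma and doubled quotes,
# and a row whose first field contains a line break.
raw = (
    'name,quote,padded\r\n'
    '"Doe, Jane","She said ""hi""",  spaced  \r\n'
    '"multi\r\nline",plain,x\r\n'
)

# newline="" leaves line endings untouched so the reader can handle
# line breaks inside quoted fields itself.
records = list(csv.reader(io.StringIO(raw, newline="")))
```

With the default settings, the reader returns `Doe, Jane` as one field, turns `""hi""` back into `"hi"`, and keeps the spaces around `spaced`; trimming is left to the caller.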
Variations:
- Delimiter: Tab (\t) for TSV, semicolon (;) in locales where comma is the decimal separator, pipe (|)
- Quote character: Single quotes (') sometimes used
- Escape sequences: Backslash escaping in some implementations
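Each of these variations corresponds to a parser setting. The sketch below shows how they might be handled with Python's standard `csv` module; the sample data is invented, and converting decimal commas with `str.replace` is a simplification that assumes no thousands separators are present.

```python
import csv
import io

# Semicolon-delimited data with decimal commas.
semicolon_text = "product;price\nWidget;12,50\nGadget;3,99\n"
reader = csv.DictReader(io.StringIO(semicolon_text), delimiter=";")
prices = {row["product"]: float(row["price"].replace(",", ".")) for row in reader}

# Backslash-escaped quotes instead of doubled quotes.
escaped_text = 'id,remark\n1,"say \\"hi\\""\n'
escaped_rows = list(
    csv.reader(io.StringIO(escaped_text), escapechar="\\", doublequote=False)
)

# Single quotes as the quote character.
single_text = "id,remark\n2,'a, b'\n"
single_rows = list(csv.reader(io.StringIO(single_text), quotechar="'"))
```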
Key Specifications:
- Maximum field length: Unlimited (implementation-dependent)
- Maximum record length: Unlimited
- Maximum file size: Limited by filesystem and application
- Character encoding: UTF-8 recommended, but varies; the W3C tabular data model further advises Unicode Normalization Form C
- MIME type parameters: charset selects the character set; header takes the values present or absent (RFC 4180); when charset is absent, UTF-8 should be assumed (per the IANA text/csv registration, last updated 2014-01-17)
- CSVW datatype aliases: number maps to xsd:double, binary maps to xsd:base64Binary, datetime maps to xsd:dateTime, and any maps to xsd:anyAtomicType (W3C CSVW Metadata Vocabulary)
Data Types and Structures
CSV has no native data types; every field is stored as text, and the importing application decides how to interpret it.
| Interpreted Type | Example | Notes |
|---|---|---|
| String/Text | "Hello World" | Default type for all fields |
| Integer | 123 | Parsed by applications |
| Float/Decimal | 123.45 | Locale-dependent decimal separator |
| Date | 2023-12-25 | Various formats (ISO, US, European) |
| Time | 14:30:00 | xsd:time under the W3C tabular data model |
| Datetime | 2023-12-25T14:30:00 | xsd:dateTime under the W3C tabular data model |
| Duration | PT2H30M | xsd:duration under the W3C tabular data model |
| Boolean | true, false, 1, 0 | Application-specific |
| Null/Empty | "" or missing | Empty fields represent null values |
| anyURI | https://example.com | xsd:anyURI under the W3C CSVW metadata vocabulary |
| Binary (hex) | 48656C6C6F | xsd:hexBinary under the W3C CSVW metadata vocabulary |
| Binary (base64) | SGVsbG8= | xsd:base64Binary; aliased as binary in the W3C CSVW metadata vocabulary |
File Structure:
- Optional header row with column names
- Data rows with equal number of fields
- Trailing empty lines are typically ignored by parsers
- Comments not standardized (some use # prefix)
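Because typing is left to the importer, each application needs some rule for turning field text into values. The following is a minimal sketch of one such rule in Python; it is not the inference DataMeans performs, and the function name `infer_value` and the order of the checks are assumptions. It treats only `true` and `false` as booleans, leaving `1` and `0` as integers.

```python
from datetime import date

def infer_value(text: str):
    """Convert one CSV field to None, bool, int, float, or date; else keep the text."""
    stripped = text.strip()
    if stripped == "":
        return None  # empty field treated as null
    if stripped.lower() in ("true", "false"):
        return stripped.lower() == "true"
    # Try the narrowest numeric type first, then ISO 8601 dates.
    for convert in (int, float, date.fromisoformat):
        try:
            return convert(stripped)
        except ValueError:
            continue
    return text  # anything unrecognised stays a string

typed_row = [infer_value(field) for field in ["123", "123.45", "2023-12-25", "", "Hello World"]]
```

A production importer would need stricter rules than this, for example locale-aware decimal separators and an explicit list of accepted date formats.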
Version Differences
| Variant | Year | Key Characteristics | Compatibility |
|---|---|---|---|
| RFC 4180 | 2005 | Comma delimiter, CRLF record breaks, optional header line, quotes escaped by doubling | Registers the text/csv MIME type with IANA |
| TSV (tab-separated values) | 1993 | Tab delimiter; header line required; tabs not allowed within fields, so no quoting mechanism | IANA media type text/tab-separated-values |
| RFC 7111 | 2014 | URI fragment identifiers for rows, columns, and cells (row=, col=, cell=) | Updates the text/csv media type registration of RFC 4180 |
| W3C CSV on the Web | 2015 | JSON metadata describing dialect, column datatypes, and annotations for CSV tables | W3C Recommendation (17 December 2015); parsing model based on RFC 4180 |
| CSV Schema (The National Archives, UK) | 2014 | Textual language defining the structure, datatypes, and validation rules for CSV data; version 1.0 published 28 August 2014, revised as version 1.1 | No format change — external schema used to validate .csv files |
Compatibility Notes:
- No formal versioning - dialects vary by application
- Character encoding differences cause issues (UTF-8 vs ANSI)
- Line ending differences (CRLF vs LF) between Windows/Unix
- Quote handling varies (required vs optional)
- Field count mismatches are a common error
- Spreadsheet grid limits constrain imports: an Excel worksheet holds at most 1,048,576 rows by 16,384 columns, with up to 32,767 characters per cell
- CSV injection (formula injection): fields beginning with =, +, -, @, tab, carriage return, or line feed may be interpreted as formulas by spreadsheet software
- TSV constraint: the IANA text/tab-separated-values registration explicitly prohibits tab characters within field values, so TSV has no quoting mechanism for tabs; CSV is preferred when field data may contain tab characters
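One common mitigation for CSV injection is to prefix risky fields with a single quote before writing them, so that spreadsheet software treats them as text. The sketch below shows that approach in Python; the helper name `neutralize` and the sample row are assumptions, and the technique is a general precaution rather than something this page says DataMeans applies.

```python
import csv
import io

# Leading characters that spreadsheet software may treat as the start of a formula.
FORMULA_TRIGGERS = ("=", "+", "-", "@", "\t", "\r", "\n")

def neutralize(value: str) -> str:
    """Prefix a single quote to fields that start with a formula trigger."""
    if value.startswith(FORMULA_TRIGGERS):
        return "'" + value
    return value

row = ["Alice", '=HYPERLINK("http://example.invalid")', "-5", "safe"]
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([neutralize(value) for value in row])
safe_line = buffer.getvalue()
```

The prefix also changes legitimate values such as negative numbers (`-5` becomes `'-5`), so it fits best on exports meant for viewing in a spreadsheet rather than on files that other programs will parse.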
Technical References
- RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files
- RFC 7111: URI Fragment Identifiers for the text/csv Media Type
- W3C: Model for Tabular Data and Metadata on the Web
- IANA: text/tab-separated-values Media Type Registration
- Wikipedia: Comma-separated values
To learn how to use this format with DataMeans, see the User Guide.