Why isn’t there a decent file format for tabular data?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • csvz

    The hot new standard in open databases

  • I put some work into creating a standard, csvz, for putting CSV files and their metadata into a zip file.

    https://github.com/secretGeek/csvz

    It’s a pretty powerful concept; a rough sketch of the bundling idea follows below.

    SimonW’s preferred technique of using sqlite as the means of exchange is also very powerful, particularly when combined with all of the utils he maintains.
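
    A rough sketch of that bundling idea in Python, using only the standard library; the metadata layout below (a meta/tables.json file) is my own illustration, not necessarily what the csvz spec prescribes:

        import csv
        import io
        import json
        import zipfile

        # Build one CSV table in memory.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["id", "name"])
        writer.writerow([1, "alice"])

        # Bundle the table and its metadata into a single zip archive.
        # NOTE: "meta/tables.json" is a hypothetical layout; check the
        # csvz spec for the structure it actually prescribes.
        with zipfile.ZipFile("people.csvz", "w", zipfile.ZIP_DEFLATED) as z:
            z.writestr("people.csv", buf.getvalue())
            z.writestr("meta/tables.json", json.dumps(
                {"people.csv": {"columns": {"id": "int", "name": "string"}}}))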

  • AwesomeCSV

    🕶️A curated list of awesome tools for dealing with CSV.

  • One major limitation of quoting values so that they can contain record delimiters (as opposed to escaping the delimiters) is that it stops systems from being able to load records in parallel.

    Some systems ban embedded record delimiters for this reason (a sketch of the failure mode follows below).

    Btw, I’ve (previously) included at least one of your essays in the “awesome csv” list at GitHub. https://github.com/secretGeek/AwesomeCSV#essays

    There are a few specs mentioned there too — is one of those the spec you worked on?
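
    A sketch of the parallel-loading problem described above: a reader that seeks to an arbitrary byte offset and resynchronizes at the next newline cannot tell a record boundary from a newline inside a quoted field.

        data = b'id,comment\n1,"line one\nline two"\n2,plain\n'

        # A naive parallel loader splits the file at a byte offset and
        # scans for the next newline -- but the newline inside the
        # quoted field looks identical to a real record boundary.
        chunk = data[len(data) // 2:]
        start = chunk.index(b"\n") + 1
        print(chunk[start:])  # b'line two"\n2,plain\n' -- resumed mid-record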

  • oil

    Oils is our upgrade path from bash to a better language and runtime. It's also for Python and JavaScript users who avoid shell!

  • Related thread: CSVs: The Good, the Bad, and the Ugly (2020)

    https://news.ycombinator.com/item?id=25014721

    FWIW I have designed an upgrade to TSV as part of https://www.oilshell.org. Oil actually emits it now from 'pp proc' to pretty print "functions" and their docstrings as a table.

    It will be used in many parts of Oil, most of which aren't implemented yet.

    It is called QTT -- Quoted, Typed Tables (formerly QTSV)

    It's built on top of QSN, which is just Rust string literal notation -- https://www.oilshell.org/release/latest/doc/qsn.html

    There is a stub doc which links to a wiki proposal -- http://www.oilshell.org/release/latest/doc/qtt.html

    ------

    I think the only way that such a format becomes popular is if Oil itself becomes popular, thus I haven't really emphasized it as a spec (similar to how JSON wouldn't be popular if the browser / JS weren't popular).

    But it is a pretty easy spec -- just take TSV, add types to the column header, and specify that any cell that begins with a single quote is QSN.

    A nice property is that almost every TSV file is a valid QTT file -- the exceptions being TSV files where a cell is just a single quote, or contains malformed QSN.

    Note that TSV officially can't represent fields with tabs, but QTT can, with '\t' (a reader sketch follows at the end of this comment).

    As always, feel free to contact me if you want to work on any of this.

    ----

    The ASCII field separators don't make sense -- they break tools and take you off the narrow waist of Unix. https://www.oilshell.org/blog/2022/02/diagrams.html

    A key design point of QTT is that it's a special case of the "lines of text" narrow waist. CSV does not have that property, because fields can contain embedded newlines. And unlike TSV, QTT cells can contain arbitrary values.
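
    A minimal sketch of a QTT reader following the rules above (typed column headers; any cell starting with a single quote decoded as QSN). The unquote_qsn helper is hypothetical and handles only a few escapes; the real QSN spec covers more (\xNN, \u{...}, etc.):

        def unquote_qsn(cell: str) -> str:
            # Partial, illustrative QSN decoder: strip the surrounding
            # single quotes and expand a few Rust-style escapes.
            body = cell[1:-1] if cell.endswith("'") else cell[1:]
            return body.replace("\\t", "\t").replace("\\n", "\n").replace("\\'", "'")

        def read_qtt(lines):
            header = lines[0].rstrip("\n").split("\t")
            names = [col.split(":")[0] for col in header]  # "count:Int" -> "count"
            for line in lines[1:]:
                cells = line.rstrip("\n").split("\t")
                # Types from the header are ignored in this sketch.
                yield dict(zip(names, (unquote_qsn(c) if c.startswith("'") else c
                                       for c in cells)))

        rows = read_qtt(["name:Str\tcount:Int\n", "'has\\ttab'\t3\n"])
        print(list(rows))  # [{'name': 'has\ttab', 'count': '3'}]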

  • ssv

    Separator Separated Values file format and libraries (by tmccombs)

  • https://github.com/tmccombs/ssv

    So far it's just a Python library, but I'm planning on adding editor plugins at least for vim, vscode, and maybe emacs; libraries for additional languages; and maybe some CLI commands for it.

    One distinction from the OP is that the delimiters also include a tab (for fields) or newline (for records) by default (but not in “compact” mode). That has the benefit that the files are at least readable with editors and pagers that aren't aware of the format.
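
    A sketch of that readable default mode, assuming each field separator is the ASCII unit separator (0x1F) followed by a tab and each record separator is the ASCII record separator (0x1E) followed by a newline; that pairing is my reading of the comment, so check the ssv spec for the authoritative encoding:

        FIELD_SEP = "\x1f\t"   # unit separator + tab (assumed readable default)
        RECORD_SEP = "\x1e\n"  # record separator + newline

        def write_ssv(rows):
            return RECORD_SEP.join(FIELD_SEP.join(r) for r in rows) + RECORD_SEP

        def read_ssv(text):
            return [rec.split(FIELD_SEP) for rec in text.split(RECORD_SEP) if rec]

        doc = write_ssv([["id", "note"], ["1", "tabs \t and, commas are fine"]])
        print(read_ssv(doc))  # round-trips without any quoting or escaping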

  • SheetJS js-xlsx

    📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs

  • That they are used doesn't make them "decent". CSV and XLSX have a host of "edge cases" that do not adhere to RFC4180 / ECMA-376. The CSV "specification" RFC4180 is notoriously silent on matters of value interpretation (this same problem affects the format proposed by the author). ECMA-376 has a number of underspecified features and other specifications like MS-XLSX attempt to define Excel's behavior.

    This is not an issue in an environment where every XLSX file is written by Excel 2019 or where every CSV file is exported from Pandas dataframes. And when it does turn out to be an issue, the processes will change (as when the SEPT1 gene was renamed to SEPTIN1 to avoid Excel date interpretation).

    We maintain open-source libraries for processing spreadsheets (https://github.com/SheetJS/sheetjs) and we are continually surprised by the undocumented corner cases that are exploited by third-party software.
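
    A small illustration of that silence around value interpretation: RFC4180 only defines the framing, so Python's csv module (correctly) hands every cell back as an uninterpreted string.

        import csv
        import io

        # RFC4180 says how to tokenize this, but not whether 001 is the
        # number 1, the string "001", or a zip code -- nor whether SEPT1
        # is a gene name or the first of September.
        sample = "code,gene\n001,SEPT1\n"
        for row in csv.reader(io.StringIO(sample)):
            print(row)  # ['code', 'gene'], then ['001', 'SEPT1'] -- all strings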

  • odiff

    The fastest pixel-by-pixel image visual difference tool in the world.

  • Yes, multiple tools exist to diff images, e.g. https://github.com/dmtrKovalenko/odiff

  • parquet-wasm

    Rust-based WebAssembly bindings to read and write Apache Parquet data

  • ndjson-spec

    Specification

  • ndjson is actually a really pragmatic choice here that should not be overlooked.

    Tabular formats break down when the data stops being tabular. This comes up a lot. People love spreadsheets as editing tools, but then they end up doing things like putting comma-separated values in a cell. I've also seen business people use empty cells to indicate hierarchical "inheritance". An alternate interpretation of that is that the data has some kind of hierarchy and isn't really row-based. People just shoehorn all sorts of stuff into spreadsheets because they are there.

    With ndjson, every line is a JSON object. Every cell is a named field. If you need multiple values, you can use arrays for the fields. JSON has actual types (int, float, string, boolean). So you can have both hierarchical and multivalued data in a row. The case where all the fields are simple primitives is just the simple case. It has an actual specification too: https://github.com/ndjson/ndjson-spec. I like it because I can stream process it and represent arbitrarily complex objects/documents instead of having to flatten them into columns. The parsing overhead makes it more expensive to use than tsv, though. The file size is fine if you use e.g. gzip compression; it compresses really well generally. (A streaming sketch follows at the end of this comment.)

    But I also use tab-separated values quite often for simpler data. I mainly like it because Google Sheets provides that as an export option and is actually a great editor for tabular data that I can just hand to non-technical people.

    Both file formats can be easily manipulated with command-line tools (jq, csvkit, sed, etc.). Both can be processed using mature parsers in a wide range of languages. If you really want, you can edit them with simple text editors, though you should probably be careful with that. Tools like bat know how to format and highlight these files as well. Tools like that are important because you can use them and script them together rather than reinventing wheels.

    Formats like Parquet are cumbersome mainly because none of the tools I mention support them. No editors. Not a lot of command-line tools. No formatting tools. If you want to inspect the data, you pretty much have to write a program to do it. I guess this would be fixable, but people seem not really interested in doing that work. Parquet becomes nice when you need to process data at scale and in any case use a lot of specialized tooling and infrastructure. Not for everyone, in other words.

    Character encoding is not an issue with either tsv or ndjson if you simply use UTF-8, always. I see no good technical reason why you should use anything else. Anything else should be treated as a bug or legacy. Of course a lot of data has encoding issues regardless. Shit in, shit out basically. Fix it at the source, if you can.

    The last point is actually key because all of the issues with e.g. csv usually start with people just using really crappy tools to produce source data. Switching to a different file format won't fix these issues since you still deal with the same crappy tools that of course do not support this file format. Anything else you could just fix to not suck to begin with. And if you do, it stops being an issue. The problem is when you can't.

    Nothing wrong with tsv if you use UTF-8 and some nice framework that generates properly escaped values and does all the right things. The worst you can say about it is that there are a bit too many choices here and people tend to improvise their own crappy data-generation tools with escaping bugs and other issues. Most of the pain is self-inflicted. The reason csv/tsv are popular is that you don't need a lot of frameworks / tools. But of course the flip side is that DIY leads to people introducing all sorts of unnecessary issues. Try not to do that.
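
    A sketch of both habits this comment describes, using only the Python standard library: streaming ndjson one object per line (nested and multi-valued fields included), and letting the csv module handle TSV escaping, with UTF-8 throughout.

        import csv
        import io
        import json

        # ndjson: one JSON object per line, so it can be stream-processed
        # and each record may carry nested or multi-valued fields.
        lines = '{"name": "alice", "tags": ["admin", "ops"]}\n{"name": "bob", "tags": []}\n'
        for line in io.StringIO(lines):
            record = json.loads(line)
            print(record["name"], record["tags"])

        # TSV: let csv.writer do the escaping (and always use UTF-8)
        # rather than hand-rolling string concatenation.
        with open("out.tsv", "w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerow(["name", "note"])
            writer.writerow(["alice", "likes\ttabs"])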

  • simdjson

    Parsing gigabytes of JSON per second: used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks

  • hsv5

    HTML5 Based Alternative to CSV, TSV, JSONL, etc

    I've pondered exactly that. After a previous HN thread, I tried my hand at writing a specification that was as minimal as possible but still followed HTML5. For example, many end tags are optional, like the `tr` and `td` end tags. Though I pulled in RDFa for richer data types.

    Here's the GitHub repo for what I like to call HSV5: https://github.com/elcritch/hsv5/blob/main/README.md ;)

    And an example of the smallest _valid_ HTML table format:

        
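    Something along these lines, leaning on the optional `tr` and `td` end tags (the repo's README has the authoritative version):

        <table>
        <tr><td>id<td>name
        <tr><td>1<td>alice
        <tr><td>2<td>bob
        </table>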

  • TileDB

    The Universal Storage Engine

  • Hi folks, Stavros from TileDB here. Here are my two cents on tabular data. TileDB (Embedded) is a very serious competitor to Parquet, the only other sane choice IMO when it comes to storing large volumes of tabular data (especially when combined with Arrow). Admittedly, we haven’t been advertising TileDB’s tabular capabilities, but that’s only because we were busy with much more challenging applications, such as genomics (population and single-cell), LiDAR, imaging and other very convoluted (from a data format perspective) domains.

    Similar to Parquet:

    * TileDB is columnar and comes with a lot of compressors, checksum and encryption filters.

    * TileDB is built in C++ with multi-threading and vectorization in mind

    * TileDB integrates with Arrow, using zero-copy techniques

    * TileDB has numerous optimized APIs (C, C++, C#, Python, R, Java, Go)

    * TileDB pushes compute down to storage, similar to what Arrow does

    Better than Parquet:

    * TileDB is multi-dimensional, allowing rapid multi-column conditions

    * TileDB builds versioning and time-traveling into the format (no need for Delta Lake, Iceberg, etc)

    * TileDB allows for lock-free parallel writes / parallel reads with ACID properties (no need for Delta Lake, Iceberg, etc)

    * TileDB can handle more than tables, for example n-dimensional dense arrays (e.g., for imaging, video, etc)

    Useful links:

    * Github repo (https://github.com/TileDB-Inc/TileDB)

    * TileDB Embedded overview (https://tiledb.com/products/tiledb-embedded/)

    * Docs (https://docs.tiledb.com/)

    * Webinar on why arrays as a universal data model (https://tiledb.com/blog/why-arrays-as-a-universal-data-model)

    Happy to hear everyone’s thoughts.
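
    As a sketch of what a small table might look like through the Python API (the schema below is illustrative; check the TileDB docs for current API details):

        import numpy as np
        import tiledb

        # A 1-D dense array standing in for a two-column table.
        dom = tiledb.Domain(tiledb.Dim(name="row", domain=(0, 3), tile=4, dtype=np.int64))
        schema = tiledb.ArraySchema(
            domain=dom,
            sparse=False,
            attrs=[tiledb.Attr(name="price", dtype=np.float64),
                   tiledb.Attr(name="qty", dtype=np.int32)])
        tiledb.Array.create("orders", schema)

        # Writes are columnar: one array per attribute.
        with tiledb.open("orders", mode="w") as arr:
            arr[:] = {"price": np.array([9.5, 3.0, 7.25, 1.0]),
                      "qty": np.array([1, 2, 3, 4], dtype=np.int32)}

        with tiledb.open("orders", mode="r") as arr:
            print(arr[:]["price"])  # versioned reads, multi-column conditions, etc.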

  • rson

    Rust Object Notation

  • Hm I wasn't aware of RSON. https://github.com/rson-rs/rson

    QSN isn't intended to be tied to Rust in any way (and isn't), while RSON says it uses the Serde data model.

    This gets at an issue I have been having a hard time explaining, mentioned here:

    http://www.oilshell.org/blog/2022/03/backlog-arch.html

    That is, narrow waists are necessarily a COMPROMISE. JSON is a compromise, and Rust users will be equally unhappy as Lua or Erlang users. That is a feature and not a bug for something meant for interoperability. You are "stuck with" the lowest common denominator, but that's what enables interop.

    I contrast "monoglot" serialization formats like Python pickle and Go's gob with language-independent formats like JSON, TSV, and HTML. The wisdom of JSON is that Crockford specified it independently of JavaScript.

    But both are useful.

    It's not clear if RSON is meant to be monoglot or polyglot, but it's a huge difference and it seems more monoglot. QSN on the other hand is definitely a polyglot design like JSON, despite being derived from Rust.

  • catsql

    cat for sql dbs

  • There is catsql[1] but it barfs in my Python environment.

    [1] https://github.com/paulfitz/catsql
