Why isn’t there a decent file format for tabular data?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • csvz

    The hot new standard in open databases

  • I put some work into creating a standard, csvz, for putting CSV files and their metadata into a zip file.

    https://github.com/secretGeek/csvz

    It’s a pretty powerful concept; a rough sketch of the bundling idea follows below.

    SimonW’s preferred technique of using sqlite as the means of exchange is also very powerful, particularly when combined with all of the utils he maintains.
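
    A rough sketch of that bundling idea in Python, using only the standard library; the metadata layout below (a meta/tables.json file) is my own illustration, not necessarily what the csvz spec prescribes:

        import csv
        import io
        import json
        import zipfile

        # Build one CSV table in memory.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["id", "name"])
        writer.writerow([1, "alice"])

        # Bundle the table and its metadata into a single zip archive.
        # NOTE: "meta/tables.json" is a hypothetical layout; check the
        # csvz spec for the structure it actually prescribes.
        with zipfile.ZipFile("people.csvz", "w", zipfile.ZIP_DEFLATED) as z:
            z.writestr("people.csv", buf.getvalue())
            z.writestr("meta/tables.json", json.dumps(
                {"people.csv": {"columns": {"id": "int", "name": "string"}}}))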

  • AwesomeCSV

    🕶️A curated list of awesome tools for dealing with CSV.

  • One major limitation of quoting values so that they can contain record delimiters (as opposed to escaping the delimiters) is that it stops systems from being able to load records in parallel.

    Some systems ban embedded record delimiters for this reason (a sketch of the failure mode follows below).

    Btw, I’ve (previously) included at least one of your essays in the “awesome csv” list at GitHub. https://github.com/secretGeek/AwesomeCSV#essays

    There are a few specs mentioned there too — is one of those the spec you worked on?
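
    A sketch of the parallel-loading problem described above: a reader that seeks to an arbitrary byte offset and resynchronizes at the next newline cannot tell a record boundary from a newline inside a quoted field.

        data = b'id,comment\n1,"line one\nline two"\n2,plain\n'

        # A naive parallel loader splits the file at a byte offset and
        # scans for the next newline -- but the newline inside the
        # quoted field looks identical to a real record boundary.
        chunk = data[len(data) // 2:]
        start = chunk.index(b"\n") + 1
        print(chunk[start:])  # b'line two"\n2,plain\n' -- resumed mid-record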

  • oil

    Oils is our upgrade path from bash to a better language and runtime. It's also for Python and JavaScript users who avoid shell!

  • Related thread: CSVs: The Good, the Bad, and the Ugly (2020)

    https://news.ycombinator.com/item?id=25014721

    FWIW I have designed an upgrade to TSV as part of https://www.oilshell.org. Oil actually emits it now from 'pp proc' to pretty print "functions" and their docstrings as a table.

    It will be used in many parts of Oil, most of which aren't implemented yet.

    It is called QTT -- Quoted, Typed Tables (formerly QTSV)

    It's built on top of QSN, which is just Rust string literal notation -- https://www.oilshell.org/release/latest/doc/qsn.html

    There is a stub doc which links to a wiki proposal -- http://www.oilshell.org/release/latest/doc/qtt.html

    ------

    I think the only way that such a format becomes popular is if Oil itself becomes popular, thus I haven't really emphasized it as a spec (similar to how JSON wouldn't be popular if the browser / JS weren't popular).

    But it is a pretty easy spec -- just take TSV, add types to the column header, and specify that any cell that begins with a single quote is QSN.

    A nice property is that almost every TSV file is a valid QTT file -- the exceptions being TSV files where a cell is just a single quote, or contains malformed QSN.

    Note that TSV officially can't represent fields with tabs, but QTT can, with '\t' (a reader sketch follows at the end of this comment).

    As always, feel free to contact me if you want to work on any of this.

    ----

    The ASCII field separators don't make sense -- they break tools and take you off the narrow waist of Unix. https://www.oilshell.org/blog/2022/02/diagrams.html

    A key design point of QTT is that it's a special case of the "lines of text" narrow waist. CSV does not have that property, because fields can contain embedded newlines. And unlike TSV, QTT cells can contain arbitrary values.
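
    A minimal sketch of a QTT reader following the rules above (typed column headers; any cell starting with a single quote decoded as QSN). The unquote_qsn helper is hypothetical and handles only a few escapes; the real QSN spec covers more (\xNN, \u{...}, etc.):

        def unquote_qsn(cell: str) -> str:
            # Partial, illustrative QSN decoder: strip the surrounding
            # single quotes and expand a few Rust-style escapes.
            body = cell[1:-1] if cell.endswith("'") else cell[1:]
            return body.replace("\\t", "\t").replace("\\n", "\n").replace("\\'", "'")

        def read_qtt(lines):
            header = lines[0].rstrip("\n").split("\t")
            names = [col.split(":")[0] for col in header]  # "count:Int" -> "count"
            for line in lines[1:]:
                cells = line.rstrip("\n").split("\t")
                # Types from the header are ignored in this sketch.
                yield dict(zip(names, (unquote_qsn(c) if c.startswith("'") else c
                                       for c in cells)))

        rows = read_qtt(["name:Str\tcount:Int\n", "'has\\ttab'\t3\n"])
        print(list(rows))  # [{'name': 'has\ttab', 'count': '3'}]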

  • ssv

    Separator Separated Values file format and libraries (by tmccombs)

  • https://github.com/tmccombs/ssv

    So far it's just a Python library, but I'm planning on adding editor plugins at least for vim, vscode, and maybe emacs; libraries for additional languages; and maybe some CLI commands for it.

    One distinction from the OP is that the delimiters also include a tab (for fields) or newline (for records) by default (but not in “compact” mode). That has the benefit that the files are at least readable with editors and pagers that aren't aware of the format.
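
    A sketch of that readable default mode, assuming each field separator is the ASCII unit separator (0x1F) followed by a tab and each record separator is the ASCII record separator (0x1E) followed by a newline; that pairing is my reading of the comment, so check the ssv spec for the authoritative encoding:

        FIELD_SEP = "\x1f\t"   # unit separator + tab (assumed readable default)
        RECORD_SEP = "\x1e\n"  # record separator + newline

        def write_ssv(rows):
            return RECORD_SEP.join(FIELD_SEP.join(r) for r in rows) + RECORD_SEP

        def read_ssv(text):
            return [rec.split(FIELD_SEP) for rec in text.split(RECORD_SEP) if rec]

        doc = write_ssv([["id", "note"], ["1", "tabs \t and, commas are fine"]])
        print(read_ssv(doc))  # round-trips without any quoting or escaping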

  • SheetJS js-xlsx

    📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs

  • That they are used doesn't make them "decent". CSV and XLSX have a host of "edge cases" that do not adhere to RFC4180 / ECMA-376. The CSV "specification" RFC4180 is notoriously silent on matters of value interpretation (this same problem affects the format proposed by the author). ECMA-376 has a number of underspecified features and other specifications like MS-XLSX attempt to define Excel's behavior.

    This is not an issue in an environment where every XLSX file is written by Excel 2019 or where every CSV file is exported from Pandas dataframes. And when it does turn out to be an issue, the processes will change (as when the SEPT1 gene was renamed to SEPTIN1 to avoid Excel date interpretation).

    We maintain open-source libraries for processing spreadsheets (https://github.com/SheetJS/sheetjs) and we are continually surprised by the undocumented corner cases that are exploited by third-party software.
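
    A small illustration of that silence around value interpretation: RFC4180 only defines the framing, so Python's csv module (correctly) hands every cell back as an uninterpreted string.

        import csv
        import io

        # RFC4180 says how to tokenize this, but not whether 001 is the
        # number 1, the string "001", or a zip code -- nor whether SEPT1
        # is a gene name or the first of September.
        sample = "code,gene\n001,SEPT1\n"
        for row in csv.reader(io.StringIO(sample)):
            print(row)  # ['code', 'gene'], then ['001', 'SEPT1'] -- all strings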

  • odiff

    The fastest pixel-by-pixel image visual difference tool in the world.

  • Yes, multiple tools exist to diff images, e.g. https://github.com/dmtrKovalenko/odiff

  • parquet-wasm

    Rust-based WebAssembly bindings to read and write Apache Parquet data

  • ndjson-spec

    Specification

  • ndjson is actually a really pragmatic choice here that should not be overlooked.

    Tabular formats break down when the data stops being tabular. This comes up a lot. People love spreadsheets as editing tools, but then they end up doing things like putting comma-separated values in a cell. I've also seen business people use empty cells to indicate hierarchical "inheritance". An alternate interpretation of that is that the data has some kind of hierarchy and isn't really row-based. People just shoehorn all sorts of stuff into spreadsheets because they are there.

    With ndjson, every line is a JSON object. Every cell is a named field. If you need multiple values, you can use arrays for the fields. JSON has actual types (int, float, string, boolean). So you can have both hierarchical and multivalued data in a row. The case where all the fields are simple primitives is just the simple case. It has an actual specification too: https://github.com/ndjson/ndjson-spec. I like it because I can stream process it and represent arbitrarily complex objects/documents instead of having to flatten them into columns. The parsing overhead makes it more expensive to use than tsv, though. The file size is fine if you use e.g. gzip compression; it compresses really well generally. (A streaming sketch follows at the end of this comment.)

    But I also use tab-separated values quite often for simpler data. I mainly like it because Google Sheets provides that as an export option and is actually a great editor for tabular data that I can just hand to non-technical people.

    Both file formats can be easily manipulated with command-line tools (jq, csvkit, sed, etc.). Both can be processed using mature parsers in a wide range of languages. If you really want, you can edit them with simple text editors, though you should probably be careful with that. Tools like bat know how to format and highlight these files as well. Tools like that are important because you can use them and script them together rather than reinventing wheels.

    Formats like Parquet are cumbersome mainly because none of the tools I mention support them. No editors. Not a lot of command-line tools. No formatting tools. If you want to inspect the data, you pretty much have to write a program to do it. I guess this would be fixable, but people seem not really interested in doing that work. Parquet becomes nice when you need to process data at scale and in any case use a lot of specialized tooling and infrastructure. Not for everyone, in other words.

    Character encoding is not an issue with either tsv or ndjson if you simply use UTF-8, always. I see no good technical reason why you should use anything else. Anything else should be treated as a bug or legacy. Of course a lot of data has encoding issues regardless. Shit in, shit out basically. Fix it at the source, if you can.

    The last point is actually key because all of the issues with e.g. csv usually start with people just using really crappy tools to produce source data. Switching to a different file format won't fix these issues since you still deal with the same crappy tools that of course do not support this file format. Anything else you could just fix to not suck to begin with. And if you do, it stops being an issue. The problem is when you can't.

    Nothing wrong with tsv if you use UTF-8 and some nice framework that generates properly escaped values and does all the right things. The worst you can say about it is that there are a bit too many choices here and people tend to improvise their own crappy data-generation tools with escaping bugs and other issues. Most of the pain is self-inflicted. The reason csv/tsv are popular is that you don't need a lot of frameworks / tools. But of course the flip side is that DIY leads to people introducing all sorts of unnecessary issues. Try not to do that.
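
    A sketch of both habits this comment describes, using only the Python standard library: streaming ndjson one object per line (nested and multi-valued fields included), and letting the csv module handle TSV escaping, with UTF-8 throughout.

        import csv
        import io
        import json

        # ndjson: one JSON object per line, so it can be stream-processed
        # and each record may carry nested or multi-valued fields.
        lines = '{"name": "alice", "tags": ["admin", "ops"]}\n{"name": "bob", "tags": []}\n'
        for line in io.StringIO(lines):
            record = json.loads(line)
            print(record["name"], record["tags"])

        # TSV: let csv.writer do the escaping (and always use UTF-8)
        # rather than hand-rolling string concatenation.
        with open("out.tsv", "w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerow(["name", "note"])
            writer.writerow(["alice", "likes\ttabs"])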

  • simdjson

    Parsing gigabytes of JSON per second: used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks

  • hsv5

    HTML5 Based Alternative to CSV, TSV, JSONL, etc

    I've pondered exactly that. After a previous HN thread, I tried my hand at writing a specification that was as minimal as possible but still followed HTML5. For example, many end tags are optional, like the `tr` and `td` end tags. Though I pulled in RDFa for richer data types.

    Here's the GitHub repo for what I like to call HSV5: https://github.com/elcritch/hsv5/blob/main/README.md ;)

    And an example of the smallest _valid_ HTML table format:

        
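    Something along these lines, leaning on the optional `tr` and `td` end tags (the repo's README has the authoritative version):

        <table>
        <tr><td>id<td>name
        <tr><td>1<td>alice
        <tr><td>2<td>bob
        </table>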

  • TileDB

    The Universal Storage Engine

  • Hi folks, Stavros from TileDB here. Here are my two cents on tabular data. TileDB (Embedded) is a very serious competitor to Parquet, the only other sane choice IMO when it comes to storing large volumes of tabular data (especially when combined with Arrow). Admittedly, we haven’t been advertising TileDB’s tabular capabilities, but that’s only because we were busy with much more challenging applications, such as genomics (population and single-cell), LiDAR, imaging and other very convoluted (from a data format perspective) domains.

    Similar to Parquet:

    * TileDB is columnar and comes with a lot of compressors, checksum and encryption filters.

    * TileDB is built in C++ with multi-threading and vectorization in mind

    * TileDB integrates with Arrow, using zero-copy techniques

    * TileDB has numerous optimized APIs (C, C++, C#, Python, R, Java, Go)

    * TileDB pushes compute down to storage, similar to what Arrow does

    Better than Parquet:

    * TileDB is multi-dimensional, allowing rapid multi-column conditions

    * TileDB builds versioning and time-traveling into the format (no need for Delta Lake, Iceberg, etc)

    * TileDB allows for lock-free parallel writes / parallel reads with ACID properties (no need for Delta Lake, Iceberg, etc)

    * TileDB can handle more than tables, for example n-dimensional dense arrays (e.g., for imaging, video, etc)

    Useful links:

    * Github repo (https://github.com/TileDB-Inc/TileDB)

    * TileDB Embedded overview (https://tiledb.com/products/tiledb-embedded/)

    * Docs (https://docs.tiledb.com/)

    * Webinar on why arrays as a universal data model (https://tiledb.com/blog/why-arrays-as-a-universal-data-model)

    Happy to hear everyone’s thoughts.
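
    As a sketch of what a small table might look like through the Python API (the schema below is illustrative; check the TileDB docs for current API details):

        import numpy as np
        import tiledb

        # A 1-D dense array standing in for a two-column table.
        dom = tiledb.Domain(tiledb.Dim(name="row", domain=(0, 3), tile=4, dtype=np.int64))
        schema = tiledb.ArraySchema(
            domain=dom,
            sparse=False,
            attrs=[tiledb.Attr(name="price", dtype=np.float64),
                   tiledb.Attr(name="qty", dtype=np.int32)])
        tiledb.Array.create("orders", schema)

        # Writes are columnar: one array per attribute.
        with tiledb.open("orders", mode="w") as arr:
            arr[:] = {"price": np.array([9.5, 3.0, 7.25, 1.0]),
                      "qty": np.array([1, 2, 3, 4], dtype=np.int32)}

        with tiledb.open("orders", mode="r") as arr:
            print(arr[:]["price"])  # versioned reads, multi-column conditions, etc.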

  • rson

    Rust Object Notation

  • Hm I wasn't aware of RSON. https://github.com/rson-rs/rson

    QSN isn't intended to be tied to Rust in any way (and isn't), while RSON says it uses the Serde data model.

    This gets at an issue I have been having a hard time explaining, mentioned here:

    http://www.oilshell.org/blog/2022/03/backlog-arch.html

    That is, narrow waists are necessarily a COMPROMISE. JSON is a compromise, and Rust users will be equally unhappy as Lua or Erlang users. That is a feature and not a bug for something meant for interoperability. You are "stuck with" the lowest common denominator, but that's what enables interop.

    I contrast "monoglot" serialization formats like Python pickle and Go's gob with language-independent formats like JSON, TSV, and HTML. The wisdom of JSON is that Crockford specified it independently of JavaScript.

    But both are useful.

    It's not clear if RSON is meant to be monoglot or polyglot, but it's a huge difference and it seems more monoglot. QSN on the other hand is definitely a polyglot design like JSON, despite being derived from Rust.

  • catsql

    cat for sql dbs

  • There is catsql[1] but it barfs in my Python environment.

    [1] https://github.com/paulfitz/catsql
