Modernizing AWK, a 45-year old language, by adding CSV support

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • tsv-utils

    eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

  • When you have "format wars", the best idea is usually to have a converter program change to the easiest to work with format - unless this incurs a massive expansion in space as per some image/video formats.

    With CSV-like data, bulk conversion from quoted-escaped RFC4180 CSV to a simpler-to-parse format is the best plan for several reasons. First, it may "catch on", help Microsoft/R/whoever embrace the format and in doing so squash many bugs written by "data analyst/scientist coders". Second, in a shell a|b run programs a & b in parallel on multi-core and allow things like csv2x|head -n10000|b. Third, bulk conversion to a random access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation to be nCores times faster (under often satisfied assumptions). There are some D tools for this in https://github.com/eBay/tsv-utils and a much smaller stand-alone Nim tool https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim . Optional quoting was always going to be a PITA due to its non-locality. What if there is no quote anywhere? Fourth, by using a program as the unit of modularity in this case, you make things programming language agnostic. Someone could go to town and write a pure SIMD/AVX512 converter in assembly even and solve the problem "once and for all" on a given CPU. The problem is actually just simple enough that this smells possible.

    I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see https://news.ycombinator.com/item?id=31352170).

    Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way to down to binary so that you never have to parse again (Well, unless you call CPUs loading registers "parsing") which is what database systems have been doing since the 1960s.

  • usv

    Unicode Separated Values (USV) data markup for units, records, groups, files, streaming, and more.

  • Ben this is great, thank you. Would you be open to adding Unicode Separated Values (USV) as well? It's much like CSV and also simpler because of no escaping and no quoting. I can donate $50 to you or your charity of choice as a token of thanks and encouragement.

    https://github.com/sixarm/usv

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • nio

    Low Overhead Numerical/Native IO library & tools (by c-blake)

  • When you have "format wars", the best idea is usually to have a converter program change to the easiest to work with format - unless this incurs a massive expansion in space as per some image/video formats.

    With CSV-like data, bulk conversion from quoted-escaped RFC4180 CSV to a simpler-to-parse format is the best plan for several reasons. First, it may "catch on", help Microsoft/R/whoever embrace the format and in doing so squash many bugs written by "data analyst/scientist coders". Second, in a shell a|b run programs a & b in parallel on multi-core and allow things like csv2x|head -n10000|b. Third, bulk conversion to a random access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation to be nCores times faster (under often satisfied assumptions). There are some D tools for this in https://github.com/eBay/tsv-utils and a much smaller stand-alone Nim tool https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim . Optional quoting was always going to be a PITA due to its non-locality. What if there is no quote anywhere? Fourth, by using a program as the unit of modularity in this case, you make things programming language agnostic. Someone could go to town and write a pure SIMD/AVX512 converter in assembly even and solve the problem "once and for all" on a given CPU. The problem is actually just simple enough that this smells possible.

    I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see https://news.ycombinator.com/item?id=31352170).

    Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way to down to binary so that you never have to parse again (Well, unless you call CPUs loading registers "parsing") which is what database systems have been doing since the 1960s.

  • miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

  • I have grown fond of using miller[0] to handle command line data processing. Handles the standard tabular formats (csv, tsv, json) and has all of the standard data cleanup options. Works on streams so (most operations) are not limited by memory.

    [0]: https://github.com/johnkerl/miller

  • simonwillisonblog

    The source code behind my blog

  • For anything down and dirty, what's wrong with -F'"'? For anything fancy there are plenty of things like the below.

    eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

    includes csv to tsv: https://github.com/eBay/tsv-utils

    HT: https://simonwillison.net/

  • xsv

    A fast CSV command line toolkit written in Rust.

  • xsv does similar stuff for CSV, and very rapidly: https://github.com/BurntSushi/xsv

    https://miller.readthedocs.io/en/latest/why/ has a nice section on "why miller":

    > First: there are tools like xsv which handles CSV marvelously and jq which handles JSON marvelously, and so on -- but I over the years of my career in the software industry I've found myself, and others, doing a lot of ad-hoc things which really were fundamentally the same except for format. So the number one thing about Miller is doing common things while supporting multiple formats: (a) ingest a list of records where a record is a list of key-value pairs (however represented in the input files); (b) transform that stream of records; (c) emit the transformed stream -- either in the same format as input, or in a different format.

  • csvquote

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • qsv

    CSVs sliced, diced & analyzed.

  • I was using xsv a lot at work (it is so much faster than csvkit) but I've recently jumped to qsv, a fork with more features.

    https://github.com/jqnatividad/qsv

  • zsv

    zsv+lib: world's fastest (simd) CSV parser, bare metal or wasm, with an extensible CLI for SQL querying, format conversion and more

  • This is not an outlier. `mlr` is quite slow, literally off-the-charts slow for our purposes when we benchmarked vs xsv and zsv (see https://github.com/liquidaty/zsv. disclaimer: I'm one of its authors)

  • goawk

    A POSIX-compliant AWK interpreter written in Go, with CSV support

  • I've also added explicit tests for ASCII and Unicode unit and record separators, just to ensure I don't regress: https://github.com/benhoyt/goawk/commit/215652de58f33630edb0...

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts