Since this is about CSV, here's the obligatory tool for larger files:
* https://github.com/antonycourtney/tad
> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.
FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows: Parquet is well-defined, and file sizes are a fraction of their CSV equivalents.
> Is it a limitation of the format itself?
I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).
No one uses that format for streamed JSON; see ndjson and JSON Lines:
http://ndjson.org/
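The format is trivial to stream: one complete JSON document per line. A minimal sketch in Python (the helper name `iter_ndjson` is my own, not part of any library):

```python
import io
import json

def iter_ndjson(stream):
    """Yield one parsed object per non-empty line (NDJSON / JSON Lines)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Each record is a self-contained JSON document on its own line,
# so a reader never needs to buffer the whole file.
data = io.StringIO('{"a": 1}\n{"a": 2}\n')
records = list(iter_ndjson(data))
```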
The size complaint is overblown, as repeated fields are compressed away.
As other folks rightfully commented, CSV is a minefield. One should assume every CSV file is broken in some way. They also don't enumerate any of the downsides of CSV.
What people should consider is using formats like Avro or Parquet that carry their schema with them, so the data can be loaded and analyzed without having to deal with column meanings manually.
For manipulating CSV from the terminal, check out https://github.com/BurntSushi/xsv
i had a lot of fun exploring the performance ceiling of csv and csv-like formats. turns out binary encoding of size-prefixed byte arrays is fast[1].
csv is just a sequence of 2d byte arrays. probably avoid it if dealing with heterogeneous external data. possibly use it if dealing with homogeneous internal data.
https://github.com/nathants/bsv
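The size-prefixing idea above can be sketched in a few lines of Python (this is just an illustration of the technique, not the actual bsv wire format): because every field's length is stated up front, a decoder does straight slicing with no quoting or escaping rules to parse.

```python
import struct

def encode_row(fields):
    """Length-prefix a row: u16 field count, then (u16 size, raw bytes) per field."""
    out = [struct.pack("<H", len(fields))]
    for f in fields:
        out.append(struct.pack("<H", len(f)))
        out.append(f)
    return b"".join(out)

def decode_row(buf):
    """Inverse of encode_row: read sizes, slice bytes. No escape handling needed."""
    (count,) = struct.unpack_from("<H", buf, 0)
    pos = 2
    fields = []
    for _ in range(count):
        (size,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        fields.append(buf[pos:pos + size])
        pos += size
    return fields

row = [b"alice", b"42,quoted\ndata"]  # commas and newlines need no escaping
roundtrip = decode_row(encode_row(row))
```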