I get ~50% of the speed of the article's variant with no SIMD at all in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim
While it's in Nim, the main logic is really just about one screenful for me and should not be too hard to follow.
As commented elsewhere, but it bears repeating: a better approach is to bulk convert text to binary and then operate off of that. One feature you get is fixed-size rows, and thus random access to rows without an index. You can even mmap the file and cast it to a struct pointer if you like (though you need the struct pointer to be the right type). When DBs or DB-ish file formats are faster, being in binary is the 0th-order reason why.
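A minimal sketch of the fixed-size-rows idea, in Python rather than Nim (the record layout of int64 id + float64 value is a made-up example): pack each row into a fixed-width binary record, then mmap the file and read row i by offset arithmetic alone, no index needed.

```python
# Convert rows to fixed-size binary records, then random-access them via mmap.
import mmap
import os
import struct
import tempfile

REC = struct.Struct("<qd")  # one row: int64 id, float64 value (16 bytes)

rows = [(1, 3.5), (2, -0.25), (3, 100.0)]

# Bulk-convert once: write every row as a fixed-size packed record.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for r in rows:
        f.write(REC.pack(*r))

# Random access without an index: row i lives at byte offset i * REC.size.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    i = 2
    rec = REC.unpack_from(mm, i * REC.size)
    print(rec)  # (3, 100.0)
    mm.close()
os.unlink(path)
```

This is the Python analogue of casting an mmap'd region to a struct pointer; in C you would `mmap()` and cast directly, with no per-row unpack cost.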
The main reason not to do this is if you have no disk space for it.
Also, you will sleep better at night knowing that your column dtypes are safe from harm, exactly as you stored them. Moving from CSV (or god forbid, .xlsx) has been such a quality of life improvement.
One thing I miss though is how easy it is to inspect .csv and .xlsx. I kinda solved it using [1], but it only works on Windows. More portable recommendations welcome!
[1] https://github.com/mukunku/ParquetViewer
I used to use Zeppelin, a kind of Jupyter notebook for Spark (that supports Parquet). But there may be better alternatives.
https://zeppelin.apache.org/
I really should write up how we did delimiter and quote detection in this library:
https://github.com/capitalone/DataProfiler
It turns out delimited files, IMO, are much harder to parse than, say, JSON, largely because they have so many different permutations. The article covers CSVs, but many files are tab- or null-separated. We’ve even seen @-separated files with ‘ for quotes.
Given the above, it should still be possible to use the method described. I’m guessing you’d have to detect the separators and quote chars first, however. You’d also have to handle empty rows and corrupted rows (which happen often enough).
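This isn't DataProfiler's algorithm (the comment above doesn't show it), but as a rough stdlib stand-in, Python's `csv.Sniffer` does frequency/consistency-based dialect detection, which illustrates the "detect first, then parse" order, here on an @-separated sample with ' quotes like the one mentioned:

```python
# Detect delimiter and quote char from a sample, then parse with the result.
import csv
import io

sample = "a@b@c\n'x@1'@2@3\n'y'@4@5\n"

# Sniffer guesses the dialect from the sample; restricting the candidate
# delimiters makes detection more reliable on short samples.
dialect = csv.Sniffer().sniff(sample, delimiters="@,\t;|")
print(dialect.delimiter)   # '@'
print(dialect.quotechar)   # "'"

# Parse with the detected dialect, skipping empty rows as suggested above.
rows = [r for r in csv.reader(io.StringIO(sample), dialect=dialect) if r]
print(rows)
```

`Sniffer.sniff` raises `csv.Error` when it can't settle on a dialect, so corrupted or ambiguous samples still need a fallback path.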