Fast CSV Processing with SIMD

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • nio

    Low Overhead Numerical/Native IO library & tools (by c-blake)

  • I get ~50% of the speed of the article's variant with no SIMD at all in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim

    While it's in Nim, the main logic is really just about one screenful and should not be too hard to follow.

    As commented elsewhere, but it bears repeating: a better approach is to bulk-convert the text to binary and then operate off of that. One feature you get is fixed-size rows, and thus random access to rows without an index. You can even mmap the file and cast it to a struct pointer if you like (though you need the struct pointer to be the right type; see the sketch below). When DBs or DB-ish file formats are faster, being in binary is the 0th-order reason why.

    The main reason not to do this is if you have no disk space for it.
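
    A minimal sketch of that fixed-size-row idea in Python/NumPy (the column names, dtypes, and file names here are invented for illustration; nio's actual on-disk format is its own):

    ```python
    import numpy as np

    # Hypothetical fixed-size row layout: one float64 and one int32 per row.
    # Field names and dtypes are illustrative, not nio's real format.
    row_dtype = np.dtype([("price", "<f8"), ("volume", "<i4")])

    # One-time conversion: parse the CSV into a binary file of fixed-size rows.
    table = np.loadtxt("data.csv", delimiter=",", dtype=row_dtype, skiprows=1)
    table.tofile("data.bin")

    # Later: memory-map the binary file.  Row i lives at byte offset
    # i * row_dtype.itemsize, so random access needs no index and no re-parsing.
    rows = np.memmap("data.bin", dtype=row_dtype, mode="r")
    mid = len(rows) // 2
    print(rows[mid]["price"])          # O(1) access to an arbitrary row
    print(rows["volume"][:10].sum())   # columnar-style slice straight off the mmap
    ```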

  • ParquetViewer

    Simple Windows desktop application for viewing & querying Apache Parquet files

  • Also, you will sleep better at night knowing that your column dtypes are safe from harm, exactly as you stored them. Moving from CSV (or god forbid, .xlsx) has been such a quality of life improvement.

    One thing I miss, though, is how easy it is to inspect .csv and .xlsx files. I kinda solved it using [1], but it only works on Windows. More portable recommendations welcome!

    [1] https://github.com/mukunku/ParquetViewer
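
    On the "more portable" question, one cross-platform option (my suggestion, not one from the thread) is to peek at Parquet files with pyarrow; the file name below is a placeholder, and to_pandas() additionally requires pandas:

    ```python
    import pyarrow.parquet as pq

    # Open lazily: this reads only the footer metadata, not the data pages.
    pf = pq.ParquetFile("data.parquet")

    print(pf.schema_arrow)   # column names and dtypes, exactly as stored
    print(pf.metadata.num_rows, "rows in", pf.metadata.num_row_groups, "row groups")

    # Pull just the first few rows for a quick visual check.
    first_batch = next(pf.iter_batches(batch_size=10))
    print(first_batch.to_pandas())
    ```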

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

  • I used to use Zeppelin, a kind of Jupyter Notebook for Spark (that supports Parquet), but there may be better alternatives.

    https://zeppelin.apache.org/

  • DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  • I really should write up how we did delimiter and quote detection in this library:

    https://github.com/capitalone/DataProfiler

    It turns out that delimited files are, IMO, much harder to parse than, say, JSON, largely because they come in so many different permutations. The article covers CSVs, but many files are tab- or null-separated. We’ve even seen @-separated files with ‘ for quotes.

    Given the above, it should still be possible to use the method described; I’m guessing you’d have to detect the separators and quote chars first, however. You’d also have to handle empty rows and corrupted rows (which happen often enough).
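
    A rough sketch of that detection-then-parse flow, using Python's standard-library csv.Sniffer with a hand-picked candidate list (an illustration only, not how DataProfiler actually does it):

    ```python
    import csv

    def detect_dialect(path, sample_bytes=64 * 1024):
        """Guess the delimiter and quote character from a sample of the file."""
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            sample = f.read(sample_bytes)
        # Candidate delimiters are an assumption; csv.Error is raised if none fit.
        dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|@")
        return dialect.delimiter, dialect.quotechar

    delim, quote = detect_dialect("data.csv")

    # Parse with the detected dialect, skipping empty rows and rows whose field
    # count does not match the header ("corrupted" rows).
    with open("data.csv", newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f, delimiter=delim, quotechar=quote)
        header = next(reader)
        for row in reader:
            if not row or len(row) != len(header):
                continue
            ...  # process the row here
    ```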



Related posts

  • 40x Faster! We rewrote our project with Rust!

    5 projects | /r/rust | 30 Jan 2023
  • Wanting to move away from SQL

    2 projects | /r/dataengineering | 25 Feb 2022
  • How to use IPython in Apache Zeppelin Notebook

    2 projects | dev.to | 10 Jul 2021
  • Using InterSystems Caché and Apache Zeppelin

    1 project | dev.to | 25 Apr 2021
  • Is there a way to collaborate in real-time for Jupyter Notebooks?

    2 projects | /r/learnpython | 21 Mar 2021