Fast CSV Processing with SIMD

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • nio

    Low Overhead Numerical/Native IO library & tools (by c-blake)

    I get ~50% of the speed of the article's variant with no SIMD at all in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim

    While it's written in Nim, the main logic is really just about one screenful and should not be hard to follow.
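
    For flavor, here is a minimal sketch of the same CSV-to-TSV idea in Python. This is not c-blake's Nim implementation, just the shape of the logic: parse quoted CSV once, then emit simpler tab-separated records.

    ```python
    # Sketch: convert quoted CSV on stdin to TSV on stdout. Embedded
    # tabs/newlines inside fields are replaced with spaces so that every
    # output row is exactly one line and every field is tab-delimited.
    import csv
    import sys

    def c2tsv(src, dst):
        for row in csv.reader(src):
            cleaned = [f.replace("\t", " ").replace("\n", " ") for f in row]
            dst.write("\t".join(cleaned) + "\n")

    if __name__ == "__main__":
        c2tsv(sys.stdin, sys.stdout)
    ```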

    As commented elsewhere, but it bears repeating: a better approach is to bulk convert the text to binary and then operate off of that. One feature you get is fixed-size rows, and thus random access to rows without an index. You can even mmap the file and cast it to a struct pointer if you like (though you need the struct pointer to be the right type). When DBs or DB-ish file formats are faster, being in binary is the 0th-order reason why. (A sketch of this follows below.)

    The main reason not to do this is if you have no disk space for it.
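
    A rough Python/numpy sketch of the text-to-binary idea (the three-column schema and file names are assumptions for illustration, not part of nio):

    ```python
    import numpy as np

    # One-time conversion: parse the text once, write fixed-size binary rows.
    dtype = np.dtype([("id", "<i8"), ("price", "<f8"), ("qty", "<i4")])
    rows = np.loadtxt("data.tsv", delimiter="\t", dtype=dtype)
    rows.tofile("data.bin")

    # Later: mmap the binary file; this is the "cast to a struct pointer"
    # move. Row i lives at a fixed byte offset, so access is O(1) and
    # needs no index and no parsing.
    table = np.memmap("data.bin", dtype=dtype, mode="r")
    i = 123_456  # any row index
    print(table[i]["price"])
    ```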

  • ParquetViewer

    Simple Windows desktop application for viewing & querying Apache Parquet files

    Also, you will sleep better at night knowing that your column dtypes are safe from harm, exactly as you stored them. Moving from CSV (or god forbid, .xlsx) has been such a quality of life improvement.

    One thing I miss though is how easy it is to inspect .csv and .xlsx. I kinda solved it using [1], but it only works on Windows. More portable recommendations welcome!

    [1] https://github.com/mukunku/ParquetViewer
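
    One portable option (an editor's suggestion, not something the commenter mentioned): pyarrow runs on Linux/macOS/Windows and can inspect a Parquet file from any Python REPL.

    ```python
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")  # hypothetical file name
    print(pf.schema_arrow)               # column names and dtypes, as stored
    print(pf.metadata.num_rows, "rows in",
          pf.metadata.num_row_groups, "row groups")

    # Peek at the first few rows without loading the whole file.
    print(next(pf.iter_batches(batch_size=5)).to_pydict())
    ```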

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    I used to use Zeppelin, a kind of Jupyter Notebook for Spark (which supports Parquet). But there may be better alternatives.

    https://zeppelin.apache.org/
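
    What that notebook workflow boils down to, as a generic PySpark sketch (not Zeppelin-specific code; the path and column name are assumptions): read Parquet, register it as a view, query it with SQL.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("notebook").getOrCreate()
    df = spark.read.parquet("events.parquet")  # hypothetical path
    df.createOrReplaceTempView("events")
    spark.sql(
        "SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id"
    ).show()
    ```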

  • DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

    I really should write up how we did delimiter and quote detection in this library:

    https://github.com/capitalone/DataProfiler

    It turns out that delimited files are, IMO, much harder to parse than, say, JSON, largely because they come in so many different permutations. The article covers CSVs, but many files are tab- or null-separated. We’ve even seen @-separated files with ‘ for quotes.

    Given the above, it should still be possible to use the method described. I’m guessing you’d have to detect the separators and quote chars first, however. You’d also have to handle empty rows and corrupted rows (which happen often enough). A rough sketch of such detection follows below.
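
    DataProfiler's actual detection method isn't described in this thread; as a stand-in, here is a simple frequency-consistency heuristic plus the stdlib csv.Sniffer, which does dialect detection out of the box.

    ```python
    import csv

    # Candidate delimiters, including the exotic ones mentioned above.
    CANDIDATES = [",", "\t", ";", "|", "\x00", "@"]

    def guess_delimiter(sample: str) -> str:
        # Prefer the candidate that appears on every line with the most
        # consistent per-line count (real delimiters are regular; stray
        # characters are not).
        lines = [ln for ln in sample.splitlines() if ln.strip()][:50]
        def score(d):
            counts = [ln.count(d) for ln in lines]
            if not counts or min(counts) == 0:
                return -1
            return min(counts) - (max(counts) - min(counts))
        return max(CANDIDATES, key=score)

    sample = open("mystery_file.txt", newline="").read(64 * 1024)
    print("heuristic:", repr(guess_delimiter(sample)))

    dialect = csv.Sniffer().sniff(sample)  # stdlib alternative
    print("sniffer:", repr(dialect.delimiter), repr(dialect.quotechar))
    ```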
