Fast CSV Processing with SIMD

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. nio

    Low Overhead Numerical/Native IO library & tools (by c-blake)

    I get about 50% of the speed of the article's variant with no SIMD at all in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim

    While it's in Nim, the main logic is really just about one screenful and should not be hard to follow.

    As commented elsewhere, but it bears repeating: a better approach is to bulk-convert text to binary and then operate on that. One feature you get is fixed-size rows, and thus random access to rows without an index. You can even mmap the file and cast it to a struct pointer if you like (though the struct pointer must be of the right type). When DBs or DB-ish file formats are faster, being in binary is the zeroth-order reason why.

    The main reason not to do this is if you have no disk space for it.
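The bulk text-to-binary idea above can be sketched in a few lines of Python. This is a minimal illustration, not the nio tool itself: the 3-column layout (int64 id, float64 price, int32 qty), the sample data, and the file name are all invented for the example.

```python
import mmap
import struct

# Fixed row layout: little-endian int64, float64, int32 -> 20 bytes per row.
row = struct.Struct("<qdi")

# One-time bulk conversion: delimited text -> fixed-size binary rows.
lines = ["1,9.50,3", "2,7.25,1", "3,3.00,8"]  # stand-in for a CSV file
with open("table.bin", "wb") as f:
    for line in lines:
        i, p, q = line.split(",")
        f.write(row.pack(int(i), float(p), int(q)))

# With fixed-size rows, random access to row k is just an offset -- no index.
with open("table.bin", "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    k = 2
    print(row.unpack_from(m, k * row.size))  # -> (3, 3.0, 8)
```

The `mmap` + `unpack_from` pair is the Python analogue of mmapping the file and casting to a struct pointer in C: no parsing happens at query time, only a pointer offset.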

  2. InfluxDB

    Built for high-performance time series workloads. Transform, enrich, and act on time series data directly in the database.
  3. ParquetViewer

    Simple Windows desktop application for viewing & querying Apache Parquet files

    Also, you will sleep better at night knowing that your column dtypes are safe from harm, exactly as you stored them. Moving from CSV (or god forbid, .xlsx) has been such a quality of life improvement.

    One thing I miss though is how easy it is to inspect .csv and .xlsx. I kinda solved it using [1], but it only works on Windows. More portable recommendations welcome!

    [1] https://github.com/mukunku/ParquetViewer

  4. Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    I used to use Zeppelin, a kind of Jupyter Notebook for Spark (which supports Parquet). But there may be better alternatives.

    https://zeppelin.apache.org/

  5. DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

    I really should write up how we did delimiter and quote detection in this library:

    https://github.com/capitalone/DataProfiler

    It turns out delimited files, IMO, are much harder to parse than, say, JSON, largely because they have so many different permutations. The article covers CSVs, but many files are tab- or null-separated. We've even seen @-separated files with ' for quotes.

    Given the above, it should still be possible to use the method described. I'm guessing you'd have to detect the separators and quote chars first, however. You'd also have to handle empty rows and corrupted rows (which happen often enough).
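For reference, Python's standard library already ships this kind of delimiter and quote detection in `csv.Sniffer` (this is not DataProfiler's method, just a stdlib sketch; the sample strings are invented):

```python
import csv

samples = {
    "comma": 'a,b,c\n"x,1",2,3\n',   # comma-separated, with quoted fields
    "tab":   "a\tb\tc\n1\t2\t3\n",   # tab-separated, no quoting
}

for name, text in samples.items():
    dialect = csv.Sniffer().sniff(text)  # guess delimiter and quote char
    rows = list(csv.reader(text.splitlines(), dialect))
    print(name, repr(dialect.delimiter), rows)
```

`Sniffer.sniff` also accepts a `delimiters` string to restrict the guess to a known set, which helps with unusual separators like `@`.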

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • 📊 Visualise Presto Queries with Apache Zeppelin: A Hands-On Guide

    1 project | dev.to | 12 May 2025
  • 40x Faster! We rewrote our project with Rust!

    5 projects | /r/rust | 30 Jan 2023
  • Wanting to move away from SQL

    2 projects | /r/dataengineering | 25 Feb 2022
  • How to use IPython in Apache Zeppelin Notebook

    2 projects | dev.to | 10 Jul 2021
  • Using InterSystems Caché and Apache Zeppelin

    1 project | dev.to | 25 Apr 2021
