Fast CSV Processing with SIMD

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • nio

    Low Overhead Numerical/Native IO library & tools (by c-blake)

  • I get ~50% of the speed of the article's variant with no SIMD at all in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim

    While it's in Nim, the main logic is really just about one screenful and should not be too hard to follow.

    As commented elsewhere, but it bears repeating: a better approach is to bulk-convert the text to binary and then operate off of that. One feature you get is fixed-size rows, and thus random access to rows without an index. You can even mmap the file and cast it to a struct pointer if you like (though you need the struct pointer to be the right type; see the sketch below). When DBs or DB-ish file formats are faster, being in binary is the 0th-order reason why.

    The main reason not to do this is if you have no disk space for it.
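
    A minimal sketch of that fixed-size-row idea in Python/NumPy (the column names, dtypes, and file names here are invented for illustration; nio's actual on-disk format is its own):

    ```python
    import numpy as np

    # Hypothetical fixed-size row layout: one float64 and one int32 per row.
    # Field names and dtypes are illustrative, not nio's real format.
    row_dtype = np.dtype([("price", "<f8"), ("volume", "<i4")])

    # One-time conversion: parse the CSV into a binary file of fixed-size rows.
    table = np.loadtxt("data.csv", delimiter=",", dtype=row_dtype, skiprows=1)
    table.tofile("data.bin")

    # Later: memory-map the binary file.  Row i lives at byte offset
    # i * row_dtype.itemsize, so random access needs no index and no re-parsing.
    rows = np.memmap("data.bin", dtype=row_dtype, mode="r")
    mid = len(rows) // 2
    print(rows[mid]["price"])          # O(1) access to an arbitrary row
    print(rows["volume"][:10].sum())   # columnar-style slice straight off the mmap
    ```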

  • ParquetViewer

    Simple Windows desktop application for viewing & querying Apache Parquet files

  • Also, you will sleep better at night knowing that your column dtypes are safe from harm, exactly as you stored them. Moving from CSV (or god forbid, .xlsx) has been such a quality of life improvement.

    One thing I miss, though, is how easy it is to inspect .csv and .xlsx files. I kinda solved it using [1], but it only works on Windows. More portable recommendations welcome!

    [1] https://github.com/mukunku/ParquetViewer
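
    On the "more portable" question, one cross-platform option (my suggestion, not one from the thread) is to peek at Parquet files with pyarrow; the file name below is a placeholder, and to_pandas() additionally requires pandas:

    ```python
    import pyarrow.parquet as pq

    # Open lazily: this reads only the footer metadata, not the data pages.
    pf = pq.ParquetFile("data.parquet")

    print(pf.schema_arrow)   # column names and dtypes, exactly as stored
    print(pf.metadata.num_rows, "rows in", pf.metadata.num_row_groups, "row groups")

    # Pull just the first few rows for a quick visual check.
    first_batch = next(pf.iter_batches(batch_size=10))
    print(first_batch.to_pandas())
    ```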

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

  • I used to use Zeppelin, a kind of Jupyter Notebook for Spark (that supports Parquet), but there may be better alternatives.

    https://zeppelin.apache.org/

  • DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  • I really should write up how we did delimiter and quote detection in this library:

    https://github.com/capitalone/DataProfiler

    It turns out that delimited files are, IMO, much harder to parse than, say, JSON, largely because they come in so many different permutations. The article covers CSVs, but many files are tab- or null-separated. We’ve even seen @-separated files with ‘ for quotes.

    Given the above, it should still be possible to use the method described; I’m guessing you’d have to detect the separators and quote chars first, however. You’d also have to handle empty rows and corrupted rows (which happen often enough).
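
    A rough sketch of that detection-then-parse flow, using Python's standard-library csv.Sniffer with a hand-picked candidate list (an illustration only, not how DataProfiler actually does it):

    ```python
    import csv

    def detect_dialect(path, sample_bytes=64 * 1024):
        """Guess the delimiter and quote character from a sample of the file."""
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            sample = f.read(sample_bytes)
        # Candidate delimiters are an assumption; csv.Error is raised if none fit.
        dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|@")
        return dialect.delimiter, dialect.quotechar

    delim, quote = detect_dialect("data.csv")

    # Parse with the detected dialect, skipping empty rows and rows whose field
    # count does not match the header ("corrupted" rows).
    with open("data.csv", newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f, delimiter=delim, quotechar=quote)
        header = next(reader)
        for row in reader:
            if not row or len(row) != len(header):
                continue
            ...  # process the row here
    ```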



Related posts

  • 40x Faster! We rewrote our project with Rust!

    5 projects | /r/rust | 30 Jan 2023
  • Wanting to move away from SQL

    2 projects | /r/dataengineering | 25 Feb 2022
  • How to use IPython in Apache Zeppelin Notebook

    2 projects | dev.to | 10 Jul 2021
  • Using InterSystems Caché and Apache Zeppelin

    1 project | dev.to | 25 Apr 2021
  • Is there a way to collaborate in real-time for Jupyter Notebooks?

    2 projects | /r/learnpython | 21 Mar 2021