Parquet: More than just “Turbo CSV”

ryu

Mentions: 1 | Stars: 1,131 | Activity: 5.9 | Language: C++

Converts floating point numbers to decimal strings (by ulfjack)

> There isn't really a CSV standard that defines the precise grammar of CSV.
Did you read the link going to a page literally titled: "Parsing JSON is a Minefield."?
JSON has a "precise" grammar intended to be human readable. The end result is a mess, vulnerable to attacks due to dissimilarities between different implementations.
Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu
Why bother, you ask? Why would anyone bother to make floating point number parsing super efficient?
JSON.
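The kind of implementation differences the comment above refers to can be shown with nothing but Python's standard library. This is a minimal sketch, not taken from the linked article: it only demonstrates how CPython's `json` module happens to behave on three inputs where JSON parsers are known to disagree (duplicate keys, `NaN`, and integers beyond 2**53). The variable names are illustrative.

```python
import json
import math

# Duplicate keys: the JSON grammar permits them, but what a parser does
# with them is implementation-defined. CPython keeps the last value.
dup = json.loads('{"a": 1, "a": 2}')

# NaN is not part of the JSON grammar, yet CPython accepts it by default.
nan_value = json.loads("NaN")

# Large integers: CPython keeps arbitrary precision, whereas a parser that
# maps every number to an IEEE 754 double has to round this value.
big_text = "9007199254740993"  # 2**53 + 1
as_int = json.loads(big_text)
as_double = float(big_text)

print(dup)                     # {'a': 2}
print(math.isnan(nan_value))   # True
print(as_int, int(as_double))  # 9007199254740993 9007199254740992
```

Another parser may reject the second input outright or keep the first duplicate key, which is exactly where two systems reading the same document can end up disagreeing.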
arrow-tools

Mentions: 0 | Stars: 121 | Activity: 8.6 | Language: Rust

A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet

If you need a quick tool to convert your CSV files, you can use csv2parquet from https://github.com/domoritz/arrow-tools.
parquet-format

Mentions: 3 | Stars: 1,633 | Activity: 7.4 | Language: Java

Apache Parquet

> Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.
The Parquet datatypes documentation is pretty clear that there is a flag, isAdjustedToUTC, that defines whether the timestamp should be interpreted as having Instant semantics or Local semantics.
https://github.com/apache/parquet-format/blob/master/Logical...
There is still no option to include a TZ offset in the data (so that the same datum could be interpreted with both Local and Instant semantics), but that's not bad really.
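To make the Instant versus Local distinction concrete, here is a small sketch using only Python's standard library. It does not read a Parquet file; it assumes a hypothetical stored value in microseconds since the epoch and shows how the same integer is interpreted under each setting of isAdjustedToUTC. The names `stored_micros`, `as_instant`, and `as_local` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stored value: microseconds since 1970-01-01 00:00:00.
stored_micros = 1_680_523_200_000_000

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_NAIVE = datetime(1970, 1, 1)


def as_instant(micros):
    """isAdjustedToUTC = true: a single point on the UTC timeline."""
    return EPOCH_UTC + timedelta(microseconds=micros)


def as_local(micros):
    """isAdjustedToUTC = false: a wall-clock reading with no zone attached."""
    return EPOCH_NAIVE + timedelta(microseconds=micros)


instant = as_instant(stored_micros)
local = as_local(stored_micros)

print(instant.isoformat())  # 2023-04-03T12:00:00+00:00
print(local.isoformat())    # 2023-04-03T12:00:00

# An instant can be rendered in any zone, but the offset comes from the
# reader, not from the stored data.
reader_zone = timezone(timedelta(hours=2))
print(instant.astimezone(reader_zone).isoformat())  # 2023-04-03T14:00:00+02:00
```

Because the stored integer carries no offset of its own, a writer has to pick one of the two interpretations up front, which is the limitation the comment points out.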
fast_float

Mentions: 0 | Stars: 1,267 | Activity: 8.8 | Language: C++

Fast and exact implementation of the C++ from_chars functions for number types: 4x to 10x faster than strtod, part of GCC 12 and WebKit/Safari

> Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu
It's not a parsing library but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double. It is not by Google, but it was indeed motivated by parsing JSON fast: https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
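The two directions are easy to mix up, so here is a short sketch of both using Python's built-ins. It does not call Ryu or fast_float; CPython has its own implementations, and the example only illustrates which problem each library solves.

```python
from decimal import Decimal

x = 0.1 + 0.2

# Printing direction (double -> string), the problem Ryu addresses:
# repr() produces the shortest decimal string that still round-trips.
shortest = repr(x)

# The exact binary value has many more digits than the shortest form.
exact = str(Decimal(x))

# Parsing direction (string -> double), the problem fast_float addresses:
# both strings must map back to the very same double.
from_shortest = float(shortest)
from_exact = float(exact)

print(shortest)                         # 0.30000000000000004
print(len(exact) > len(shortest))       # True
print(from_shortest == x == from_exact)  # True
```

A printer has to find the shortest string that still identifies the double, while a parser has to round an arbitrary decimal string correctly; both are hot paths when large volumes of JSON numbers are written and read.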
rapidgzip

Mentions: 10 | Stars: 311 | Activity: 9.6 | Language: C++

Gzip Decompression and Random Access for Modern Multi-Core Machines

Decompression of arbitrary gzip files can be parallelized with pragzip (since renamed rapidgzip): https://github.com/mxmlnkn/pragzip
ClickHouse

Mentions: 73 | Stars: 33,909 | Activity: 10.0 | Language: C++

ClickHouse® is a free analytics DBMS for big data

https://github.com/ClickHouse/ClickHouse/pull/45878
Also, we still have optimizations for reading Parquet from S3 coming, so that might improve.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Big Data Database Cpp11 Parquet Dbms
Post date: 3 Apr 2023
