Parquet: More than just “Turbo CSV”

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • ryu

    Converts floating point numbers to decimal strings (by ulfjack)

    > There isn't really a CSV standard that defines the precise grammar of CSV.

    Did you read the link going to a page literally titled: "Parsing JSON is a Minefield."?

    JSON has a "precise" grammar intended to be human readable. The end result is a mess, vulnerable to attacks due to dissimilarities between different implementations.

    Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

    Why bother, you ask? Why would anyone bother to make floating point number parsing super efficient?

    JSON.

  • arrow-tools

    A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet

    If you need a quick tool to convert your CSV files, you can use csv2parquet from https://github.com/domoritz/arrow-tools.

  • parquet-format

    Apache Parquet

    A date with a timezone (UTC or otherwise) is confusing, and the documentation makes no such suggestion.

    The Parquet data types documentation is pretty clear that there is a flag, isAdjustedToUTC, that defines whether a timestamp should be interpreted as having Instant semantics or Local semantics.

    https://github.com/apache/parquet-format/blob/master/Logical...

    Still no option to include a TZ offset in the data (so the same datum can be interpreted with both Local and Instant semantics) but not bad really.
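
The difference between the two semantics can be sketched with the Python standard library alone. This is an illustration, not Parquet's own API: the function name decode_timestamp_micros and the sample value are hypothetical, and the sketch assumes a TIMESTAMP stored as microseconds since 1970-01-01 00:00:00.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)  # anchor for Instant semantics
EPOCH_WALL = datetime(1970, 1, 1)                      # anchor for Local semantics (naive)


def decode_timestamp_micros(value: int, is_adjusted_to_utc: bool) -> datetime:
    """Hypothetical helper: interpret a stored microsecond count.

    is_adjusted_to_utc=True  -> a point on the global timeline (aware, UTC)
    is_adjusted_to_utc=False -> a wall-clock reading with no zone (naive)
    """
    delta = timedelta(microseconds=value)
    if is_adjusted_to_utc:
        return EPOCH_UTC + delta
    return EPOCH_WALL + delta


stored = 1_700_000_000_000_000  # the same 64-bit value in both cases

instant = decode_timestamp_micros(stored, is_adjusted_to_utc=True)
local = decode_timestamp_micros(stored, is_adjusted_to_utc=False)

# An instant can be rendered in any zone; the wall-clock digits change.
tokyo = instant.astimezone(timezone(timedelta(hours=9)))

print(instant.isoformat())  # 2023-11-14T22:13:20+00:00
print(tokyo.isoformat())    # 2023-11-15T07:13:20+09:00
print(local.isoformat())    # 2023-11-14T22:13:20
```

The stored integer is identical in both cases; only the flag decides whether a reader may shift it into another zone. Because no offset is stored alongside the value, the original zone of a Local timestamp cannot be recovered from the data itself, which is the limitation the comment above points out.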

  • fast_float

    Fast and exact implementation of the C++ from_chars functions for number types: 4x to 10x faster than strtod, part of GCC 12 and WebKit/Safari

    > Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

    Ryu is not a parsing library but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double. It is not by Google, but it was indeed motivated by parsing JSON fast: https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
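
The two directions are easy to mix up, so here is a small sketch of each using only Python built-ins. It uses neither Ryu nor fast_float; print_double and parse_double are hypothetical wrapper names. Python's repr is assumed here only because it also emits the shortest decimal string that parses back to the same double, which is the property a printing library is after.

```python
import struct


def print_double(x: float) -> str:
    """Printing direction: double -> string (shortest round-tripping form)."""
    return repr(x)


def parse_double(text: str) -> float:
    """Parsing direction: string -> double (correctly rounded)."""
    return float(text)


def bits(x: float) -> bytes:
    """Raw IEEE 754 bytes, so values are compared bit for bit."""
    return struct.pack(">d", x)


samples = [0.1, 0.1 + 0.2, 1 / 3, 1e23, 5e-324, 2.0 ** 53]

for value in samples:
    text = print_double(value)
    back = parse_double(text)
    # A correct printer/parser pair must reproduce the exact same double.
    assert bits(back) == bits(value), (value, text)
    print(f"{text!r:>25} round-trips exactly")
```

A JSON serializer exercises the printing direction for every number it writes and a JSON reader exercises the parsing direction for every number it reads, which is why both end up on the hot path.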

  • rapidgzip

    Gzip Decompression and Random Access for Modern Multi-Core Machines

    Decompression of arbitrary gzip files can be parallelized with pragzip (the project is now named rapidgzip): https://github.com/mxmlnkn/pragzip

  • ClickHouse

    ClickHouse® is a free analytics DBMS for big data

    https://github.com/ClickHouse/ClickHouse/pull/45878

    Also, we still have optimizations for reading Parquet from S3 coming, so that might improve.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
