parquet-format
fast_float
| | parquet-format | fast_float |
|---|---|---|
| Mentions | 4 | 15 |
| Stars | 1,655 | 1,284 |
| Growth | 2.4% | 2.0% |
| Activity | 7.2 | 8.7 |
| Last commit | 5 days ago | about 2 months ago |
| Language | Thrift | C++ |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
parquet-format
- Summing columns in remote Parquet files using DuckDB
Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format
The offsets of said metadata are well-defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
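To illustrate the range-request idea in the comment above, here is a minimal Python sketch of locating the footer. It relies on the Parquet layout in which a file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`. The helper name `footer_byte_range` and the toy in-memory "file" are hypothetical, and the sketch assumes an unencrypted footer. With real blob storage, the slices would be replaced by two ranged reads.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # magic bytes at the start and end of a Parquet file


def footer_byte_range(tail: bytes, file_size: int) -> Tuple[int, int]:
    """Return the half-open byte range [start, end) of the footer metadata.

    `tail` must be the last 8 bytes of the file: a 4-byte little-endian
    footer length followed by the 4-byte magic.
    (Hypothetical helper, not part of any Parquet library.)
    """
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("not a Parquet file tail")
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = file_size - 8          # footer ends where the length field starts
    start = end - footer_len
    if start < len(MAGIC):
        raise ValueError("footer length exceeds file size")
    return start, end


# Toy stand-in for a remote object: magic, fake column data, fake footer,
# footer length, magic. A real footer is Thrift-encoded metadata.
fake_footer = b"metadata-bytes"
blob = MAGIC + b"column-data" + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

# First "range request": only the last 8 bytes.
start, end = footer_byte_range(blob[-8:], len(blob))
# Second "range request": only the footer itself.
print(blob[start:end])
```

Two small reads are enough to fetch the metadata, regardless of how large the column data is.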
- FLaNK Stack for 4th of July
- I have question related to Parquet files and AWS Glue
As I read here https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, they are stored as integers, and these integers represent the number of days (for Date) or the number of milliseconds, microseconds or nanoseconds (for DateTime) since 1970-01-01. This works as expected with the Parquet files written by our ETL tool from the internal database to S3: all Date/DateTime columns are integers, which means that in the Glue job I have to convert these integers back to Date/DateTime values to do some transformations on them. But when Parquet files are written by Spark, the columns are in Date/DateTime (or Timestamp, to be more precise) format, not integers (I checked by reading these files again in another Glue job), and that confuses me.
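The conversion described in this question can be sketched in plain Python. This is only an illustration of the arithmetic (days or sub-second units since 1970-01-01), not Glue or Spark code. The function names and the unit strings are assumptions made for the example, and the timestamps are treated as UTC instants.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_from_days(days: int) -> date:
    """Convert a day count since 1970-01-01 to a calendar date."""
    return EPOCH_DATE + timedelta(days=days)


def timestamp_from_int(value: int, unit: str) -> datetime:
    """Convert an integer timestamp to a datetime, assuming a UTC instant.

    `unit` is one of "MILLIS", "MICROS", "NANOS" (names chosen for this sketch).
    """
    if unit == "MILLIS":
        micros = value * 1000
    elif unit == "MICROS":
        micros = value
    elif unit == "NANOS":
        micros = value // 1000  # datetime only has microsecond resolution
    else:
        raise ValueError(f"unknown unit: {unit}")
    return EPOCH_UTC + timedelta(microseconds=micros)


print(date_from_days(19000))               # 2022-01-08
print(timestamp_from_int(1000, "MILLIS"))  # 1970-01-01 00:00:01+00:00
```

A reader that honors the logical type annotation applies this kind of conversion itself, which is one possible reason the same physical integers can surface as dates in one job and as raw numbers in another.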
- Parquet: More than just “Turbo CSV”
Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.
The Parquet datatypes documentation is pretty clear that there is a flag isAdjustedToUTC to define if the timestamp should be interpreted as having Instant semantics or Local semantics.
https://github.com/apache/parquet-format/blob/master/Logical...
Still no option to include a TZ offset in the data (so the same datum can be interpreted with both Local and Instant semantics) but not bad really.
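The difference between the two semantics can be sketched with Python's aware and naive datetimes. This is an analogy under stated assumptions, not a Parquet reader: `decode_timestamp_micros` is a hypothetical helper, and its `is_adjusted_to_utc` parameter stands in for the `isAdjustedToUTC` flag mentioned above.

```python
from datetime import datetime, timedelta, timezone


def decode_timestamp_micros(micros: int, is_adjusted_to_utc: bool) -> datetime:
    """Interpret a microsecond count since the epoch under either semantics."""
    wall_clock = datetime(1970, 1, 1) + timedelta(microseconds=micros)
    if is_adjusted_to_utc:
        # Instant semantics: a point on the global timeline, anchored to UTC.
        return wall_clock.replace(tzinfo=timezone.utc)
    # Local semantics: a wall-clock reading with no time zone attached.
    return wall_clock


instant = decode_timestamp_micros(0, True)
local = decode_timestamp_micros(0, False)

# An instant can be re-expressed in any zone; a local value cannot,
# because the stored integer carries no offset.
print(instant.astimezone(timezone(timedelta(hours=2))))
print(local.tzinfo)
```

The stored integer is identical in both cases; only the flag decides how it should be read, which matches the point that no per-value offset is kept in the data.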
fast_float
- Parquet: More than just “Turbo CSV”
> Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu
It's not a parsing library but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double; it is not by Google, but it was indeed motivated by parsing JSON fast: https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
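The two directions distinguished in this comment can be shown with Python built-ins. This sketch uses neither Ryu nor fast_float; `repr` and `float` merely stand in for the printing and parsing roles, and the wrapper names are made up for the example.

```python
def print_double(x: float) -> str:
    """Printing direction (double -> string), the role Ryu plays."""
    return repr(x)  # shortest string that round-trips to the same double


def parse_double(s: str) -> float:
    """Parsing direction (string -> double), the role fast_float plays."""
    return float(s)


x = 0.1 + 0.2
text = print_double(x)
print(text)                     # 0.30000000000000004
print(parse_double(text) == x)  # True
```

A good printer and a good parser are inverse operations on the printer's output, which is why the two kinds of library are often mentioned together even though they solve different problems.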
- What do number conversions (from string) cost?
For those that don't know, gcc 12.x updated its float parsing logic to something similar to fast_float and it's about 1/6 of the cost presented here (sub 100 in the graph presented here). Strongly suggest using that library or upgrading the compiler if you need the performance.
- Can sanitizers find the two bugs I wrote in C++?
This makes sense for integers, but beware floating-point from_chars: libc++ still doesn't implement it, and libstdc++ implements it by wrapping locale-dependent libc functions, which involves temporarily changing the thread locale and possibly allocating memory to make the passed string 0-terminated. IMO libstdc++'s checkbox "solution" is worse than not implementing it at all; users are better off using Lemire's API-compatible fast_float implementation [0].
[0] https://github.com/fastfloat/fast_float
- Passing Programs To A Stack Machine
I'm a bit stuck on how to do the same thing in C++, due to containers only having a single type. The very inefficient way I'm currently doing it is by passing a program as a vector of strings, and then converting the string constants to doubles with the fast_float library.
- Parsing can become accidentally quadratic because of sscanf
Just above this comment is a merged PR, which references the fast_float library: https://github.com/fastfloat/fast_float
- Making Rust Float Parsing Fast: libcore Edition
Daniel Lemire @lemire (creator of the algorithm, author of the C++ implementation, and provided constant feedback to help guide the PR).
- RapidObj v0.1 - A fast, header-only, C++17 library for parsing Wavefront .obj files.
And out of 6,000 lines in the file, at least 3,000 are other people's code: earcut for polygon triangulation and fast_float, because .obj files typically contain a lot of floating point numbers, so it's important to parse them quickly.
- First release of dragonbox, a fast float-to-string conversion algorithm, is available
How does this compare to https://github.com/fastfloat/fast_float ?
- Why is std::from_chars<float> slow?
I tried to compare it against Daniel Lemire's excellent fast_float library. fast_float took about 180ms for the same program, and all I did was change the "std" namespace prefix to "fast_float". It's a factor of 12 difference, at least on my machine. I tried MSVC next, and it is a lot better, but it is still ~4 times slower than fast_float. AFAIK, clang currently does not implement the feature at all.
- Iterator invalidation of std::string_view
If you don't mind a 3rd party lib until your stdlib updates, https://github.com/fastfloat/fast_float is best-in-class.
What are some alternatives?
rapidgzip - Gzip Decompression and Random Access for Modern Multi-Core Machines
dragonbox - Reference implementation of Dragonbox in C++
xgen - Salesforce open-source LLMs with 8k sequence length.
rapidobj - A fast, header-only, C++17 library for parsing Wavefront .obj files.
wizmap - Explore and interpret large embeddings in your browser with interactive visualization! 📍
C++ Format - A modern formatting library
FastSAM - Fast Segment Anything
fast-float-rust - Super-fast float parser in Rust (now part of Rust core)
background-removal-js - Remove backgrounds from images directly in the browser environment with ease and no additional costs or privacy concerns. Explore an interactive demo.
RapidJSON - A fast JSON parser/generator for C++ with both SAX/DOM style API
graphic-walker - An open source alternative to Tableau. Embeddable visual analytic
simdutf8 - SIMD-accelerated UTF-8 validation for Rust.