parquet-format
ryu
parquet-format | ryu | |
---|---|---|
4 | 12 | |
1,655 | 1,158 | |
2.4% | - | |
7.2 | 5.9 | |
5 days ago | 3 months ago | |
Thrift | C++ | |
Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
parquet-format
-
Summing columns in remote Parquet files using DuckDB
Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format
The offsets of said metadata are well-defined (i.e. in the footer) so for S3 / blob storage so long as you can efficiently request a range of bytes you can pull the metadata without having to read all the data.
- FLaNK Stack for 4th of July
-
I have question related to Parquet files and AWS Glue
As i read here https://github.com/apache/parquet-format/blob/master/LogicalTypes.md , they are store in Integer formats and these integers represent the number of days (for Date) or number of milliseconds, microseconds or nanoseconds (for DateTime) since 1970-01-01. This works as expected with the parquet file that written by our ETL tool from internal database --> S3, all Data/DateTime columns are Integers, means that in Glue Job, i have to convert these Integers back to Date/Datetime value to do some transformation on them. But when parquet files are written by Spark, they are Date/DateTime (or TimeStamp to be more concise) format not Integers (i checked by read these files again in other Glue Job) and that make me confused.
-
Parquet: More than just “Turbo CSV”
Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.
The Parquet datatypes documentation is pretty clear that there is a flag isAdjustedToUTC to define if the timestamp should be interpreted as having Instant semantics or Local semantics.
https://github.com/apache/parquet-format/blob/master/Logical...
Still no option to include a TZ offset in the data (so the same datum can be interpreted with both Local and Instant semantics) but not bad really.
ryu
-
Printing double aka the most difficult problem in computer sciences
Nah. This is about ryu printf.
-
Parquet: More than just “Turbo CSV”
> Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu
It's not a parsing library, but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double, not by Google though, but was indeed motivated by parsing JSON fast https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
- Faster way to convert double to string, not using "%f"?
-
After obtaning a CS degree and 16 years of experience in industry, I feel somewhat confident that I can answer your programming questions correctly. Ask me anything
Me and Ryu agree that the answer should be 0.30000000000000004
-
23 years into my career, I still love PHP and JavaScript
Apparently exact minimal float-to-string conversion is more recent than I thought, and many languages used to print more (Python?) or less (PHP) decimal digits than necessary to uniquely identify the bit pattern. Python correctly prints 46000.80 + 553.04 as 46553.840000000004, but I don't know if it ever prints more digits than needed. One recent algorithm for printing floats exactly is https://github.com/ulfjack/ryu, though I'm unaware what's the state-of-the-art (https://github.com/jk-jeon/dragonbox claims to be a benchmark and the best algorithm).
-
What's the most elegant algo in your subjective view and why?
On the huge speedup side, you have the Ryū algorithm for decimal conversion (Video, Source), which is now finding its way in most standard libraries. But it isn't a hack, and a very dense, complex and precise algo, nothing like the fast-and-loose inverse square root.
-
C++ devs at FAANG companies, what kind of work do you do?
Used a wizard's magic to print "3.14" faster
-
how to make ftoa procedure from scratch
Here's a paper that details an optimized algorithm (reference implementation). It also contains a description of a correct, but slow algorithm as well as references to classic papers on the subject. Earlier the classic implementation was the dtoa one included in netlib by David Gay.
-
Dragonbox 1.1.0 is released (a fast float-to-string conversion algorithm)
At the very core of all these theoretical stuffs, there is the theory of continued fractions. This is an immensely useful monster which I even dare call as the ultimate tool for floating-point formatting/parsing that everyone who wants to contribute in this field should learn. Before I learned continued fractions, my main tool for proving stuffs was the minmax Euclid algorithm (which is one of the greatest contributions of the wonderful Ryu paper), but it turns out that it is actually just a quite straightforward application of the theory of continued fractions. The main role minmax Euclid algorithm played was to estimate the maximum size of possible errors, but with continued fractions it is even possible to find the list of all examples that generate errors above a given threshold. This is something I desperately wanted but really couldn't do back in 2020.
-
FastDoubleParser: Java port of Daniel Lemires fast_double_parser
Ryū algorithm, the converse (doubles to strings), is also much faster than using Java's number formatting classes.
https://github.com/ulfjack/ryu/blob/master/src/main/java/inf...
What are some alternatives?
rapidgzip - Gzip Decompression and Random Access for Modern Multi-Core Machines
dragonbox - Reference implementation of Dragonbox in C++
xgen - Salesforce open-source LLMs with 8k sequence length.
C++ Format - A modern formatting library
wizmap - Explore and interpret large embeddings in your browser with interactive visualization! 📍
concise-encoding - The secure data format for a modern world
FastSAM - Fast Segment Anything
proust - Compiling implementation of mustache
background-removal-js - Remove backgrounds from images directly in the browser environment with ease and no additional costs or privacy concerns. Explore an interactive demo.
graphic-walker - An open source alternative to Tableau. Embeddable visual analytic
itoa - Fast integer to ascii / integer to string conversion