| | parquet-format | FLiPStackWeekly |
|---|---|---|
| Mentions | 4 | 82 |
| Stars | 1,655 | 14 |
| Growth | 1.8% | - |
| Activity | 7.2 | 9.9 |
| Last commit | 4 days ago | 1 day ago |
| Language | Thrift | - |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
parquet-format
- Summing columns in remote Parquet files using DuckDB
Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format
The offsets of that metadata are well-defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
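As a rough illustration of the comment above, the sketch below locates the footer with two ranged reads. It relies on the file layout described in the parquet-format README: the file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`. The `read_range` callable and the in-memory `blob` are hypothetical stand-ins for an S3 or blob-storage range request, and the footer is treated as opaque bytes rather than decoded from Thrift.

```python
import struct

MAGIC = b"PAR1"


def read_footer(read_range, file_size):
    """Return the raw footer bytes using two ranged reads.

    read_range(start, end) is a hypothetical callable that returns the
    bytes in [start, end), e.g. a wrapper around an HTTP Range request.
    """
    if file_size < 12:
        raise ValueError("too small to be a Parquet file")
    # Read 1: the last 8 bytes hold the footer length and the magic.
    tail = read_range(file_size - 8, file_size)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    footer_len = struct.unpack("<I", tail[:4])[0]
    footer_start = file_size - 8 - footer_len
    if footer_start < len(MAGIC):
        raise ValueError("footer length is larger than the file")
    # Read 2: only the footer itself, never the column data.
    return read_range(footer_start, file_size - 8)


# Demo against an in-memory blob laid out like a Parquet file.
fake_footer = b"thrift-encoded FileMetaData would go here"
blob = (
    MAGIC
    + b"\x00" * 1000  # placeholder for the column data
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)

requests = []


def read_range(start, end):
    requests.append((start, end))
    return blob[start:end]


footer = read_footer(read_range, len(blob))
bytes_fetched = sum(end - start for start, end in requests)
```

In this demo only the trailing 8 bytes and the footer are fetched, so `bytes_fetched` stays far below the size of `blob`.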
- FLaNK Stack for 4th of July
- I have question related to Parquet files and AWS Glue
As I read here https://github.com/apache/parquet-format/blob/master/LogicalTypes.md , they are stored as integers, and these integers represent the number of days (for Date) or the number of milliseconds, microseconds or nanoseconds (for DateTime) since 1970-01-01. This works as expected with the Parquet files written by our ETL tool from the internal database to S3: all Date/DateTime columns are integers, which means that in the Glue job I have to convert these integers back to Date/DateTime values before doing any transformation on them. But when Parquet files are written by Spark, the columns come back as Date/DateTime (or Timestamp, to be more precise) rather than integers (I checked by reading these files again in another Glue job), and that confuses me.
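For the conversion the question describes, a minimal sketch in plain Python might look like the following. The function names and the `unit` strings are hypothetical choices for this example, not part of Glue or Spark, and the timestamp is assumed to be UTC-based. Python's `datetime` only has microsecond resolution, so nanosecond values are floored to microseconds here.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)

# How many microseconds one stored unit represents (nanoseconds are
# handled separately because they are finer than a microsecond).
MICROS_PER_UNIT = {"MILLIS": 1000, "MICROS": 1}


def days_to_date(days):
    """Convert a day count since 1970-01-01 to a date."""
    return EPOCH_DATE + timedelta(days=days)


def timestamp_to_datetime(value, unit="MICROS"):
    """Convert an integer timestamp since the epoch to a UTC datetime.

    unit is one of "MILLIS", "MICROS" or "NANOS" (labels assumed for
    this sketch).
    """
    if unit == "NANOS":
        micros = value // 1000  # floor: sub-microsecond precision is dropped
    else:
        micros = value * MICROS_PER_UNIT[unit]
    return EPOCH_UTC + timedelta(microseconds=micros)
```

For instance, `days_to_date(18993)` gives `date(2022, 1, 1)`, and `timestamp_to_datetime(86_400_000, "MILLIS")` gives midnight UTC on 1970-01-02.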
- Parquet: More than just “Turbo CSV”
Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.
The Parquet datatypes documentation is pretty clear that there is a flag isAdjustedToUTC to define if the timestamp should be interpreted as having Instant semantics or Local semantics.
https://github.com/apache/parquet-format/blob/master/Logical...
Still no option to include a TZ offset in the data (so the same datum can be interpreted with both Local and Instant semantics) but not bad really.
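To make the Instant versus Local distinction concrete, here is a small sketch of how a reader could act on the `isAdjustedToUTC` flag for a microsecond timestamp. The function name is hypothetical, and mapping Instant semantics to a timezone-aware `datetime` and Local semantics to a naive one is an assumption of this example, not something the format prescribes.

```python
from datetime import datetime, timedelta, timezone


def interpret_timestamp_micros(value, is_adjusted_to_utc):
    """Interpret microseconds since the epoch according to the flag."""
    delta = timedelta(microseconds=value)
    if is_adjusted_to_utc:
        # Instant semantics: a fixed point on the UTC timeline.
        return datetime(1970, 1, 1, tzinfo=timezone.utc) + delta
    # Local semantics: a wall-clock reading with no time zone attached.
    return datetime(1970, 1, 1) + delta


stored = 1_700_000_000_000_000  # the same stored integer, read both ways
as_instant = interpret_timestamp_micros(stored, True)
as_local = interpret_timestamp_micros(stored, False)
```

Both results show the same calendar fields; the difference is only whether the value is pinned to UTC or left for the reader to place in a time zone, which is why a stored offset would be needed to support both readings at once.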
FLiPStackWeekly
What are some alternatives?
rapidgzip - Gzip Decompression and Random Access for Modern Multi-Core Machines
gorilla-cli - LLMs for your CLI
xgen - Salesforce open-source LLMs with 8k sequence length.
awk-raycaster - Pseudo-3D shooter written completely in gawk using raycasting technique
wizmap - Explore and interpret large embeddings in your browser with interactive visualization! 📍
litellm - Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs)
FastSAM - Fast Segment Anything
modelscope - ModelScope: bring the notion of Model-as-a-Service to life.
background-removal-js - Remove backgrounds from images directly in the browser environment with ease and no additional costs or privacy concerns. Explore an interactive demo.
create-nifi-pulsar-flink-apps - How to create a real-time scalable streaming app using Apache NiFi, Apache Pulsar and Apache Flink SQL
graphic-walker - An open source alternative to Tableau. Embeddable visual analytic
FLiP-PulsarSummit2022Asia - FLiP-PulsarSummit2022Asia: Pulsar Summit Asia 2022