| | parquet-format | polars |
|---|---|---|
| Mentions | 4 | 144 |
| Stars | 1,655 | 26,514 |
| Growth | 2.4% | 3.9% |
| Activity | 7.2 | 10.0 |
| Latest commit | 5 days ago | 4 days ago |
| Language | Thrift | Rust |
| License | Apache License 2.0 | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
parquet-format

- Summing columns in remote Parquet files using DuckDB
Right, there's all sorts of metadata, and often stats, included in any Parquet file: https://github.com/apache/parquet-format#file-format
The offsets of that metadata are well-defined (it lives in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
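As a concrete illustration of the footer layout the comment refers to: per the Parquet spec, a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so the last 8 bytes are enough to locate the metadata. A minimal stdlib-only sketch, with a local file standing in for a remote range request:

```python
import struct

def parquet_footer_length(path):
    """Read the last 8 bytes of a Parquet file: a 4-byte little-endian
    footer length followed by the 4-byte magic b'PAR1'."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # like an HTTP Range request for the final 8 bytes
        length_bytes, magic = f.read(4), f.read(4)
    if magic != b"PAR1":
        raise ValueError("not a Parquet file")
    return struct.unpack("<I", length_bytes)[0]
```

Against S3, the same trick is a ranged `GetObject` for `bytes=-8`, followed by a second ranged request for the footer itself.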
- FLaNK Stack for 4th of July
- I have a question related to Parquet files and AWS Glue
As I read here, https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, they are stored in integer formats, and these integers represent the number of days (for Date) or the number of milliseconds, microseconds, or nanoseconds (for DateTime) since 1970-01-01. This works as expected with the Parquet files written by our ETL tool from the internal database to S3: all Date/DateTime columns are integers, meaning that in the Glue Job I have to convert these integers back to Date/DateTime values before doing any transformations on them. But when the Parquet files are written by Spark, they come back in Date/DateTime (or Timestamp, to be more precise) format, not as integers (I checked by reading these files again in another Glue Job), and that confuses me.
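The integer encodings described above can be decoded with nothing but the standard library; the stored values below are invented for illustration, not taken from any particular file:

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)

# Parquet DATE: a 32-bit integer counting days since 1970-01-01
days = 19723
decoded_date = EPOCH_DATE + timedelta(days=days)  # 2024-01-01

# Parquet TIMESTAMP with MICROS unit: a 64-bit integer counting
# microseconds since the epoch
micros = 86_400_000_000  # exactly one day's worth of microseconds
decoded_ts = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
    microseconds=micros
)  # 1970-01-02 00:00:00+00:00
```

Spark hides this decoding step because its Parquet reader maps the logical types to `DateType`/`TimestampType` automatically, which is why the same columns look different depending on which tool wrote and read the file.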
- Parquet: More than just “Turbo CSV”
Date is confusing with a timezone (UTC or otherwise), and the documentation makes no such suggestion.
The Parquet datatypes documentation is pretty clear that there is a flag isAdjustedToUTC to define whether the timestamp should be interpreted with Instant semantics or Local semantics.
https://github.com/apache/parquet-format/blob/master/Logical...
Still no option to include a TZ offset in the data (so the same datum could be interpreted with both Local and Instant semantics), but not bad really.
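The two semantics can be seen side by side on a single stored integer; a sketch using only the standard library, with an invented value:

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1)
stored = 12 * 60 * 60 * 1_000_000  # 12:00:00 on 1970-01-01, in microseconds

# Local semantics (isAdjustedToUTC = false): a wall-clock reading,
# with no timezone attached
wall = epoch + timedelta(microseconds=stored)  # naive 1970-01-01 12:00

# Instant semantics (isAdjustedToUTC = true): a fixed point on the
# UTC timeline, renderable in any zone
instant = wall.replace(tzinfo=timezone.utc)
in_minus5 = instant.astimezone(timezone(timedelta(hours=-5)))  # 07:00 in UTC-5
```

The same 64-bit integer is stored either way; the flag only tells readers which of these two interpretations to apply.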
polars

- Why Python's Integer Division Floors (2010)
This is because 0.1 is in actuality the floating point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats to be (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded division is rather expensive, I chose to stick to this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
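The difference the comment describes can be checked directly in Python; this snippet only reproduces the arithmetic, not the Rust implementation in Polars:

```python
import math

x, y = 1.0, 0.1
# IEEE-754 division rounds the true quotient (just under 10) up to exactly 10.0
quotient = x / y  # 10.0
# Floor of the rounded quotient -- the (a / b).floor() definition -- gives 10
floor_of_rounded = math.floor(quotient)  # 10
# Python's // on floats is correctly rounded: it floors the *exact*
# quotient, which is just under 10, so the result is 9.0
correctly_rounded = x // y  # 9.0
```

So `1.0 // 0.1` returning `9.0` is not a bug in Python; it is the price of flooring the exact quotient rather than the rounded one.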
- Polars
https://github.com/pola-rs/polars/releases/tag/py-0.19.0
- Stuff I Learned during Hanukkah of Data 2023
That turned out to be related to pola-rs/polars#11912, and this linked comment provided a deceptively simple solution - use PARSE_DECLTYPES when creating the connection:
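PARSE_DECLTYPES is part of Python's standard sqlite3 module: it makes the driver run a registered converter based on a column's declared type, so `DATE`/`TIMESTAMP` columns come back as `datetime` objects instead of strings. A minimal sketch with an invented table (note the default date/timestamp converters are deprecated as of Python 3.12):

```python
import sqlite3
from datetime import date

# detect_types=sqlite3.PARSE_DECLTYPES enables converter lookup by decltype
con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE visits (day DATE)")  # decltype DATE triggers the converter
con.execute("INSERT INTO visits VALUES (?)", (date(2023, 12, 7),))
(day,) = con.execute("SELECT day FROM visits").fetchone()
# day comes back as datetime.date(2023, 12, 7), not the string '2023-12-07'
```

Without the flag, the same query would return the ISO string the adapter stored, which is exactly the mismatch the linked issue describes.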
- Polars 0.20 Released
- Second language
- Polars: DataFrames powered by a multithreaded query engine, written in Rust
- Summing columns in remote Parquet files using DuckDB
- Polars 0.34 is released (a query engine focusing on DataFrame front ends)
What are some alternatives?
rapidgzip - Gzip Decompression and Random Access for Modern Multi-Core Machines
vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
xgen - Salesforce open-source LLMs with 8k sequence length.
modin - Modin: Scale your Pandas workflows by changing a single line of code
wizmap - Explore and interpret large embeddings in your browser with interactive visualization! 📍
datafusion - Apache DataFusion SQL Query Engine
FastSAM - Fast Segment Anything
DataFrames.jl - In-memory tabular data in Julia
background-removal-js - Remove backgrounds from images directly in the browser environment with ease and no additional costs or privacy concerns. Explore an interactive demo.
datatable - A Python package for manipulating 2-dimensional tabular data structures
graphic-walker - An open source alternative to Tableau. Embeddable visual analytic
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing