Top 23 Parquet Open-Source Projects
-
You might want to look at tsv-utils, or a similar project: https://github.com/eBay/tsv-utils
For the SQL part, though it may be a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (dsq is no longer maintained, but its README links to lots of other projects)
-
Project mention: Full-fledged APIs for slowly moving datasets without writing code | news.ycombinator.com | 2023-10-25
-
Thanks for the detailed feedback @snidane!
As maintainer of qsv, here's my reply:
- Given qsv's rapid release cycle (173 releases over three years), the auto-update check is essential at the moment. Once we reach 1.0, I'll turn it off. For now, given your feedback, I've only made it check 10% of the time.
- Pivot is in the backlog and I'll be sure to add unpivot when I implement it. (https://github.com/jqnatividad/qsv/issues/799)
- I'll add a dedicated summing command with the group by (-by) and window by (-over) capability (https://github.com/jqnatividad/qsv/issues/1514). Do note that `stats` has basic sum as @ezequiel-garzon pointed out.
- With the `enum` command, qsv can achieve what you proposed with `laminate`. E.g. `qsv enum --new-column newcol --constant newconstant mydata.csv --output laminated-data.csv`
- With the `cat rowskey` command, qsv can already concatenate files with mismatched headers.
- Other file formats: qsv supports parquet, csv, tsv, excel, ods, datapackage, sqlite and more (see https://github.com/jqnatividad/qsv/tree/master#file-formats). Fixed-format is not supported yet, though. It is quite interesting, and I have added it to the backlog (https://github.com/jqnatividad/qsv/issues/1515)
- As to "enable embedding outputs of commands", qsv is composable by design, so you can use standard stdin/stdout redirection/piping techniques to have it work with other CLI tools like jq, awk, etc.
Finally, just released v0.120.0 that already incorporates the less aggressive self-update check. https://github.com/jqnatividad/qsv/releases/tag/0.120.0
-
Project mention: Git Query Language (GQL) Aggregation Functions, Groups, Alias | /r/ProgrammingLanguages | 2023-06-30
Also, are you familiar with Apache Drill? The idea is to put an SQL interpreter in front of any kind of database, just like you are doing for Git here.
-
petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
-
Project mention: Summing columns in remote Parquet files using DuckDB | news.ycombinator.com | 2023-11-16
Right, there's all sorts of metadata, and often stats, included in any Parquet file: https://github.com/apache/parquet-format#file-format
The offsets of said metadata are well-defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
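To make the footer point concrete, here is a minimal sketch in Python, assuming the plain (unencrypted) layout described in the parquet-format README: the file ends with the serialized file metadata, then a 4-byte little-endian metadata length, then the magic bytes `PAR1`. The helper names `footer_range` and `range_header` are hypothetical, made up for this illustration, and the actual network request is left out.

```python
import struct
from typing import Tuple

PARQUET_MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file
TAIL_SIZE = 8            # 4-byte footer length + 4-byte magic


def footer_range(file_size: int, tail: bytes) -> Tuple[int, int]:
    """Return the inclusive (start, end) byte offsets of the footer metadata.

    `tail` must be the last 8 bytes of the file, e.g. fetched with a
    suffix range request such as "Range: bytes=-8".
    """
    if len(tail) != TAIL_SIZE or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    # The metadata length is stored as a little-endian unsigned 32-bit int.
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = file_size - TAIL_SIZE - footer_len
    end = file_size - TAIL_SIZE - 1
    if start < len(PARQUET_MAGIC):
        # The metadata cannot overlap the leading magic bytes.
        raise ValueError("footer length is larger than the file allows")
    return start, end


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"
```

With the file size known (for example from a HEAD request), two small range reads are enough: one for the last 8 bytes, and one for the range returned by `footer_range`. The column data itself is never downloaded.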
-
rill
Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code. (by rilldata)
-
adam
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Project mention: biobear -- python package with minimal dependencies for bioinformatic file parsing and querying using rust and polars as the backend | /r/bioinformatics | 2023-04-24
FYI: ADAM seems to do that
-
Cinchoo ETL
ETL framework for .NET (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)
-
kglab
Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
-
vscode-data-preview
Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files
-
parquet2
Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow
-
Project mention: Launch HN: Grai (YC S22) – Open-Source Data Observability Platform | news.ycombinator.com | 2023-07-17
Elastic v2 if one is interested in such things: https://github.com/grai-io/grai-core/blob/v0.1.33/LICENSE
-
amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
-
nodejs-polars is Node-specific and uses native FFI. polars can be compiled to Wasm but doesn't yet have a JS API out of the box.
As for the fastest way to serialize Pandas data to the browser, you should use Parquet; it's the fastest to write on the Python side and read on the JS side, while also being compressed. See https://github.com/kylebarron/parquet-wasm (full disclosure, I wrote this)
-
Parquet related posts
- Show HN: Vector-Io: Universal Vector Data Import/Export
- cryo: NEW Data - star count:778.0
A note from our sponsor - SaaSHub
www.saashub.com | 29 Mar 2024
Index
What are some of the best open-source Parquet projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | dsq | 3,516 |
2 | roapi | 3,030 |
3 | Apache Parquet | 2,374 |
4 | qsv | 2,174 |
5 | Apache Drill | 1,877 |
6 | petastorm | 1,739 |
7 | parquet-format | 1,615 |
8 | quilt | 1,310 |
9 | rill | 1,291 |
10 | adam | 965 |
11 | cryo | 931 |
12 | Cinchoo ETL | 729 |
13 | ParquetViewer | 618 |
14 | kglab | 546 |
15 | pystore | 527 |
16 | vscode-data-preview | 518 |
17 | parquetjs | 342 |
18 | parquet2 | 342 |
19 | parquet4s | 271 |
20 | grai-core | 266 |
21 | pqrs | 242 |
22 | amazon-s3-find-and-forget | 230 |
23 | parquet-wasm | 216 |