Parquet

Open-source projects categorized as Parquet

Top 23 Parquet Open-Source Projects

  • dsq

    Command-line tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.

    Project mention: Tracking SQLite Database Changes in Git | news.ycombinator.com | 2023-11-02

    You might want to look at tsv-utils, or a similar project: https://github.com/eBay/tsv-utils

    For the SQL part, but maybe a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (No longer maintained, but has links to lots of other projects)

  • roapi

    Create full-fledged APIs for slowly moving datasets without writing a single line of code.

    Project mention: Full-fledged APIs for slowly moving datasets without writing code | news.ycombinator.com | 2023-10-25

  • Apache Parquet

    Apache Parquet

  • qsv

    CSVs sliced, diced & analyzed.

    Project mention: Qsv: Efficient CSV CLI Toolkit | news.ycombinator.com | 2023-12-22

    Thanks for the detailed feedback @snidane!

    As maintainer of qsv, here's my reply:

    - Given qsv's rapid release cycle (173 releases over three years), the auto-update check is essential at the moment. Once we reach 1.0, I'll turn it off. For now, given your feedback, I've only made it check 10% of the time.

    - Pivot is in the backlog and I'll be sure to add unpivot when I implement it. (https://github.com/jqnatividad/qsv/issues/799)

    - I'll add a dedicated summing command with the group by (-by) and window by (-over) capability (https://github.com/jqnatividad/qsv/issues/1514). Do note that `stats` has basic sum as @ezequiel-garzon pointed out.

    - With the `enum` command, qsv can achieve what you proposed with `laminate`, e.g. `qsv enum --new-column newcol --constant newconstant mydata.csv --output laminated-data.csv`

    - With the `cat rowskey` command, qsv can already concatenate files with mismatched headers.

    - Other file formats: qsv supports parquet, csv, tsv, excel, ods, datapackage, sqlite and more (see https://github.com/jqnatividad/qsv/tree/master#file-formats). Fixed-format is not supported yet, but it is quite interesting, and I have added it to the backlog (https://github.com/jqnatividad/qsv/issues/1515)

    - As to "enable embedding outputs of commands", qsv is composable by design, so you can use standard stdin/stdout redirection/piping techniques to have it work with other CLI tools like jq, awk, etc.

    Finally, I just released v0.120.0, which already incorporates the less aggressive self-update check: https://github.com/jqnatividad/qsv/releases/tag/0.120.0

  • Apache Drill

    Apache Drill is a distributed MPP query layer for self-describing data (by apache)

    Project mention: Git Query Language (GQL) Aggregation Functions, Groups, Alias | /r/ProgrammingLanguages | 2023-06-30

    Also, are you familiar with Apache Drill? The idea is to put an SQL interpreter in front of any kind of database, just like you are doing for git here.

  • petastorm

    The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

  • parquet-format

    Apache Parquet

    Project mention: Summing columns in remote Parquet files using DuckDB | news.ycombinator.com | 2023-11-16

    Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format

    The offsets of said metadata are well-defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
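A minimal Python sketch of the footer lookup described in the comment above, assuming an unencrypted Parquet file, which ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The `read_range` callback and the synthetic `blob` are hypothetical stand-ins for a ranged read against S3 or another blob store; they are not part of any library.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_byte_range(read_range, file_size):
    """Return (start, end) byte offsets of the footer metadata.

    read_range(start, end) is a hypothetical callback returning bytes
    [start, end) of the file, e.g. backed by an HTTP Range request.
    """
    if file_size < len(PARQUET_MAGIC) + TAIL_SIZE:
        raise ValueError("file too small to be Parquet")
    # First ranged read: only the last 8 bytes of the file.
    tail = read_range(file_size - TAIL_SIZE, file_size)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = file_size - TAIL_SIZE - footer_len
    if start < len(PARQUET_MAGIC):
        raise ValueError("footer length exceeds file size")
    return start, file_size - TAIL_SIZE


# Demo on a synthetic in-memory "file": only the framing bytes are real,
# the payload and footer contents are placeholders.
fake_footer = b"thrift-encoded-metadata"
blob = (
    PARQUET_MAGIC
    + b"column-chunks"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + PARQUET_MAGIC
)


def read_from_blob(start, end):
    return blob[start:end]


start, end = footer_byte_range(read_from_blob, len(blob))
# Second ranged read: just the footer, without touching the column data.
footer_bytes = read_from_blob(start, end)
```

Two small reads (the 8-byte tail, then the footer itself) are enough to locate the metadata; decoding the Thrift-encoded footer is left to a real Parquet library.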

  • quilt

    Quilt is a data mesh for connecting people with actionable data

  • rill

    Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code. (by rilldata)

    Project mention: Governments on GitHub | news.ycombinator.com | 2023-06-09

  • adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

    Project mention: biobear -- python package with minimal dependencies for bioinformatic file parsing and querying using rust and polars as the backend | /r/bioinformatics | 2023-04-24

    FYI: ADAM seems to do that

  • cryo

    cryo is the easiest way to extract blockchain data to parquet, csv, json, or python dataframes

    Project mention: cryo: NEW Data - star count:778.0 | /r/algoprojects | 2023-12-09

  • Cinchoo ETL

    ETL framework for .NET (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)

  • ParquetViewer

    Simple Windows desktop application for viewing & querying Apache Parquet files

  • kglab

    Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.

  • pystore

    Fast data store for Pandas time-series data

  • vscode-data-preview

    Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files

  • parquetjs

    Fully asynchronous, pure JavaScript implementation of the Parquet file format

  • parquet2

    Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

  • parquet4s

    Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.

  • grai-core

    Project mention: Launch HN: Grai (YC S22) – Open-Source Data Observability Platform | news.ycombinator.com | 2023-07-17

    Elastic v2 if one is interested in such things: https://github.com/grai-io/grai-core/blob/v0.1.33/LICENSE

  • pqrs

    Command line tool for inspecting Parquet files

  • amazon-s3-find-and-forget

    Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

  • parquet-wasm

    Rust-based WebAssembly bindings to read and write Apache Parquet data

    Project mention: Goodbye, Node.js Buffer | news.ycombinator.com | 2023-10-24

    nodejs-polars is node-specific and uses native FFI. polars can be compiled to Wasm but doesn't yet have a js API out of the box.

    As for the fastest way to serialize Pandas data to the browser, you should use Parquet; it's the fastest to write on the Python side and read on the JS side, while also being compressed. See https://github.com/kylebarron/parquet-wasm (full disclosure, I wrote this)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-12-22.

Index

What are some of the best open-source Parquet projects? This list will help you:

Rank  Project                     Stars
1     dsq                         3,516
2     roapi                       3,030
3     Apache Parquet              2,374
4     qsv                         2,174
5     Apache Drill                1,877
6     petastorm                   1,739
7     parquet-format              1,615
8     quilt                       1,310
9     rill                        1,291
10    adam                          965
11    cryo                          931
12    Cinchoo ETL                   729
13    ParquetViewer                 618
14    kglab                         546
15    pystore                       527
16    vscode-data-preview           518
17    parquetjs                     342
18    parquet2                      342
19    parquet4s                     271
20    grai-core                     266
21    pqrs                          242
22    amazon-s3-find-and-forget     230
23    parquet-wasm                  216