Top 23 Data processing Open-Source Projects

miller

63 8,542 9.1 Go

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

Project mention: Qsv: Efficient CSV CLI Toolkit | news.ycombinator.com | 2023-12-22
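
To make "name-indexed data" concrete, here is a minimal Python sketch of the idea: records are addressed by column name rather than position, so a sort followed by a cut can be written against field names. This is not Miller's implementation or its command syntax; the helper `sort_and_cut` and the sample data are hypothetical, for illustration only.

```python
import csv
import io


def sort_and_cut(csv_text, sort_key, keep_fields):
    """Sort CSV records by a named field, then keep only the named fields.

    Hypothetical helper: it only illustrates name-indexed processing
    and is not part of Miller.
    """
    # DictReader keys every record by the header names.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Lexical sort on the chosen field (all CSV values are strings).
    rows.sort(key=lambda row: row[sort_key])
    # "Cut": project each record onto the requested field names.
    return [{name: row[name] for name in keep_fields} for row in rows]


# Sample data made up from entries on this list.
SAMPLE = "name,lang,stars\nmiller,Go,8542\ndasel,Go,4856\nxidel,Pascal,650\n"

result = sort_and_cut(SAMPLE, "name", ["name", "stars"])
for record in result:
    print(record)
```

Because fields are looked up by name, reordering the columns in the input would not change the result.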

Bash-Oneliner

18 8,095 3.4

A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
DALI

5 4,902 9.6 C++

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

Project mention: [D] Will data augmentations work faster on TPUs? | /r/MachineLearning | 2023-12-07

Another option is DALI (https://github.com/NVIDIA/DALI). For my project, while training EfficientNet2, it was a game changer. But it is way harder to implement in code than TorchVision or Kornia.

dasel

44 4,856 8.2 Go

Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.

Project mention: jq 1.7 Released | news.ycombinator.com | 2023-09-06
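
As a rough illustration of what "select, put and delete" means for structured data, the Python sketch below walks a parsed JSON document with a dotted path. It does not use dasel or reproduce its selector syntax; the functions `select`, `put`, and `delete` and the path format are assumptions made for this example.

```python
import json


def select(doc, path):
    """Return the value at a dotted path such as 'project.formats.0'."""
    node = doc
    for key in path.split("."):
        # List positions are addressed by integer index, dict entries by key.
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


def _parent(doc, path):
    """Split a path into (container holding the last key, last key)."""
    *parents, last = path.split(".")
    node = select(doc, ".".join(parents)) if parents else doc
    return node, (int(last) if isinstance(node, list) else last)


def put(doc, path, value):
    """Set the value at the path (the parent container must already exist)."""
    node, key = _parent(doc, path)
    node[key] = value


def delete(doc, path):
    """Remove the value at the path."""
    node, key = _parent(doc, path)
    del node[key]


doc = json.loads('{"project": {"name": "dasel", "formats": ["JSON", "TOML", "YAML"]}}')
first_format = select(doc, "project.formats.0")
put(doc, "project.language", "Go")
delete(doc, "project.formats.2")
print(json.dumps(doc))
```

The same three operations would apply to any format that parses into nested dictionaries and lists, which is presumably what makes a single tool across JSON, TOML, YAML, XML and CSV practical.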

rust-ndarray

20 3,307 8.1 Rust

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

Project mention: Some Reasons to Avoid Cython | news.ycombinator.com | 2023-09-22

I would love some examples of how to do non-trivial data interop between Rust and Python. My experience is that PyO3/Maturin is excellent when converting between simple datatypes but conversions get difficult when there are non-standard types, e.g. Python Numpy arrays or Rust ndarrays or whatever other custom thing.
Polars seems to have a good model where it uses the Arrow in-memory format, which has implementations in Python and Rust and makes a lot of the ndarray stuff easier. However, if the Rust libraries are not written with Arrow first, they become quite hard to work with. For example, there are many libraries written with https://github.com/rust-ndarray/ndarray, which is challenging to interop with Numpy.
(I am not an expert at all, please correct me if my characterizations are wrong!)

pandera

7 2,994 8.9 Python

A light-weight, flexible, and expressive statistical data testing library
DialoGPT

7 2,315 0.0 Python

Large-scale pretraining for dialogue
broadway

11 2,287 6.0 Elixir

Concurrent and multi-stage data ingestion and data processing with Elixir

Project mention: Switching to Elixir | news.ycombinator.com | 2023-11-09

You can actually have "background jobs" in very different ways in Elixir.
> I want background work to live on different compute capacity than http requests, both because they have very different resources usage
In Elixir, because of the way the BEAM works (the unit of parallelism is much cheaper and consumes little memory), "incoming http requests" and related "workers" are far less expensive than in other stacks (for instance Ruby and Python), where it is quite critical to release "http workers" and not hold the connection (which is what led to the creation of background job tools like Resque, DelayedJob, Sidekiq, Celery...).
This means that you can actually hold incoming HTTP connections a lot longer without troubles.
A consequence of this is that implementing "reverse proxies", or anything calling third party servers _right in the middle_ of your own HTTP call, is usually perfectly acceptable (something I've done more than a couple of times, the latest one powering the reverse proxy behind https://transport.data.gouv.fr - code available at https://github.com/etalab/transport-site/tree/master/apps/un...).
As a consequence, what would be a bad pattern in Python or Ruby (holding the incoming HTTP connection) is not a problem with Elixir.
> because I want to have state or queues in front of background work so there's a well-defined process for retry, error handling, and back-pressure.
Unless you deal with immediate stuff like reverse proxying or cheap "one off async tasks" (like recording a metric), there are also solutions for more "stateful" background work in Elixir.
A popular background job queue is https://github.com/sorentwo/oban (roughly similar to Sidekiq et al.), which uses Postgres.
It handles retries, errors, etc.
But it's not the only solution, as you have other tools dedicated to processing, such as Broadway (https://github.com/dashbitco/broadway), which handles back-pressure, fault-tolerance, batching etc natively.
You also have more simple options, such as flow (https://github.com/dashbitco/flow), gen_stage (https://github.com/elixir-lang/gen_stage), Task.async_stream (https://hexdocs.pm/elixir/1.12/Task.html#async_stream/5) etc.
This makes it quite easy to use the "right tool for the job".
It is also interesting to note there is no need to "go evented" if you need to fetch data from multiple HTTP servers: it can happen in the exact same process (even in a background task attached to your HTTP server), as done here https://transport.data.gouv.fr/explore (if you zoom you will see vehicles moving in realtime, and ~80 data sources are being polled every 10 seconds and broadcast to the visitors via pubsub and websockets).
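
The pattern described in the comment above (fetching from many sources concurrently with a cap on how many run at once) can be sketched in Python as a loose analogue. This is not Elixir and not how `Task.async_stream` or Broadway are implemented; `fetch` is a hypothetical stand-in for a real HTTP call, and `max_concurrency` is an assumed parameter name.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(source):
    """Hypothetical stand-in for an HTTP request to one data source."""
    # A real poller would perform network I/O here; this fake payload
    # keeps the sketch self-contained and deterministic.
    return {"source": source, "vehicles": len(source)}


def poll_all(sources, max_concurrency=8):
    """Poll every source, running at most max_concurrency fetches at once."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(fetch, sources))


sources = [f"feed-{i}" for i in range(5)]
results = poll_all(sources, max_concurrency=2)
for item in results:
    print(item)
```

Python threads are considerably heavier than BEAM processes, which is the commenter's point: in Elixir the equivalent can run next to the HTTP server without a separate job system.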

bytewax

18 1,139 9.8 Python

Python Stream Processing

Project mention: Building a streaming SQL engine with Arrow and DataFusion | news.ycombinator.com | 2024-03-18

GODEL

5 832 3.4 Python

Large-scale pretrained models for goal-directed dialog

Project mention: Microsoft: Large-scale pretrained models for goal-directed dialog | news.ycombinator.com | 2023-06-05

hstream

1 691 9.5 Haskell

HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications. (by hstreamdb)

Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12

xidel

18 650 5.9 Pascal

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

Project mention: Move over jq I found something easier: fx | news.ycombinator.com | 2023-06-06

You could try Xidel[1]. It supports JSON, XML and HTML using XPath/XQuery 3.1
It has some extensions to the standard that are pretty nice (JSONiq, CSS selectors, html “template” matching), but you can limit it to just standard XPath/XQuery if you like.
I recommend getting the nightly v .99 build if you give it a try, the stable .98 version is pretty old and I’ve had no issues with .99
1. https://www.videlibri.de/xidel.html

collapse

2 598 9.6 C

Advanced and Fast Data Transformation in R (by SebKrantz)
awesome-kafka

1 565 4.7

A list about Apache Kafka
etl

1 336 9.4 PHP

PHP - ETL (Extract Transform Load) data processing library (by flow-php)
fondant

4 316 9.7 Python

Production-ready data processing made easy and shareable

Project mention: 25 million Creative Commons image dataset released! | /r/StableDiffusion | 2023-10-01

Github: https://github.com/ml6team/fondant

lithops

2 304 9.4 Python

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
pxi

4 267 0.0 JavaScript

🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
scramjet

0 254 0.0 JavaScript

Public tracker for Scramjet Cloud Platform, a platform that brings data from many environments together.
forte

2 235 3.9 Python

Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
mech

5 199 7.0 Rust

🦾 Main repository for the Mech programming language. Start here!

Project mention: Reactive Programming Without Functions | news.ycombinator.com | 2024-03-24

There's also https://github.com/mech-lang/mech which is a sort of descendant of Eve https://witheve.com/ . That too seems to be getting close to hiatus. It's a bit of a shame since it seems like quite a nice paradigm for some stuff like GUIs, interactive stuff, and discrete event simulation, but I suppose the paradigm is both a bit obscure and different enough from everything else that it becomes a "boil the ocean" situation where one or a few people try to hack away but aren't really able to get much traction and eventually tire themselves out.

convtools-ita

3 183 0.0 Python

convtools is a python library to declaratively define conversions for processing collections, doing complex aggregations and joins.
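
For readers unfamiliar with the operations named in this description, the plain-Python sketch below shows a group-by aggregation and an inner join over collections of dictionaries. It deliberately does not use the convtools API; `group_sum` and `inner_join` are hypothetical helpers, and the sample rows reuse star and mention counts from this list.

```python
from collections import defaultdict


def group_sum(rows, key, value):
    """Sum the `value` field of the rows, grouped by the `key` field."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def inner_join(left, right, key):
    """Join two lists of dicts on a shared field, keeping matches only."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # .get avoids creating empty entries for keys missing on the right.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


projects = [
    {"name": "miller", "lang": "Go", "stars": 8542},
    {"name": "dasel", "lang": "Go", "stars": 4856},
    {"name": "mech", "lang": "Rust", "stars": 199},
]
mentions = [
    {"name": "miller", "mentions": 63},
    {"name": "mech", "mentions": 5},
]

stars_by_lang = group_sum(projects, "lang", "stars")
joined = inner_join(projects, mentions, "name")
print(stars_by_lang)
print(joined)
```

A declarative library would typically describe such a conversion once and then apply it to the data, rather than hand-writing the loops each time as done here.
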
incubator-wayang

18 167 9.3 Java

Apache Wayang (incubating) is the first cross-platform data processing system.

Project mention: Support different jdbc platforms and multiple instances of same DBMS | /r/ApacheWayang | 2023-12-05


NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-24.

Data processing related posts

Reactive Programming Without Functions
2 projects | news.ycombinator.com | 24 Mar 2024
Pipeline-Oriented Programming [video]
4 projects | news.ycombinator.com | 20 Jan 2024
[D] Will data augmentations work faster on TPUs?
1 project | /r/MachineLearning | 7 Dec 2023
Support different jdbc platforms and multiple instances of same DBMS
1 project | /r/ApacheWayang | 5 Dec 2023
Native kmeans with sparkML in a WayangPlan()
1 project | /r/ApacheWayang | 28 Oct 2023
25 million Creative Commons image dataset released!
1 project | /r/StableDiffusion | 1 Oct 2023
Some Reasons to Avoid Cython
5 projects | news.ycombinator.com | 22 Sep 2023

Index

What are some of the best open-source Data processing projects? This list will help you:

#	Project	Stars
1	miller	8,542
2	Bash-Oneliner	8,095
3	DALI	4,902
4	dasel	4,856
5	rust-ndarray	3,307
6	pandera	2,994
7	DialoGPT	2,315
8	broadway	2,287
9	bytewax	1,139
10	GODEL	832
11	hstream	691
12	xidel	650
13	collapse	598
14	awesome-kafka	565
15	etl	336
16	fondant	316
17	lithops	304
18	pxi	267
19	scramjet	254
20	forte	235
21	mech	199
22	convtools-ita	183
23	incubator-wayang	167