Top 23 Data processing Open-Source Projects
-
miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
-
Bash-Oneliner
A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
-
DALI
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
-
dasel
Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
-
rust-ndarray
ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
-
hstream
HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications. (by hstreamdb)
-
xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
-
lithops
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
-
pxi
🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
-
scramjet
Public tracker for Scramjet Cloud Platform, a platform that brings data from many environments together.
-
forte
Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
-
convtools-ita
convtools is a python library to declaratively define conversions for processing collections, doing complex aggregations and joins.
Another option is DALI (https://github.com/NVIDIA/DALI). For my project, while training EfficientNet2, it was a game changer. But it is way harder to implement in code than TorchVision or Kornia.
I would love some examples of how to do non-trivial data interop between Rust and Python. My experience is that PyO3/Maturin is excellent when converting between simple datatypes but conversions get difficult when there are non-standard types, e.g. Python Numpy arrays or Rust ndarrays or whatever other custom thing.
Polars seems to have a good model: it uses the Arrow in-memory format, which has implementations in both Python and Rust and makes a lot of the ndarray stuff easier. However, if the Rust libraries are not written Arrow-first, they become quite hard to work with. For example, there are many libraries written with https://github.com/rust-ndarray/ndarray, which is challenging to interop with NumPy.
(I am not an expert at all, please correct me if my characterizations are wrong!)
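One way to picture the zero-copy sharing that an agreed in-memory layout makes possible is Python's buffer protocol, which NumPy arrays also expose. The sketch below is only an illustration using the standard library (`array` and `memoryview`) as a stand-in. It does not use PyO3, rust-ndarray, or Arrow, and the helper name `as_2d_view` is hypothetical.

```python
from array import array


def as_2d_view(flat, rows, cols):
    """Return a zero-copy 2-D view over a flat buffer of float64 values.

    `as_2d_view` is a hypothetical helper for illustration only.
    """
    if len(flat) != rows * cols:
        raise ValueError("shape does not match buffer length")
    view = memoryview(flat)
    # memoryview.cast() needs a byte format on one side, so go through 'B'
    # before reinterpreting the same memory as a rows x cols grid of doubles.
    return view.cast("B").cast("d", shape=[rows, cols])


data = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
grid = as_2d_view(data, 2, 3)

# Writing through the original buffer is visible through the view,
# because both names refer to the same memory (no copy was made).
data[4] = 50.0
print(grid[1, 1])
print(grid.tolist())
```

The point of the sketch is that no conversion step exists: both sides agree on a memory layout and a shape, which is roughly the property that makes Arrow-first libraries easier to bridge than libraries with their own array types.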
You can actually have "background jobs" in very different ways in Elixir.
> I want background work to live on different compute capacity than http requests, both because they have very different resources usage
In Elixir, because of the way the BEAM works (the unit of parallelism is much cheaper and consumes little memory), "incoming http requests" and related "workers" are a lot less expensive than in other stacks (for instance Ruby and Python), where it is quite critical to release "http workers" and not hold the connection (which is what led to the creation of background job tools like Resque, DelayedJob, Sidekiq, Celery...).
This means that you can actually hold incoming HTTP connections a lot longer without troubles.
A consequence of this is that implementing "reverse proxies", or anything calling third party servers _right in the middle_ of your own HTTP call, is usually perfectly acceptable (something I've done more than a couple of times, the latest one powering the reverse proxy behind https://transport.data.gouv.fr - code available at https://github.com/etalab/transport-site/tree/master/apps/un...).
As a consequence, what would be a bad pattern in Python or Ruby (holding the incoming HTTP connection) is not a problem with Elixir.
> because I want to have state or queues in front of background work so there's a well-defined process for retry, error handling, and back-pressure.
Unless you deal with immediate stuff like reverse proxying or cheap "one off async tasks" (like recording a metric), there are also solutions for more "stateful" background work in Elixir.
A popular background job queue is https://github.com/sorentwo/oban (roughly similar to Sidekiq et al.), which uses Postgres.
It handles retries, errors etc.
But it's not the only solution, as you have other tools dedicated to processing, such as Broadway (https://github.com/dashbitco/broadway), which handles back-pressure, fault-tolerance, batching etc natively.
You also have simpler options, such as flow (https://github.com/dashbitco/flow), gen_stage (https://github.com/elixir-lang/gen_stage), Task.async_stream (https://hexdocs.pm/elixir/1.12/Task.html#async_stream/5) etc.
This makes it quite easy to use the "right tool for the job".
It is also interesting to note there is no need to "go evented" if you need to fetch data from multiple HTTP servers: it can happen in the exact same process (even: in a background task attached to your HTTP server), as done here https://transport.data.gouv.fr/explore (if you zoom you will see vehicles moving in realtime, and ~80 data sources are being polled every 10 seconds & broadcast to the visitors via pubsub & websockets).
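For readers coming from the Python stacks this comment contrasts with Elixir, the general shape of polling many sources with bounded concurrency can be approximated with the standard library. This is a Python sketch, not Elixir code and not the code behind the site mentioned above. `fetch_source`, `poll_all`, and the source names are hypothetical stand-ins for real HTTP calls.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_source(name):
    """Hypothetical stand-in for an HTTP request to one data source."""
    if name == "broken":
        raise RuntimeError("source unavailable")
    return {"source": name, "vehicles": len(name)}


def poll_all(sources, max_concurrency=8, timeout=5):
    """Poll every source with at most `max_concurrency` workers.

    A failure or timeout on one source is recorded and does not abort the
    others. The timeout only bounds how long we wait for each result; it
    does not cancel a worker thread that is still running.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {name: pool.submit(fetch_source, name) for name in sources}
        for name, future in futures.items():
            try:
                results[name] = ("ok", future.result(timeout=timeout))
            except Exception as exc:
                results[name] = ("error", str(exc))
    return results


snapshot = poll_all(["bus", "tram", "broken"])
print(snapshot)
```

In Elixir, `Task.async_stream` covers the same ground through its `:max_concurrency` and `:timeout` options, with each item running in a lightweight BEAM process rather than an OS thread.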
Project mention: Building a streaming SQL engine with Arrow and DataFusion | news.ycombinator.com | 2024-03-18
Project mention: Microsoft: Large-scale pretrained models for goal-directed dialog | news.ycombinator.com | 2023-06-05
You could try Xidel. It supports JSON, XML and HTML using XPath/XQuery 3.1.
It has some extensions to the standard that are pretty nice (JSONiq, CSS selectors, HTML "template" matching), but you can limit it to just standard XPath/XQuery if you like.
I recommend getting the nightly .99 build if you give it a try; the stable .98 version is pretty old, and I've had no issues with .99.
Project mention: 25 million Creative Commons image dataset released! | /r/StableDiffusion | 2023-10-01
GitHub: https://github.com/ml6team/fondant
There's also https://github.com/mech-lang/mech, which is a sort of descendant of Eve (https://witheve.com/). That too seems to be getting close to hiatus. It's a bit of a shame, since it seems like quite a nice paradigm for things like GUIs, interactive stuff, and discrete event simulation, but I suppose the paradigm is both a bit obscure and different enough from everything else that it becomes a "boil the ocean" situation where one or a few people try to hack away but aren't really able to get much traction and eventually tire themselves out.
Project mention: Support different jdbc platforms and multiple instances of same DBMS | /r/ApacheWayang | 2023-12-05
Data processing related posts
- Reactive Programming Without Functions
- Pipeline-Oriented Programming [video]
- [D] Will data augmentations work faster on TPUs?
- Support different jdbc platforms and multiple instances of same DBMS
- Native kmeans with sparkML in a WayangPlan()
- 25 million Creative Commons image dataset released!
- Some Reasons to Avoid Cython
Index
What are some of the best open-source Data processing projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | miller | 8,542 |
2 | Bash-Oneliner | 8,095 |
3 | DALI | 4,902 |
4 | dasel | 4,856 |
5 | rust-ndarray | 3,307 |
6 | pandera | 2,994 |
7 | DialoGPT | 2,315 |
8 | broadway | 2,287 |
9 | bytewax | 1,139 |
10 | GODEL | 832 |
11 | hstream | 691 |
12 | xidel | 650 |
13 | collapse | 598 |
14 | awesome-kafka | 565 |
15 | etl | 336 |
16 | fondant | 316 |
17 | lithops | 304 |
18 | pxi | 267 |
19 | scramjet | 254 |
20 | forte | 235 |
21 | mech | 199 |
22 | convtools-ita | 183 |
23 | incubator-wayang | 167 |