Data processing

Top 23 Data processing Open-Source Projects

  • miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

  • Project mention: Qsv: Efficient CSV CLI Toolkit | news.ycombinator.com | 2023-12-22
  • Bash-Oneliner

    A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.

  • DALI

    A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

  • Project mention: [D] Will data augmentations work faster on TPUs? | /r/MachineLearning | 2023-12-07

    Another option is DALI: https://github.com/NVIDIA/DALI. For my project, while training EfficientNet2, it was a game changer. But it is way harder to implement in code than TorchVision or Kornia.

  • dasel

    Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.

  • Project mention: jq 1.7 Released | news.ycombinator.com | 2023-09-06
  • rust-ndarray

    ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

  • Project mention: Some Reasons to Avoid Cython | news.ycombinator.com | 2023-09-22

    I would love some examples of how to do non-trivial data interop between Rust and Python. My experience is that PyO3/Maturin is excellent when converting between simple datatypes but conversions get difficult when there are non-standard types, e.g. Python Numpy arrays or Rust ndarrays or whatever other custom thing.

    Polars seems to have a good model where it uses the Arrow in-memory format, which has implementations in Python and Rust, and makes a lot of the ndarray stuff easier. However, if the Rust libraries are not written with Arrow first, they become quite hard to work with. For example, there are many libraries written with https://github.com/rust-ndarray/ndarray, which is challenging to interop with NumPy.

    (I am not an expert at all, please correct me if my characterizations are wrong!)

  • pandera

    A light-weight, flexible, and expressive statistical data testing library

  • DialoGPT

    Large-scale pretraining for dialogue

  • broadway

    Concurrent and multi-stage data ingestion and data processing with Elixir

  • Project mention: Switching to Elixir | news.ycombinator.com | 2023-11-09

    You can actually have "background jobs" in very different ways in Elixir.

    > I want background work to live on different compute capacity than http requests, both because they have very different resources usage

    In Elixir, because of the way the BEAM works (the unit of parallelism is much cheaper and consumes little memory), "incoming http requests" and related "workers" are much less expensive than in other stacks (for instance Ruby and Python), where it is quite critical to release "http workers" and not hold the connection (which is what led to the creation of background job tools like Resque, DelayedJob, Sidekiq, Celery...).

    This means that you can actually hold incoming HTTP connections a lot longer without trouble.

    A consequence of this is that implementing "reverse proxies", or anything calling third party servers _right in the middle_ of your own HTTP call, is usually perfectly acceptable (something I've done more than a couple of times, the latest one powering the reverse proxy behind https://transport.data.gouv.fr - code available at https://github.com/etalab/transport-site/tree/master/apps/un...).

    As a consequence, what would be a bad pattern in Python or Ruby (holding the incoming HTTP connection) is not a problem with Elixir.

    > because I want to have state or queues in front of background work so there's a well-defined process for retry, error handling, and back-pressure.

    Unless you deal with immediate stuff like reverse proxying or cheap "one off async tasks" (like recording a metric), there are also solutions for more "stateful" background work in Elixir.

    A popular background job queue is https://github.com/sorentwo/oban (roughly similar to Sidekiq et al.), which uses Postgres.

    It handles retries, errors etc.

    But it's not the only solution, as you have other tools dedicated to processing, such as Broadway (https://github.com/dashbitco/broadway), which handles back-pressure, fault-tolerance, batching etc natively.

    You also have simpler options, such as flow (https://github.com/dashbitco/flow), gen_stage (https://github.com/elixir-lang/gen_stage), Task.async_stream (https://hexdocs.pm/elixir/1.12/Task.html#async_stream/5) etc.

    This makes it quite easy to use the "right tool for the job".

    It is also interesting to note that there is no need to "go evented" if you need to fetch data from multiple HTTP servers: it can happen in the exact same process (even in a background task attached to your HTTP server), as done here https://transport.data.gouv.fr/explore (if you zoom in you will see vehicles moving in real time; ~80 data sources are polled every 10 seconds and broadcast to visitors via pubsub and websockets).
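
The pattern this comment describes (fetching from many sources concurrently, with a cap on parallelism, without letting one failure abort the batch) can be sketched outside Elixir as well. Below is a minimal Python sketch using only the standard library. It is a loose analogy to `Task.async_stream` with a concurrency limit, not a port of Broadway or of the transport-site code. The names `poll_sources`, `fake_fetch`, and `max_concurrency` are hypothetical, and `fake_fetch` stands in for a real HTTP call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def poll_sources(
    sources: Iterable[str],
    fetch: Callable[[str], str],
    max_concurrency: int = 8,
) -> List[Tuple[str, str, str]]:
    """Call fetch(source) for every source, at most max_concurrency at a time.

    Returns (source, status, payload_or_error) tuples in input order, so a
    single failing source does not abort the rest of the batch.
    """

    def safe_fetch(source: str) -> Tuple[str, str, str]:
        try:
            return (source, "ok", fetch(source))
        except Exception as exc:  # report the failure instead of raising
            return (source, "error", str(exc))

    # The pool size is the concurrency cap; Executor.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(safe_fetch, sources))


def fake_fetch(source: str) -> str:
    """Hypothetical stand-in for an HTTP request (assumption, not a real API)."""
    if source == "bad":
        raise ValueError("unreachable")
    return source.upper()


results = poll_sources(["a", "bad", "b"], fake_fetch, max_concurrency=2)
for source, status, payload in results:
    print(source, status, payload)
```

Unlike BEAM processes, Python threads are comparatively heavy, which is why the cap on workers matters more here than it would in Elixir.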

  • bytewax

    Python Stream Processing

  • Project mention: Building a streaming SQL engine with Arrow and DataFusion | news.ycombinator.com | 2024-03-18
  • GODEL

    Large-scale pretrained models for goal-directed dialog

  • Project mention: Microsoft: Large-scale pretrained models for goal-directed dialog | news.ycombinator.com | 2023-06-05
  • hstream

    HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications. (by hstreamdb)

  • Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12
  • xidel

    Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

  • Project mention: Move over jq I found something easier: fx | news.ycombinator.com | 2023-06-06

    You could try Xidel[1]. It supports JSON, XML and HTML using XPath/XQuery 3.1

    It has some extensions to the standard that are pretty nice (JSONiq, CSS selectors, html “template” matching), but you can limit it to just standard XPath/XQuery if you like.

    I recommend getting the nightly v .99 build if you give it a try, the stable .98 version is pretty old and I’ve had no issues with .99

    1. https://www.videlibri.de/xidel.html

  • collapse

    Advanced and Fast Data Transformation in R (by SebKrantz)

  • awesome-kafka

    A list about Apache Kafka

  • etl

    PHP - ETL (Extract Transform Load) data processing library (by flow-php)

  • fondant

    Production-ready data processing made easy and shareable

  • Project mention: 25 million Creative Commons image dataset released! | /r/StableDiffusion | 2023-10-01

    Github: https://github.com/ml6team/fondant

  • lithops

    A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

  • pxi

    🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.

  • scramjet

    Public tracker for Scramjet Cloud Platform, a platform that brings data from many environments together.

  • forte

    Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/

  • mech

    🦾 Main repository for the Mech programming language. Start here!

  • Project mention: Reactive Programming Without Functions | news.ycombinator.com | 2024-03-24

    There's also https://github.com/mech-lang/mech which is a sort of descendant of Eve https://witheve.com/ . That too seems to be getting close to hiatus. It's a bit of a shame since it seems like quite a nice paradigm for things like GUIs, interactive applications, and discrete event simulation, but I suppose the paradigm is both a bit obscure and different enough from everything else that it becomes a "boil the ocean" situation, where one or a few people try to hack away but aren't really able to get much traction and eventually tire themselves out.

  • convtools-ita

    convtools is a python library to declaratively define conversions for processing collections, doing complex aggregations and joins.

  • incubator-wayang

    Apache Wayang (incubating) is the first cross-platform data processing system.

  • Project mention: Support different jdbc platforms and multiple instances of same DBMS | /r/ApacheWayang | 2023-12-05
NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-24.

Index

What are some of the best open-source Data processing projects? This list will help you:

 #  Project            Stars
 1  miller             8,542
 2  Bash-Oneliner      8,095
 3  DALI               4,902
 4  dasel              4,856
 5  rust-ndarray       3,307
 6  pandera            2,994
 7  DialoGPT           2,315
 8  broadway           2,287
 9  bytewax            1,139
10  GODEL                832
11  hstream              691
12  xidel                650
13  collapse             598
14  awesome-kafka        565
15  etl                  336
16  fondant              316
17  lithops              304
18  pxi                  267
19  scramjet             254
20  forte                235
21  mech                 199
22  convtools-ita        183
23  incubator-wayang     167