Data processing

Open-source projects categorized as Data processing

Top 21 Data processing Open-Source Projects

  • GitHub repo miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

    Project mention: Fq: Jq for Binary Formats | 2021-12-22

    Miller Csv can process json in record format and has a much saner DSL in my experience.

  • GitHub repo awesome-web-scraping

    List of libraries, tools and APIs for web scraping and data processing.

    Project mention: A central repository for scrapping scripts | 2021-02-22

  • GitHub repo Activeloop Hub

    Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. (by activeloopai)

    Project mention: The hand-picked selection of the best Python libraries released in 2021 | 2021-12-21


  • GitHub repo DALI

    A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

    Project mention: [D] Efficiently loading videos in PyTorch without extracting frames | 2021-10-26

  • GitHub repo rust-ndarray

    ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

    Project mention: Enzyme: Towards state-of-the-art AutoDiff in Rust | 2021-12-12

    I don't think any of the major ML projects have GPU acceleration because ndarray doesn't support it.

  • GitHub repo dasel

    Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.

    Project mention: How to convert a JSON file to CSV file with Golang. | 2021-12-08

    If you're just looking for a utility to do it (and a bunch of other stuff), there's dasel.
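    For readers curious about what such a conversion involves, here is a minimal Python sketch that uses only the standard library. It is not dasel and does not touch dasel's Go package; the function name `json_records_to_csv` is made up for illustration, and the sketch assumes the input is a JSON array of flat objects.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so the header is stable.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty cells.
    writer.writerows(records)
    return buffer.getvalue()


# Sample input (illustrative): two of the projects from this list.
sample = '[{"name": "miller", "stars": 4918}, {"name": "dasel", "stars": 1757}]'
csv_text = json_records_to_csv(sample)
```

    Nested objects or arrays would need flattening first, which this sketch deliberately leaves out.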

  • GitHub repo broadway

    Concurrent and multi-stage data ingestion and data processing with Elixir

    Project mention: How we sync Stripe to Postgres | 2021-07-08

    This was a great excuse to use Elixir's Broadway. A Broadway pipeline consists of one producer and one or more workers. The producer is in charge of producing jobs. The workers consume and work those jobs, each working in parallel. Broadway gives us a few things out of the box:
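    The producer and worker roles described in this post can be sketched roughly in Python. This is only an analogy built on the standard library's `queue` and `threading` modules, not Broadway's Elixir API; the names `producer`, `worker`, and `run_pipeline` and the doubling "work" are placeholders.

```python
import queue
import threading

SENTINEL = None  # tells one worker that no more jobs are coming


def producer(jobs: queue.Queue, items, worker_count: int) -> None:
    """Put every job on the queue, then one stop marker per worker."""
    for item in items:
        jobs.put(item)
    for _ in range(worker_count):
        jobs.put(SENTINEL)


def worker(jobs: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Consume jobs until a stop marker arrives."""
    while True:
        job = jobs.get()
        if job is SENTINEL:
            break
        processed = job * 2  # placeholder for the real work
        with lock:
            results.append(processed)


def run_pipeline(items, worker_count: int = 3) -> list:
    """Run one producer and several workers in parallel threads."""
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(worker_count)
    ]
    for thread in workers:
        thread.start()
    producer(jobs, items, worker_count)
    for thread in workers:
        thread.join()
    # Workers finish in arbitrary order, so sort for a stable result.
    return sorted(results)


pipeline_output = run_pipeline(range(5))
```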

  • GitHub repo DialoGPT

    Large-scale pretraining for dialogue

    Project mention: I made a Python tool to help you know what to say! | 2021-10-30

    I learned about GPT-3 and its strength as a generative model but couldn't access it yet (can't afford the API). Thankfully I found a GPT-2 based pre-trained model DialoGPT that was trained on Reddit.

  • GitHub repo pandera

    A light-weight, flexible, and expressive data validation library for dataframes

    Project mention: Show HN: Pandera 0.8.0 – validate pandas, dask, modin, and koalas dataframes | 2021-11-17

    * adds support for mypy static type-linting if you need that extra type safety


  • GitHub repo awesome-kafka

    A list about Apache Kafka

    Project mention: Resources for learning Kafka | 2021-10-30

  • GitHub repo xidel

    Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

    Project mention: How to make http request with curl on certain page after being authenticated? | 2021-10-14

    I built Xidel for such authenticated requests:

  • GitHub repo pxi

    🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.

    Project mention: New command-line parser with 35+ opt-in features developed for 5 months needs your feedback | 2021-06-05

    I have been working on a command-line parser for one of my open source projects (pxi) for about 5 months now. Today I have reached a milestone and wanted to collect feedback before I move on:

  • GitHub repo lithops

    An open source framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud.

    Project mention: [D] For those of you who don't own a GPU, how do you run your experiments or train your models? | 2021-12-19

    At work for non-ML/non-GPU stuff I've been using Lithops for running code on dynamically-provisioned cloud resources (serverless or VM). It pickles your code & runtime variables, sends them to cloud storage, runs the code & downloads the results, all relatively transparently. You're just calling Python functions with Python objects on your local computer and not having to worry about deploying your code, packaging your data, etc. Better still, you can scale up for things like hyperparameter sweeps by just dispatching more calls in parallel, and it will provision more resources.
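    The workflow this comment describes (serialize the inputs, run the call elsewhere, fetch the results, fan out many calls for a sweep) can be imitated locally as a rough sketch. This does not use Lithops: a thread pool stands in for the cloud workers, and `simulate_trial`, `run_remote`, and `dispatch` are hypothetical names. Standard `pickle` stores functions by reference only, so this sketch serializes just the arguments and results, not the code itself.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def simulate_trial(depth: int) -> int:
    """Hypothetical stand-in for one expensive experiment."""
    return depth * depth


def run_remote(func, payload: bytes) -> bytes:
    """Pretend remote worker: unpickle the arguments, run, pickle the result."""
    args = pickle.loads(payload)
    return pickle.dumps(func(*args))


def dispatch(func, arg_list, max_workers: int = 4) -> list:
    """Send one serialized call per argument and collect results in order."""
    payloads = [pickle.dumps((arg,)) for arg in arg_list]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        raw_results = list(pool.map(lambda p: run_remote(func, p), payloads))
    return [pickle.loads(raw) for raw in raw_results]


# A small "hyperparameter sweep": scaling up means passing a longer list.
sweep = [1, 2, 3, 4]
scores = dispatch(simulate_trial, sweep)
```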

  • GitHub repo convtools-ita

    convtools is a python library to declaratively define conversions for processing collections, doing complex aggregations and joins.

    Project mention: Framework for Data ETL with multiple export templates ? | 2021-07-14

    If you are considering low-level options and looking for high flexibility and small data processing overheads, you can check out convtools library.

  • GitHub repo forte

    Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project:

    Project mention: Building Modular and Re-purposable NLP Pipelines | 2021-03-02

    Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline with a few lines of python and extend them to different domains.

  • GitHub repo utah

    Dataframe structure and operations in Rust

  • GitHub repo distributed-fork

    A distributed data processing framework in Haskell.

  • GitHub repo Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

    Project mention: Open source contributions for a Data Engineer? | 2021-04-16

    Always open to accepting contributions to my project (Skytrax Data Warehouse). If you are into data stuff, support my work on YouTube as well (One Developer Pirate); I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to be released next week.

  • GitHub repo prosto

    Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: No-Code Self-Service BI/Data Analytics Tool | 2021-11-13

    Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like airtable, fieldbook, rowshare, Power BI etc.) were thought of as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE``. We get a graph of column computations similar to the graph of cell dependencies in spreadsheets.
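    To make the idea of a column computation graph concrete, here is a small Python sketch that evaluates derived columns in dependency order. It is not Prosto's API; `column_defs` and `evaluate_columns` are hypothetical names, the sample values are invented, and the second formula is read as the product of ColumnA and ColumnE.

```python
from graphlib import TopologicalSorter

# Each derived column maps to (input column names, function of those inputs).
column_defs = {
    "ColumnA": (("ColumnB", "ColumnC"), lambda b, c: b + c),
    "ColumnD": (("ColumnA", "ColumnE"), lambda a, e: a * e),
}


def evaluate_columns(table: dict, defs: dict) -> dict:
    """Return a copy of `table` with derived columns added in dependency order."""
    result = {name: list(values) for name, values in table.items()}
    # Graph of column dependencies: column -> columns it is computed from.
    graph = {name: set(inputs) for name, (inputs, _) in defs.items()}
    for name in TopologicalSorter(graph).static_order():
        if name not in defs:
            continue  # base column, already present in the table
        inputs, func = defs[name]
        rows = zip(*(result[column] for column in inputs))
        result[name] = [func(*row) for row in rows]
    return result


table = {"ColumnB": [1, 2], "ColumnC": [10, 20], "ColumnE": [3, 4]}
computed = evaluate_columns(table, column_defs)
```

    ColumnD is only evaluated after ColumnA exists, which is the same ordering a spreadsheet applies to dependent cells.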

    Yet, the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems provided different answers to this question, but all of them are highly specific and rather limited.

    Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach. The relational model works with sets (rows of tables) and not with columns.

    One generic approach to working with columns in multiple tables is provided in the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the Prosto data processing toolkit, which is an alternative to map-reduce and SQL:

    It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no joins and no groupby operations are needed and this significantly simplifies and makes more natural the task of data transformations.
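    As a rough illustration of defining a column in one table from a column in another without materializing a join, consider the following plain-Python sketch. It does not use Prosto; the tables, the helper names `add_link_column` and `add_aggregate_column`, and the data are all invented for the example.

```python
# Two tables stored as lists of row dictionaries (hypothetical sample data).
orders = [
    {"customer": "ann", "amount": 30},
    {"customer": "bob", "amount": 5},
    {"customer": "ann", "amount": 12},
]
customers = [{"name": "ann"}, {"name": "bob"}]


def add_link_column(facts, targets, fact_key, target_key, link_name):
    """Add a column whose value is the target row each fact row maps to."""
    index = {row[target_key]: row for row in targets}
    for row in facts:
        row[link_name] = index.get(row[fact_key])


def add_aggregate_column(targets, facts, link_name, source, name, func):
    """Define a target-table column as a function of a fact-table column."""
    for target in targets:
        values = [fact[source] for fact in facts if fact[link_name] is target]
        target[name] = func(values)


# The link column is a function from orders to customers ...
add_link_column(orders, customers, "customer", "name", "customer_row")
# ... and customers.total is defined through that function.
add_aggregate_column(customers, orders, "customer_row", "amount", "total", sum)
```

    Both steps add columns to existing tables; no combined orders-customers table is ever produced.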

    Moreover, now it provides Column-SQL, which makes it even easier to define new columns in terms of other columns:

  • GitHub repo ux-dataflow

    UX-Dataflow is a streaming capable data multiplexer that allows you to aggregate data and then process it using a Chain of Responsibility design pattern.
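    For readers unfamiliar with the Chain of Responsibility pattern named above, here is a generic Python sketch of the idea: each handler either deals with an item or passes it along. It does not reflect UX-Dataflow's own API; the handler names, event fields, and routing labels are invented.

```python
from typing import Callable, Optional

Handler = Callable[[dict], Optional[str]]


def handle_error(event: dict) -> Optional[str]:
    """Accept only events flagged as errors."""
    return "alert" if event.get("level") == "error" else None


def handle_metric(event: dict) -> Optional[str]:
    """Accept only events that carry a numeric value."""
    return "metrics" if "value" in event else None


def handle_default(event: dict) -> Optional[str]:
    """Last link in the chain: accept anything still unhandled."""
    return "archive"


def process(event: dict, chain: list) -> str:
    """Pass the event down the chain until one handler accepts it."""
    for handler in chain:
        outcome = handler(event)
        if outcome is not None:
            return outcome
    raise ValueError("no handler accepted the event")


CHAIN = [handle_error, handle_metric, handle_default]

# Events aggregated from two hypothetical sources, then processed one by one.
source_a = [{"level": "error", "msg": "disk full"}]
source_b = [{"value": 3}, {"msg": "hello"}]
routed = [process(event, CHAIN) for event in source_a + source_b]
```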

    Project mention: UX Dataflow is a streaming capable data multiplexer | 2021-04-14

  • GitHub repo go-rosbag

    Rosbag parser written in pure Go

    Project mention: Analyze Robotics Data in Pure Go | 2021-03-29

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-12-22.

What are some of the best open-source Data processing projects? This list will help you:

 #  Project                 Stars
 1  miller                  4,918
 2  awesome-web-scraping    4,546
 3  Activeloop Hub          4,200
 4  DALI                    3,674
 5  rust-ndarray            2,053
 6  dasel                   1,757
 7  broadway                1,719
 8  DialoGPT                1,526
 9  pandera                   933
10  awesome-kafka             466
11  xidel                     431
12  pxi                       257
13  lithops                   181
14  convtools-ita             178
15  forte                     154
16  utah                      133
17  distributed-fork          111
18  Skytrax-Data-Warehouse     95
19  prosto                     52
20  ux-dataflow                 5
21  go-rosbag                   5