Data processing

Open-source projects categorized as Data processing

Top 18 Data processing Open-Source Projects

  • GitHub repo miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

    Project mention: People who spend most of your time in the terminal, what do you do? | | 2021-10-01

    And it turns out that questions like "what percent of software in Fedora Linux is under which licenses?" are easier to answer from the command line, and in general, tools like Miller (um, no personal relation) make data-crunching from the command line faster and easier than working with a spreadsheet.

  • GitHub repo awesome-web-scraping

    List of libraries, tools and APIs for web scraping and data processing.

    Project mention: A central repository for scrapping scripts | | 2021-02-22
  • GitHub repo DALI

    A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

    Project mention: [D] Efficiently loading videos in PyTorch without extracting frames | | 2021-10-26
  • GitHub repo rust-ndarray

    ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

    Project mention: Signal processing library | | 2021-11-06

    I used basic_dsp a while back and found it lacking. I was hoping to find something that uses the ndarray datatype, but I'm not seeing this yet. If you're primarily trying to learn, though, it might not really matter which library you contribute to. As for myself, I just picked the one that was most used and actively worked on at the time. However, I keep an eye on other libraries; if I see something take off, I might switch over. Either way you'll learn and can point to it as work accomplished.

  • GitHub repo broadway

    Concurrent and multi-stage data ingestion and data processing with Elixir

    Project mention: How we sync Stripe to Postgres | | 2021-07-08

    This was a great excuse to use Elixir's Broadway. A Broadway pipeline consists of one producer and one or more workers. The producer is in charge of producing jobs. The workers consume and work those jobs, each working in parallel. Broadway gives us a few things out of the box:

  • GitHub repo DialoGPT

    Large-scale pretraining for dialogue

    Project mention: I made a Python tool to help you know what to say! | | 2021-10-30

    I learned about GPT-3 and its strength as a generative model but couldn't access it yet (can't afford the API). Thankfully I found a GPT-2 based pre-trained model DialoGPT that was trained on Reddit.

  • GitHub repo pandera

    A light-weight, flexible, and expressive data validation library for dataframes

    Project mention: Show HN: Pandera 0.8.0 – validate pandas, dask, modin, and koalas dataframes | | 2021-11-17

    * adds support for mypy static type-linting if you need that extra type safety


  • GitHub repo awesome-kafka

    A list about Apache Kafka

    Project mention: Resources for learning Kafka | | 2021-10-30
  • GitHub repo xidel

    Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

    Project mention: How to make http request with curl on certain page after being authenticated? | | 2021-10-14

    I built Xidel for such authenticated requests:

  • GitHub repo pxi

    🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.

    Project mention: New command-line parser with 35+ opt-in features developed for 5 months needs your feedback | | 2021-06-05

    I have been working on a command-line parser for one of my open source projects (pxi) for about 5 months now. Today I have reached a milestone and wanted to collect feedback before I move on:

  • GitHub repo convtools-ita

    convtools is a Python library to declaratively define conversions for processing collections, doing complex aggregations and joins.

    Project mention: Framework for Data ETL with multiple export templates ? | | 2021-07-14

    If you are considering low-level options and looking for high flexibility and small data processing overheads, you can check out convtools library.

  • GitHub repo forte

    Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project:

    Project mention: Building Modular and Re-purposable NLP Pipelines | | 2021-03-02

    Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline with a few lines of Python and extend it to different domains.

  • GitHub repo utah

    Dataframe structure and operations in Rust

  • GitHub repo distributed-fork

    A distributed data processing framework in Haskell.

  • GitHub repo Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

    Project mention: Open source contributions for a Data Engineer? | | 2021-04-16

    I'm always open to contributions to my project (Skytrax Data Warehouse). If you are into data, you can also support my work on YouTube (One Developer Pirate), where I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to be released next week.

  • GitHub repo prosto

    Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: No-Code Self-Service BI/Data Analytics Tool | | 2021-11-13

    Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI etc.) were thought of as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE``. We get a graph of column computations similar to the graph of cell dependencies in spreadsheets.

    Yet the main problem is working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems provide different answers to this question, but all of them are highly specific and rather limited.

    Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach. The relational model works with sets (rows of tables), not with columns.

    One generic approach to working with columns in multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the *Prosto* data processing toolkit, which is an alternative to map-reduce and SQL:

    It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no joins and no groupby operations are needed, which makes data transformations significantly simpler and more natural.
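
    The idea of defining columns as functions, within one table and across tables, can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (add_calculated_column, add_link_column, add_aggregate_column) and the sample tables are hypothetical and are not Prosto's API.

```python
# Hypothetical illustration only: none of these names come from Prosto.
# Two tables stored as lists of row dictionaries.
orders = [
    {"id": 1, "customer": "ann", "price": 10.0, "qty": 2},
    {"id": 2, "customer": "bob", "price": 5.0, "qty": 1},
    {"id": 3, "customer": "ann", "price": 2.5, "qty": 4},
]
customers = [{"name": "ann"}, {"name": "bob"}, {"name": "cy"}]


def add_calculated_column(table, name, func):
    """Define a new column as a function of other columns in the same row."""
    for row in table:
        row[name] = func(row)


def add_link_column(table, name, key, target, target_key):
    """Define a column whose value is the matching row of another table."""
    index = {row[target_key]: row for row in target}
    for row in table:
        row[name] = index.get(row[key])  # None when no row matches


def add_aggregate_column(target, name, source, link, value, initial=0.0):
    """Define a column in `target` by accumulating `value` from the
    `source` rows that link to each target row."""
    for row in target:
        row[name] = initial
    for row in source:
        linked = row[link]
        if linked is not None:
            linked[name] += row[value]


# amount = price * qty: a column defined within one table.
add_calculated_column(orders, "amount", lambda r: r["price"] * r["qty"])
# customer_row: a column in orders that points at a row in customers.
add_link_column(orders, "customer_row", "customer", customers, "name")
# total: a column in customers defined in terms of a column in orders.
add_aggregate_column(customers, "total", orders, "customer_row", "amount")

totals = {c["name"]: c["total"] for c in customers}
print(totals)  # {'ann': 30.0, 'bob': 5.0, 'cy': 0.0}
```

    The link column plays the role a join key would play, and the aggregate column plays the role of a groupby, but both are expressed as column definitions rather than as operations that produce new tables.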

    Moreover, now it provides *Column-SQL* which makes it even easier to define new columns in terms of other columns:

  • GitHub repo ux-dataflow

    UX-Dataflow is a streaming capable data multiplexer that allows you to aggregate data and then process it using a Chain of Responsibility design pattern.

    Project mention: UX Dataflow is a streaming capable data multiplexer | | 2021-04-14
  • GitHub repo go-rosbag

    Rosbag parser written in pure Go

    Project mention: Analyze Robotics Data in Pure Go | | 2021-03-29
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-11-17.

What are some of the best open-source Data processing projects? This list will help you:

#   Project                  Stars
1   miller                   4,599
2   awesome-web-scraping     4,480
3   DALI                     3,600
4   rust-ndarray             1,978
5   broadway                 1,670
6   DialoGPT                 1,478
7   pandera                  815
8   awesome-kafka            458
9   xidel                    422
10  pxi                      257
11  convtools-ita            178
12  forte                    152
13  utah                     133
14  distributed-fork         111
15  Skytrax-Data-Warehouse   86
16  prosto                   53
17  ux-dataflow              5
18  go-rosbag                4