#Data processing

Open-source projects categorized as Data processing
Related topics: #Python #JSON #CLI #Command-line #CSV

Top 14 Data processing Open-Source Projects

  • GitHub repo awesome-web-scraping

    List of libraries, tools and APIs for web scraping and data processing.

    Project mention: A central repository for scrapping scripts | reddit.com/r/webscraping | 2021-02-22
  • GitHub repo miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

    Project mention: Querying an XML, JSON, CSV or RDF database (original title: Consultare un databate XML, JSON, CVS o RDF) | reddit.com/r/ItalyInformatica | 2021-03-31
  • GitHub repo rust-ndarray

    ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

    Project mention: Linfa has a website now! | reddit.com/r/rust | 2021-03-08

    Well, you can certainly represent categorical values in `ndarray` (even structured arrays, see [here](https://github.com/rust-ndarray/ndarray/issues/32)), but the memory has to be contiguous for BLAS/LAPACK, so it is impossible to mix continuous and categorical values. I was thinking that we could emulate categorical values with a descriptor field for the type of each feature and then just use floats to represent them.
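
    The descriptor idea in the comment above could look roughly like the following sketch. It is written in plain Python rather than Rust, and every name in it (`FeatureKind`, `FeatureDescriptor`, `encode_rows`) is hypothetical; it is not part of `ndarray` or Linfa. Each categorical label is stored as the float index of its category, so every value in the matrix is a float.

```python
from dataclasses import dataclass
from enum import Enum


class FeatureKind(Enum):
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


@dataclass
class FeatureDescriptor:
    """Hypothetical per-column descriptor kept next to the float matrix."""
    name: str
    kind: FeatureKind
    categories: tuple = ()  # labels; only used for categorical features

    def encode(self, value):
        # Categorical labels become the float index of the label.
        if self.kind is FeatureKind.CATEGORICAL:
            return float(self.categories.index(value))
        return float(value)

    def decode(self, stored):
        # Reverse of encode: turn a stored float back into its label.
        if self.kind is FeatureKind.CATEGORICAL:
            return self.categories[int(stored)]
        return stored


def encode_rows(descriptors, rows):
    """Encode mixed rows into an all-float matrix (list of lists)."""
    return [[d.encode(v) for d, v in zip(descriptors, row)] for row in rows]


descriptors = [
    FeatureDescriptor("height_cm", FeatureKind.CONTINUOUS),
    FeatureDescriptor("colour", FeatureKind.CATEGORICAL, ("red", "green", "blue")),
]
rows = [(172.5, "green"), (160.0, "blue")]
matrix = encode_rows(descriptors, rows)
print(matrix)  # [[172.5, 1.0], [160.0, 2.0]]
```

    The descriptors travel alongside the matrix, so an algorithm can check `kind` before treating a column as numeric.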

  • GitHub repo broadway

    Concurrent and multi-stage data ingestion and data processing with Elixir (by dashbitco)

    Project mention: Open-source Deep Dive: Broadway | reddit.com/r/opensource | 2021-04-12

    I am very happy to announce that the second article in this series has been completed after about a month and a half of research, planning, and writing! The project is Broadway, an Elixir library for building data processing pipelines for data sources like message queues.

  • GitHub repo DialoGPT

    Large-scale pretraining for dialogue

  • GitHub repo xidel

    Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

    Project mention: Search web freely from command line? | reddit.com/r/commandline | 2021-04-01

    xidel is nice for crawling / scraping sites from the command line

  • GitHub repo pxi

    🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.

    Project mention: List of JSON tools for command line | reddit.com/r/commandline | 2021-03-27
  • GitHub repo utah

    Dataframe structure and operations in Rust

  • GitHub repo distributed-fork

    A distributed data processing framework in Haskell.

  • GitHub repo forte

    Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/

    Project mention: Building Modular and Re-purposable NLP Pipelines | reddit.com/r/learnmachinelearning | 2021-03-02

    Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline with a few lines of Python and extend them to different domains.

  • GitHub repo Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

    Project mention: Open source contributions for a Data Engineer? | reddit.com/r/dataengineering | 2021-04-16

    Always open to contributions to my project (Skytrax Data Warehouse). If you are into data stuff, support my work on YouTube as well (One Developer Pirate); I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to be released next week.

  • GitHub repo prosto

    Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: NoSQL Data Modeling Techniques | news.ycombinator.com | 2021-04-10

    > This is closer to the way that humans perceive the world — mapping between whatever aspect of external reality you are interested in and the data model is an order of magnitude easier than with relational databases.

    One approach to modeling data based on mappings (mathematical functions) is the concept-oriented model [1] implemented in [2]. Its main feature is that it gets rid of joins, groupby and map-reduce by manipulating data using operations with functions (mappings).

    > Everything is pre-joined — you don’t have to disassemble objects into normalised tables and reassemble them with joins.

    One old, related general idea is to assume the existence of a universal relation. Such an approach is referred to as the universal relation model (URM) [3, 4].

    [1] A. Savinov, Concept-oriented model: Modeling and processing data using functions, Eprint: arXiv:1911.07225 [cs.DB], 2019 https://www.researchgate.net/publication/337336089_Concept-o...

    [2] https://github.com/asavinov/prosto Prosto Data Processing Toolkit: No join-groupby, No map-reduce

    [3] https://en.wikipedia.org/wiki/Universal_relation_assumption

    [4] R. Fagin, A.O. Mendelzon and J.D. Ullman, A Simplified Universal Relation Assumption and Its Properties. ACM Trans. Database Syst., 7(3), 343-360 (1982).
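
    To make "operations with functions" concrete, here is a minimal plain-Python sketch of the general idea. It does not use Prosto's actual API; the helper names (`link_column`, `calculate_column`, `aggregate_column`) and the sample data are assumptions made for illustration. A link column maps each order to its product, a calculated column is a function of a row, and an aggregate column accumulates values along the link, so no joined table is ever materialised.

```python
# Hypothetical data; none of these names come from Prosto's API.
products = [
    {"name": "pen", "price": 2.0},
    {"name": "ink", "price": 5.0},
]
orders = [
    {"product": "pen", "quantity": 3},
    {"product": "ink", "quantity": 1},
    {"product": "pen", "quantity": 2},
]


def link_column(source, target, source_key, target_key):
    """Map each source row to the position of its matching target row."""
    position = {row[target_key]: i for i, row in enumerate(target)}
    return [position[row[source_key]] for row in source]


def calculate_column(table, func):
    """Derive a new column by applying a function to every row."""
    return [func(i, row) for i, row in enumerate(table)]


def aggregate_column(target, link, values):
    """Sum source values into the target rows they are linked to."""
    totals = [0.0] * len(target)
    for target_pos, value in zip(link, values):
        totals[target_pos] += value
    return totals


order_product = link_column(orders, products, "product", "name")
order_amount = calculate_column(
    orders, lambda i, row: row["quantity"] * products[order_product[i]]["price"]
)
product_revenue = aggregate_column(products, order_product, order_amount)
print(order_product)    # [0, 1, 0]
print(order_amount)     # [6.0, 5.0, 4.0]
print(product_revenue)  # [10.0, 5.0]
```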

  • GitHub repo go-rosbag

    Rosbag parser written in pure Go

    Project mention: Analyze Robotics Data in Pure Go | news.ycombinator.com | 2021-03-29
  • GitHub repo ux-dataflow

    UX-Dataflow is a streaming capable data multiplexer that allows you to aggregate data and then process it using a Chain of Responsibility design pattern.

    Project mention: UX Dataflow is a streaming capable data multiplexer | reddit.com/r/angular_rust | 2021-04-14
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-04-16.

Index

What are some of the best open-source Data processing projects? This list will help you:

Rank Project Stars
1 awesome-web-scraping 4,103
2 miller 2,710
3 rust-ndarray 1,705
4 broadway 1,353
5 DialoGPT 1,219
6 xidel 346
7 pxi 249
8 utah 130
9 distributed-fork 111
10 forte 100
11 Skytrax-Data-Warehouse 55
12 prosto 25
13 go-rosbag 3
14 ux-dataflow 2