Top 14 Data processing Open-Source Projects
-
miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Project mention: Querying an XML, JSON, CSV, or RDF database | reddit.com/r/ItalyInformatica | 2021-03-31
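To illustrate what "name-indexed" means in the description above, here is a minimal sketch in plain Python rather than Miller itself: records are addressed by field name instead of column position, so sorting and column selection read like the `sort` and `cut` analogy. The sample data and the helper name `sort_and_cut` are assumptions made up for this example.

```python
import csv
import io

# Hypothetical sample data; the field names are invented for illustration.
SAMPLE = """name,dept,salary
ana,eng,120
bo,ops,90
cy,eng,100
"""

def sort_and_cut(text, sort_key, keep):
    """Sort records by a named numeric field, then keep only the named fields."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: float(r[sort_key]))
    return [{k: r[k] for k in keep} for r in rows]

result = sort_and_cut(SAMPLE, "salary", ["name", "salary"])
for record in result:
    print(record)
```

Because every step refers to fields by name, reordering the columns in the input would not change the result.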
-
rust-ndarray
ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
Well, you can certainly represent categorical values in `ndarray` (even structured arrays, see [here](https://github.com/rust-ndarray/ndarray/issues/32)), but the memory has to be contiguous for BLAS/LAPACK, so it is impossible to mix continuous and categorical values in one array. I was thinking that we could emulate categorical values with a descriptor field for the type of each feature and then just use floats to represent them.
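The descriptor idea in the comment above could look roughly like the following. This is a sketch in plain Python, not the `ndarray` API: every name here (`FeatureDescriptor`, `encode`, `decode_value`) is hypothetical, and storing a category as its index converted to a float is just one possible encoding.

```python
from dataclasses import dataclass

@dataclass
class FeatureDescriptor:
    """Hypothetical per-column metadata kept next to an all-float matrix."""
    name: str
    kind: str               # "continuous" or "categorical"
    categories: tuple = ()  # labels, used only for categorical features

def encode(rows, descriptors):
    """Return rows of floats; a categorical label becomes its index as a float."""
    matrix = []
    for row in rows:
        encoded = []
        for value, desc in zip(row, descriptors):
            if desc.kind == "categorical":
                encoded.append(float(desc.categories.index(value)))
            else:
                encoded.append(float(value))
        matrix.append(encoded)
    return matrix

def decode_value(x, desc):
    """Map a stored float back to its label when the feature is categorical."""
    if desc.kind == "categorical":
        return desc.categories[int(x)]
    return x

descriptors = [
    FeatureDescriptor("height_cm", "continuous"),
    FeatureDescriptor("color", "categorical", ("red", "green", "blue")),
]
matrix = encode([(170.5, "green"), (182.0, "red")], descriptors)
print(matrix)
print(decode_value(matrix[0][1], descriptors[1]))
```

The matrix stays homogeneous, which is what a contiguous numeric buffer requires, while the descriptors carry the information needed to interpret each column.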
-
I am very happy to announce that the second article in this series has been completed after about a month and a half of research, planning, and writing! The project is Broadway, an Elixir library for building data processing pipelines for data sources like message queues.
-
xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
xidel is nice for crawling / scraping sites from the command line
-
pxi
🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
-
forte
Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
Project mention: Building Modular and Re-purposable NLP Pipelines | reddit.com/r/learnmachinelearning | 2021-03-02
Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline with a few lines of Python and extend it to different domains.
-
Skytrax-Data-Warehouse
A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.
Project mention: Open source contributions for a Data Engineer? | reddit.com/r/dataengineering | 2021-04-16
Always open to accepting contributions to my project (Skytrax Data Warehouse). If you are into data stuff, support my work on YouTube as well (One Developer Pirate); I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to be released next week.
-
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
> This is closer to the way that humans perceive the world — mapping between whatever aspect of external reality you are interested in and the data model is an order of magnitude easier than with relational databases.
One approach to modeling data based on mappings (mathematical functions) is the concept-oriented model [1] implemented in [2]. Its main feature is that it gets rid of joins, groupby and map-reduce by manipulating data using operations with functions (mappings).
> Everything is pre-joined — you don’t have to disassemble objects into normalised tables and reassemble them with joins.
An older, related idea is to assume the existence of a universal relation. This approach is known as the universal relation model (URM) [3, 4].
[1] A. Savinov, Concept-oriented model: Modeling and processing data using functions, Eprint: arXiv:1911.07225 [cs.DB], 2019 https://www.researchgate.net/publication/337336089_Concept-o...
[2] https://github.com/asavinov/prosto Prosto Data Processing Toolkit: No join-groupby, No map-reduce
[3] https://en.wikipedia.org/wiki/Universal_relation_assumption
[4] R. Fagin, A.O. Mendelzon and J.D. Ullman, A Simplified Universal Relation Assumption and Its Properties. ACM Trans. Database Syst., 7(3), 343-360 (1982).
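As a rough illustration of "operations with functions" replacing joins and group-by, here is a sketch in plain Python. It does not use Prosto's actual API; the data and the function names (`group_of`, `total_price`, `share_of_group`) are assumptions invented for this example.

```python
# Hypothetical data: each "table" is a list of records.
groups = [{"id": "a"}, {"id": "b"}]
items = [
    {"name": "x", "group_id": "a", "price": 2.0},
    {"name": "y", "group_id": "a", "price": 3.0},
    {"name": "z", "group_id": "b", "price": 5.0},
]

# Link function: maps each item to the group record it belongs to.
# It plays the role a join would play in a relational query.
group_by_id = {g["id"]: g for g in groups}

def group_of(item):
    return group_by_id[item["group_id"]]

# Aggregate function: a derived column on groups, computed from the
# items that map to each group. It plays the role of a group-by.
def total_price(group):
    return sum(i["price"] for i in items if group_of(i) is group)

# Calculated function: a derived column on items that follows the link.
def share_of_group(item):
    return item["price"] / total_price(group_of(item))

totals = {g["id"]: total_price(g) for g in groups}
shares = {i["name"]: share_of_group(i) for i in items}
print(totals)
print(shares)
```

No intermediate joined or grouped table is ever built; new columns are defined as functions composed from existing ones.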
-
ux-dataflow
UX-Dataflow is a streaming capable data multiplexer that allows you to aggregate data and then process it using a Chain of Responsibility design pattern.
Project mention: UX Dataflow is a streaming capable data multiplexer | reddit.com/r/angular_rust | 2021-04-14
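The Chain of Responsibility pattern named in the description can be sketched as follows. This is a generic Python illustration, not UX-Dataflow's own code or API: the handler names and message shape are hypothetical, and the "multiplexing" here is simply reading the input streams one after another.

```python
from itertools import chain as merge_streams

def parse_number(msg):
    """Handle messages whose value is a string of digits."""
    value = msg.get("value")
    if isinstance(value, str) and value.isdigit():
        return {**msg, "value": int(value), "handled_by": "parse_number"}
    return None  # not responsible, pass to the next handler

def passthrough_int(msg):
    """Handle messages whose value is already an integer."""
    if isinstance(msg.get("value"), int):
        return {**msg, "handled_by": "passthrough_int"}
    return None

def fallback(msg):
    """Last link in the chain: accepts anything the others declined."""
    return {**msg, "value": None, "handled_by": "fallback"}

def process(streams, handlers):
    """Merge the input streams and pass each message down the handler chain."""
    for msg in merge_streams(*streams):
        for handler in handlers:
            result = handler(msg)
            if result is not None:
                yield result
                break  # the first handler that accepts the message wins

out = list(process(
    [[{"value": "7"}], [{"value": 3}, {"value": "x"}]],
    [parse_number, passthrough_int, fallback],
))
for message in out:
    print(message)
```

Each handler decides only whether it is responsible for a message, so handlers can be added, removed, or reordered without touching the others.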
Index
What are some of the best open-source Data processing projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | awesome-web-scraping | 4,103 |
2 | miller | 2,710 |
3 | rust-ndarray | 1,705 |
4 | broadway | 1,353 |
5 | DialoGPT | 1,219 |
6 | xidel | 346 |
7 | pxi | 249 |
8 | utah | 130 |
9 | distributed-fork | 111 |
10 | forte | 100 |
11 | Skytrax-Data-Warehouse | 55 |
12 | prosto | 25 |
13 | go-rosbag | 3 |
14 | ux-dataflow | 2 |