Top 14 Data processing Open-Source Projects
List of libraries, tools and APIs for web scraping and data processing.
Project mention: A central repository for scrapping scripts | reddit.com/r/webscraping | 2021-02-22
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.
Project mention: Consultare un databate XML, JSON, CVS o RDF | reddit.com/r/ItalyInformatica | 2021-03-31
ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations.
Project mention: Linfa has a website now! | reddit.com/r/rust | 2021-03-08
Well, you can represent categorical values in `ndarray` for sure (even structured arrays [here](https://github.com/rust-ndarray/ndarray/issues/32)), but the memory has to be contiguous for BLAS/LAPACK, and therefore it is impossible to mix continuous and categorical values. I was thinking that we could emulate categorical values with a descriptor field for the type of each feature and then just use floats to represent them.
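The idea in the comment above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Rust, and it does not use the `ndarray` or Linfa APIs; the names `FeatureKind`, `encode_column`, and `decode_value` are hypothetical and exist only for this example.

```python
from enum import Enum


class FeatureKind(Enum):
    """Hypothetical descriptor: how a float column should be interpreted."""
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


def encode_column(values):
    """Encode one column as floats and return (floats, kind, categories).

    Numeric columns pass through unchanged; any other column is mapped to
    category codes 0.0, 1.0, ... in order of first appearance.
    """
    if all(isinstance(v, (int, float)) for v in values):
        return [float(v) for v in values], FeatureKind.CONTINUOUS, []
    categories = []
    codes = []
    for v in values:
        if v not in categories:
            categories.append(v)
        codes.append(float(categories.index(v)))
    return codes, FeatureKind.CATEGORICAL, categories


def decode_value(code, kind, categories):
    """Recover the original value from its float representation."""
    if kind is FeatureKind.CATEGORICAL:
        return categories[int(code)]
    return code


# Two features stored as plain floats, plus one descriptor per feature.
heights, height_kind, _ = encode_column([1.72, 1.80, 1.65])
colors, color_kind, color_cats = encode_column(["red", "blue", "red"])
```

Because every feature ends up as floats, the columns could in principle share one contiguous buffer, while the per-feature descriptor records which columns hold category codes rather than measurements.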
Concurrent and multi-stage data ingestion and data processing with Elixir (by dashbitco).
Project mention: Open-source Deep Dive: Broadway | reddit.com/r/opensource | 2021-04-12
I am very happy to announce that the second article in this series has been completed after about a month and a half of research, planning, and writing! The project is Broadway, an Elixir library for building data processing pipelines for data sources like message queues.
Large-scale pretraining for dialogue
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Project mention: Search web freely from command line? | reddit.com/r/commandline | 2021-04-01
xidel is nice for crawling / scraping sites from the command line
🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
Project mention: List of JSON tools for command line | reddit.com/r/commandline | 2021-03-27
Dataframe structure and operations in Rust
A distributed data processing framework in Haskell.
Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
Project mention: Building Modular and Re-purposable NLP Pipelines | reddit.com/r/learnmachinelearning | 2021-03-02
Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline with a few lines of Python and extend them to different domains.
A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.
Project mention: Open source contributions for a Data Engineer? | reddit.com/r/dataengineering | 2021-04-16
Always open to accepting contributions to my project (Skytrax Data Warehouse). If you are into data stuff, support my work on YouTube as well (One Developer Pirate); I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to be released next week.
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby.
Project mention: NoSQL Data Modeling Techniques | news.ycombinator.com | 2021-04-10
> This is closer to the way that humans perceive the world — mapping between whatever aspect of external reality you are interested in and the data model is an order of magnitude easier than with relational databases.
One approach to modeling data based on mappings (mathematical functions) is the concept-oriented model [1] implemented in [2]. Its main feature is that it gets rid of joins, groupby and map-reduce by manipulating data using operations with functions (mappings).
> Everything is pre-joined — you don’t have to disassemble objects into normalised tables and reassemble them with joins.
One old related general idea is to assume the existence of a universal relation. Such an approach is referred to as the universal relation model (URM) [3, 4].
[1] A. Savinov, Concept-oriented model: Modeling and processing data using functions, Eprint: arXiv:1911.07225 [cs.DB], 2019 https://www.researchgate.net/publication/337336089_Concept-o...
[2] https://github.com/asavinov/prosto Prosto Data Processing Toolkit: No join-groupby, No map-reduce
[3] R. Fagin, A.O. Mendelzon and J.D. Ullman, A Simplified Universal Relation Assumption and Its Properties. ACM Trans. Database Syst., 7(3), 343-360 (1982).
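To make the contrast with joins and group-by concrete, here is a minimal sketch in plain Python. It does not use the Prosto API; the tables, the `link` and `aggregate` helpers, and all column names are hypothetical and only illustrate treating a column as a function from rows to values.

```python
# Two hypothetical tables, each a list of row dicts.
products = [
    {"name": "beer", "price": 2.0},
    {"name": "chips", "price": 1.5},
]
items = [
    {"product_name": "beer", "quantity": 3},
    {"product_name": "chips", "quantity": 2},
    {"product_name": "beer", "quantity": 1},
]


def link(source, target, source_key, target_key):
    """Return a function column: for each source row, the target row it maps to."""
    index = {row[target_key]: row for row in target}
    return [index[row[source_key]] for row in source]


def aggregate(target, source, link_column, value, initial=0.0):
    """Return a function column on target, accumulated over the source rows
    that the link column maps to each target row."""
    totals = {id(row): initial for row in target}
    for row, linked in zip(source, link_column):
        totals[id(linked)] += value(row, linked)
    return [totals[id(row)] for row in target]


# A link column takes the place of a join: item -> product.
item_product = link(items, products, "product_name", "name")

# An aggregate column takes the place of a group-by: product -> total revenue.
revenue = aggregate(
    products, items, item_product,
    value=lambda item, product: item["quantity"] * product["price"],
)
```

No joined intermediate table is ever materialized here: both results are just new columns (functions) attached to the existing tables.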
Rosbag parser written in pure Go.
Project mention: Analyze Robotics Data in Pure Go | news.ycombinator.com | 2021-03-29
UX-Dataflow is a streaming-capable data multiplexer that allows you to aggregate data and then process it using a Chain of Responsibility design pattern.
Project mention: UX Dataflow is a streaming capable data multiplexer | reddit.com/r/angular_rust | 2021-04-14
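The Chain of Responsibility pattern mentioned in this description can be sketched as follows. This is a generic Python illustration, not UX-Dataflow's API; `Handler`, `DropEmpty`, `Uppercase`, and the sample sources are hypothetical names chosen for the example.

```python
class Handler:
    """One link in the chain; passes its result on unless it drops the message."""

    def __init__(self, successor=None):
        self.successor = successor

    def handle(self, message):
        result = self.process(message)
        # Stop if this handler dropped the message or the chain has ended.
        if result is None or self.successor is None:
            return result
        return self.successor.handle(result)

    def process(self, message):
        # Default behaviour: pass the message through unchanged.
        return message


class DropEmpty(Handler):
    """Drops messages that contain only whitespace."""

    def process(self, message):
        return message if message.strip() else None


class Uppercase(Handler):
    """Transforms the message to upper case."""

    def process(self, message):
        return message.upper()


# Aggregate messages from two hypothetical sources, then run each one
# through the chain: DropEmpty -> Uppercase.
source_a = ["temp=21", "  "]
source_b = ["hum=40"]
merged = [*source_a, *source_b]

chain = DropEmpty(Uppercase())
processed = [out for out in map(chain.handle, merged) if out is not None]
```

Each handler only knows about its successor, so stages can be added, removed, or reordered without changing the others.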