Top 21 Data processing Open-Source Projects
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
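For readers without mlr installed, the kind of name-indexed operation Miller performs (here: keeping two columns of a CSV and sorting by one of them) can be approximated in plain Python; the field names below are illustrative:

```python
import csv
import io

# A small CSV, addressed by column name rather than position.
raw = "name,age,city\nbob,30,NYC\nalice,25,LA\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Equivalent in spirit to: mlr --csv cut -f name,age then sort -f name
kept = sorted(
    ({"name": r["name"], "age": r["age"]} for r in rows),
    key=lambda r: r["name"],
)
print(kept)  # [{'name': 'alice', 'age': '25'}, {'name': 'bob', 'age': '30'}]
```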
List of libraries, tools and APIs for web scraping and data processing.
Project mention: A central repository for scrapping scripts | reddit.com/r/webscraping | 2021-02-22
Dataset format for AI. Build, manage, and visualize datasets for deep learning. Stream data in real time to PyTorch/TensorFlow and version-control it. https://activeloop.ai (by activeloopai)
Project mention: The hand-picked selection of the best Python libraries released in 2021 | reddit.com/r/Python | 2021-12-21
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
Project mention: [D] Efficiently loading videos in PyTorch without extracting frames | reddit.com/r/MachineLearning | 2021-10-26
ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
Project mention: Enzyme: Towards state-of-the-art AutoDiff in Rust | reddit.com/r/rust | 2021-12-12
I don't think any of the major ML projects have GPU acceleration because ndarray doesn't support it.
Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
Project mention: How to convert a JSON file to CSV file with Golang. | reddit.com/r/golang | 2021-12-08
If you're just looking for a utility to do it (and a bunch of other stuff), there's dasel.
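dasel does the format conversion itself; for reference, the underlying JSON-to-CSV step that the linked thread asks about can be sketched with the Python standard library (the field names are illustrative):

```python
import csv
import io
import json

# Parse a JSON array of flat records.
records = json.loads('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')

# Write them out as CSV, with the keys as the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```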
Concurrent and multi-stage data ingestion and data processing with Elixir
Project mention: How we sync Stripe to Postgres | dev.to | 2021-07-08
This was a great excuse to use Elixir's Broadway. A Broadway pipeline consists of one producer and one or more workers. The producer is in charge of producing jobs. The workers consume and work those jobs, each working in parallel. Broadway gives us a few things out of the box:
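Broadway itself is an Elixir library, but the one-producer / parallel-workers shape described above can be sketched in Python with the standard library (all names here are illustrative, not Broadway's API):

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    # Each worker consumes jobs until the producer signals completion.
    while True:
        job = jobs.get()
        if job is None:
            break
        with lock:
            results.append(job * 2)  # stand-in for real processing

# Several workers consume jobs in parallel...
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

# ...while one producer enqueues them, then signals completion.
for job in range(10):
    jobs.put(job)
for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()

print(sorted(results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```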
Large-scale pretraining for dialogue
Project mention: I made a Python tool to help you know what to say! | reddit.com/r/socialskills | 2021-10-30
I learned about GPT-3 and its strength as a generative model but couldn't access it yet (I can't afford the API). Thankfully I found DialoGPT, a GPT-2-based pre-trained model that was trained on Reddit data.
A light-weight, flexible, and expressive data validation library for dataframes
A list about Apache Kafka
Project mention: Resources for learning Kafka | reddit.com/r/devops | 2021-10-30
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Project mention: How to make http request with curl on certain page after being authenticated? | reddit.com/r/commandline | 2021-10-14
I built Xidel for such authenticated requests:
🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
Project mention: New command-line parser with 35+ opt-in features developed for 5 months needs your feedback | reddit.com/r/node | 2021-06-05
I have been working on a command-line parser for one of my open source projects (pxi) for about 5 months now. Today I have reached a milestone and wanted to collect feedback before I move on:
An open source framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud.
Project mention: [D] For those of you who don't own a GPU, how do you run your experiments or train your models? | reddit.com/r/MachineLearning | 2021-12-19
At work for non-ML/non-GPU stuff I've been using Lithops for running code on dynamically-provisioned cloud resources (serverless or VM). It pickles your code & runtime variables, sends them to cloud storage, runs the code & downloads the results, all relatively transparently. You're just calling Python functions with Python objects on your local computer and not having to worry about deploying your code, packaging your data, etc. Better still, you can scale up for things like hyperparameter sweeps by just dispatching more calls in parallel, and it will provision more resources.
convtools is a Python library for declaratively defining conversions for processing collections, doing complex aggregations and joins.
Project mention: Framework for Data ETL with multiple export templates ? | reddit.com/r/Python | 2021-07-14
If you are considering low-level options and looking for high flexibility and small data-processing overheads, you can check out the convtools library.
Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
Project mention: Building Modular and Re-purposable NLP Pipelines | reddit.com/r/learnmachinelearning | 2021-03-02
Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline in a few lines of Python and extends it to different domains.
Dataframe structure and operations in Rust
A distributed data processing framework in Haskell.
A full data warehouse infrastructure with ETL pipelines running inside Docker, using Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase to serve data visualization needs such as analytical dashboards.
Project mention: Open source contributions for a Data Engineer? | reddit.com/r/dataengineering | 2021-04-16
Always open to contributions to my project (Skytrax Data Warehouse). If you are into data stuff, support my work on YouTube as well (One Developer Pirate); I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to be released next week.
Prosto is a data processing toolkit that radically changes how data is processed, by relying heavily on functions and operations with functions, as an alternative to map-reduce and join-groupby.
Project mention: No-Code Self-Service BI/Data Analytics Tool | news.ycombinator.com | 2021-11-13
Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI, etc.) were conceived as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table: from ``ColumnA=ColumnB+ColumnC, ColumnD=ColumnA*ColumnE`` we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.
Yet the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example: ``Table1::ColumnA=FUNCTION(Table2::ColumnB, Table3::ColumnC)``. Different systems have provided different answers to this question, but all of them are highly specific and rather limited.
Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach: the relational model works with sets (rows of tables), not with columns.
One generic approach to working with columns across multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the *Prosto* data processing toolkit, which is an alternative to map-reduce and SQL:
It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no join and no groupby operations are needed, which significantly simplifies data transformations and makes them more natural.
Moreover, it now provides *Column-SQL*, which makes it even easier to define new columns in terms of other columns.
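The core idea (defining a column in one table as a function over columns of another table, without a join or a groupby) can be sketched in plain Python, with dicts as toy tables; the names here are illustrative, not Prosto's API:

```python
# Toy "tables": lists of row dicts.
orders = [
    {"order_id": 1, "product_id": "p1", "qty": 2},
    {"order_id": 2, "product_id": "p2", "qty": 1},
]
products = [
    {"product_id": "p1", "price": 10.0},
    {"product_id": "p2", "price": 4.0},
]

# A link column is a function from rows of one table to rows of another.
product_by_id = {p["product_id"]: p for p in products}

# A new column in `orders`, defined in terms of a column in `products`,
# roughly: Orders::total = qty * link(Products)::price
for row in orders:
    row["total"] = row["qty"] * product_by_id[row["product_id"]]["price"]

print([r["total"] for r in orders])  # [20.0, 4.0]
```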
UX-Dataflow is a streaming-capable data multiplexer that allows you to aggregate data and then process it using a Chain of Responsibility design pattern.
Project mention: UX Dataflow is a streaming capable data multiplexer | reddit.com/r/angular_rust | 2021-04-14
Rosbag parser written in pure Go
Project mention: Analyze Robotics Data in Pure Go | news.ycombinator.com | 2021-03-29
Data processing related posts
Fq: Jq for Binary Formats
19 projects | news.ycombinator.com | 22 Dec 2021
Miller – tool for querying, shaping, reformatting data in CSV, TSV, and JSON
8 projects | news.ycombinator.com | 22 Dec 2021
Enzyme: Towards state-of-the-art AutoDiff in Rust
3 projects | reddit.com/r/rust | 12 Dec 2021
Announcing Rust CUDA 0.2
3 projects | reddit.com/r/rust | 5 Dec 2021
No-Code Self-Service BI/Data Analytics Tool
1 project | news.ycombinator.com | 13 Nov 2021
Show HN: Hamilton, a Microframework for Creating Dataframes
6 projects | news.ycombinator.com | 8 Nov 2021
Signal processing library
7 projects | reddit.com/r/rust | 6 Nov 2021
What are some of the best open-source Data processing projects? This list will help you: