Top 8 data-preprocessing Open-Source Projects

skrub

1 1,010 8.9 Python

Prepping tables for machine learning

Project mention: Are there any Python libraries for Data Cleansing ? | /r/dataengineering | 2023-12-08

machinelearnjs

1 536 0.0 TypeScript

Machine Learning library for the web and Node.
desbordante-core

2 348 9.5 C++

Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.

Project mention: Show HN: Desbordante 1.0.0 Released | news.ycombinator.com | 2023-12-11

convtools-ita

3 183 0.0 Python

convtools is a Python library for declaratively defining conversions that process collections and perform complex aggregations and joins.
PANDAS-TUTORIAL

4 158 4.0 Jupyter Notebook

Jupyter Notebooks and Data Sets for Pandas Library (by TirendazAcademy)
dali_backend

1 117 6.8 C++

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.

Project mention: Ollama releases OpenAI API compatibility | news.ycombinator.com | 2024-02-08

- While keeping power utilization below X
They will take the exported model and dynamically deploy the package to a triton instance running on your actual inference serving hardware, then generate requests to meet your SLAs to come up with the optimal model configuration. You even get exported metrics and pretty reports for every configuration used/attempted. You can take the same exported package, change the SLA params, and it will automatically re-generate the configuration for you.
- Performance on a completely different level. TensorRT-LLM especially is extremely new and very early but already at high scale you can start to see > 10k RPS on a single node.
- gRPC support. Especially when using pre/post processing, ensemble, etc you can configure clients programmatically to use the individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available anywhere else. gRPC could probably be thought of as AsyncLLMEngine, it can abstract actual input/output or expose raw in/out so models, tokenizers, decoders, etc can send/receive raw data/numpy/tensors.
- DALI support[5]. Combined with everything above, you can add DALI in the processing chain to do things like take input image/audio/etc, copy to GPU once, GPU accelerate scaling/conversion/resampling/whatever, and get output.
vLLM and HF TGI are very cool and I use them in certain cases. The fact you can give them a HF model and they just fire up with a single command and offer good performance is very impressive but there are an untold number of reasons these providers use Triton. It's in a class of its own.
[0] - https://mistral.ai/news/la-plateforme/
[1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...
[2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...
[3] - https://github.com/triton-inference-server/model_navigator
[4] - https://github.com/triton-inference-server/client/blob/main/...
[5] - https://github.com/triton-inference-server/dali_backend

prosto

9 89 3.6 Python

Prosto is a data processing toolkit that radically changes how data is processed by relying heavily on functions and operations on functions, as an alternative to map-reduce and join-groupby.
degradr

1 11 8.0 Python

Python library for realistically degrading images.

Project mention: How to generate realistic PSFs for camera lenses? | /r/Optics | 2023-09-07

(I do plan on making my results open source here, but it's obviously still a work in progress)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

data-preprocessing related posts

Framework for Data ETL with multiple export templates ?

1 project | /r/Python | 14 Jul 2021
convtools - define conversions, aggregations and joins in functional style (ad-hoc code generation)

1 project | /r/Python | 8 Jul 2021
Writing concise functional code in python

2 projects | /r/Python | 6 Jul 2021