Top 8 data-preprocessing Open-Source Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
desbordante-core
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
-
convtools-ita
convtools is a python library to declaratively define conversions for processing collections, doing complex aggregations and joins.
-
dali_backend
The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's python API.
-
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Project mention: Are there any Python libraries for Data Cleansing ? | /r/dataengineering | 2023-12-08
- While keeping power utilization below X
They will take the exported model and dynamically deploy the package to a triton instance running on your actual inference serving hardware, then generate requests to meet your SLAs to come up with the optimal model configuration. You even get exported metrics and pretty reports for every configuration used/attempted. You can take the same exported package, change the SLA params, and it will automatically re-generate the configuration for you.
- Performance on a completely different level. TensorRT-LLM especially is extremely new and very early but already at high scale you can start to see > 10k RPS on a single node.
- gRPC support. Especially when using pre/post processing, ensemble, etc you can configure clients programmatically to use the individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available anywhere else. gRPC could probably be thought of as AsyncLLMEngine, it can abstract actual input/output or expose raw in/out so models, tokenizers, decoders, etc can send/receive raw data/numpy/tensors.
- DALI support[5]. Combined with everything above, you can add DALI in the processing chain to do things like take input image/audio/etc, copy to GPU once, GPU accelerate scaling/conversion/resampling/whatever, and get output.
vLLM and HF TGI are very cool and I use them in certain cases. The fact you can give them a HF model and they just fire up with a single command and offer good performance is very impressive but there are an untold number of reasons these providers use Triton. It's in a class of its own.
[0] - https://mistral.ai/news/la-plateforme/
[1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...
[2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...
[3] - https://github.com/triton-inference-server/model_navigator
[4] - https://github.com/triton-inference-server/client/blob/main/...
[5] - https://github.com/triton-inference-server/dali_backend
(I do plan on making my results open source here, but it's obviously still a work in progress)
data-preprocessing related posts
Index
What are some of the best open-source data-preprocessing projects? This list will help you:
Project | Stars | |
---|---|---|
1 | skrub | 1,010 |
2 | machinelearnjs | 536 |
3 | desbordante-core | 348 |
4 | convtools-ita | 183 |
5 | PANDAS-TUTORIAL | 158 |
6 | dali_backend | 117 |
7 | prosto | 89 |
8 | degradr | 11 |
Sponsored