C++ data-preprocessing Projects

desbordante-core

Mentions: 2 | Stars: 355 | Activity: 9.5 | Language: C++

Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data-cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.

Project mention: Show HN: Desbordante 1.0.0 Released | news.ycombinator.com | 2023-12-11

dali_backend

Mentions: 1 | Stars: 117 | Activity: 6.8 | Language: C++

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.

Project mention: Ollama releases OpenAI API compatibility | news.ycombinator.com | 2024-02-08

- While keeping power utilization below X
They will take the exported model and dynamically deploy the package to a Triton instance running on your actual inference serving hardware, then generate requests to meet your SLAs to come up with the optimal model configuration. You even get exported metrics and pretty reports for every configuration used/attempted. You can take the same exported package, change the SLA params, and it will automatically re-generate the configuration for you.
- Performance on a completely different level. TensorRT-LLM especially is extremely new and very early but already at high scale you can start to see > 10k RPS on a single node.
- gRPC support. Especially when using pre/post-processing, ensembles, etc., you can configure clients programmatically to use the individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available anywhere else. gRPC could probably be thought of as AsyncLLMEngine; it can abstract actual input/output or expose raw in/out so models, tokenizers, decoders, etc. can send/receive raw data/numpy/tensors.
- DALI support[5]. Combined with everything above, you can add DALI in the processing chain to do things like take input image/audio/etc, copy to GPU once, GPU accelerate scaling/conversion/resampling/whatever, and get output.
vLLM and HF TGI are very cool and I use them in certain cases. The fact you can give them a HF model and they just fire up with a single command and offer good performance is very impressive, but there are an untold number of reasons these providers use Triton. It's in a class of its own.
[0] - https://mistral.ai/news/la-plateforme/
[1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...
[2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...
[3] - https://github.com/triton-inference-server/model_navigator
[4] - https://github.com/triton-inference-server/client/blob/main/...
[5] - https://github.com/triton-inference-server/dali_backend

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

#	Project	Stars
1	desbordante-core	355
2	dali_backend	117
