data-preprocessing

Open-source projects categorized as data-preprocessing

Top 8 data-preprocessing Open-Source Projects

  • skrub

    Prepping tables for machine learning

  • Project mention: Are there any Python libraries for Data Cleansing ? | /r/dataengineering | 2023-12-08
  • machinelearnjs

    Machine Learning library for the web and Node.

  • desbordante-core

    Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It can also run data-cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

  • Project mention: Show HN: Desbordante 1.0.0 Released | news.ycombinator.com | 2023-12-11
  • convtools-ita

    convtools is a Python library for declaratively defining conversions that process collections and perform complex aggregations and joins.

  • PANDAS-TUTORIAL

    Jupyter Notebooks and Data Sets for Pandas Library (by TirendazAcademy)

  • dali_backend

    The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.

  • Project mention: Ollama releases OpenAI API compatibility | news.ycombinator.com | 2024-02-08

    - While keeping power utilization below X

    They will take the exported model and dynamically deploy the package to a triton instance running on your actual inference serving hardware, then generate requests to meet your SLAs to come up with the optimal model configuration. You even get exported metrics and pretty reports for every configuration used/attempted. You can take the same exported package, change the SLA params, and it will automatically re-generate the configuration for you.

    - Performance on a completely different level. TensorRT-LLM especially is extremely new and very early but already at high scale you can start to see > 10k RPS on a single node.

    - gRPC support. Especially when using pre/post processing, ensemble, etc you can configure clients programmatically to use the individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available anywhere else. gRPC could probably be thought of as AsyncLLMEngine, it can abstract actual input/output or expose raw in/out so models, tokenizers, decoders, etc can send/receive raw data/numpy/tensors.

    - DALI support[5]. Combined with everything above, you can add DALI in the processing chain to do things like take input image/audio/etc, copy to GPU once, GPU accelerate scaling/conversion/resampling/whatever, and get output.

    vLLM and HF TGI are very cool and I use them in certain cases. The fact that you can give them a HF model and they just fire up with a single command and offer good performance is very impressive, but there are an untold number of reasons these providers use Triton. It's in a class of its own.

    [0] - https://mistral.ai/news/la-plateforme/

    [1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...

    [2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...

    [3] - https://github.com/triton-inference-server/model_navigator

    [4] - https://github.com/triton-inference-server/client/blob/main/...

    [5] - https://github.com/triton-inference-server/dali_backend

  • prosto

    Prosto is a data processing toolkit that radically changes how data is processed by relying heavily on functions and operations with functions, as an alternative to map-reduce and join-groupby.

  • degradr

    Python library for realistically degrading images.

  • Project mention: How to generate realistic PSFs for camera lenses? | /r/Optics | 2023-09-07

    (I do plan on making my results open source here, but it's obviously still a work in progress)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

data-preprocessing related posts

  • Framework for Data ETL with multiple export templates ?

    1 project | /r/Python | 14 Jul 2021
  • convtools - define conversions, aggregations and joins in functional style (ad-hoc code generation)

    1 project | /r/Python | 8 Jul 2021
  • Writing concise functional code in python

    2 projects | /r/Python | 6 Jul 2021

Index

What are some of the best open-source data-preprocessing projects? This list will help you:

#   Project            Stars
1   skrub              1,010
2   machinelearnjs       536
3   desbordante-core     348
4   convtools-ita        183
5   PANDAS-TUTORIAL      158
6   dali_backend         117
7   prosto                89
8   degradr               11
