Python Data processing

Open-source Python projects categorized as Data processing

Top 13 Python Data processing Projects

  • deeplake

    AI Vector Database for LLMs/LangChain. Doubles as a Data Lake for Deep Learning. Store, query, version, & visualize any data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

    Project mention: [P] I built a Chatbot to talk with any Github Repo. 🪄 | reddit.com/r/MachineLearning | 2023-04-29

    This repository contains two Python scripts that demonstrate how to create a chatbot using Streamlit, OpenAI GPT-3.5-turbo, and Activeloop's Deep Lake. The chatbot searches a dataset stored in Deep Lake to find relevant information and generates responses based on the user's input.

  • pandera

    A light-weight, flexible, and expressive statistical data testing library

    Project mention: Unit testing functions that input/output dataframes? | reddit.com/r/datascience | 2023-03-05

    I use Pandera, so I just need to define the expected input/output schemas (i.e. column names, types, and constraints on them), and Pandera automatically generates fake data for the unit tests, and validates the result: https://github.com/unionai-oss/pandera

  • DialoGPT

    Large-scale pretraining for dialogue

    Project mention: Just a thought | reddit.com/r/replika | 2023-02-08

  • GODEL

    Large-scale pretrained models for goal-directed dialog

    Project mention: Fine-tuning on Sales data? | reddit.com/r/OpenAI | 2023-04-19

    I would use something like GODEL for something like this. https://github.com/microsoft/GODEL

  • lithops

    A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

    Project mention: Lithops: A multi-cloud framework for embarrassingly parallel jobs | news.ycombinator.com | 2023-01-14

  • forte

    Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/

  • convtools-ita

    convtools is a Python library for declaratively defining conversions that process collections and perform complex aggregations and joins.

  • Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.

  • prosto

    Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

    Project mention: Show HN: PRQL 0.2 – Releasing a better SQL | news.ycombinator.com | 2022-06-27

    > Joins are what makes relational modeling interesting!

    It is the central part of RM which is difficult to model using other methods and which requires high expertise in non-trivial use cases. One alternative to how multiple tables can be analyzed without joins is proposed in the concept-oriented model [1] which relies on two equal modeling constructs: sets (like RM) and functions. In particular, it is implemented in the Prosto data processing toolkit [2] and its Column-SQL language. The idea is that links between tables are used instead of joins. A link is formally a function from one set to another set.

    [1] Joins vs. Links or Relational Join Considered Harmful https://www.researchgate.net/publication/301764816_Joins_vs_...

    [2] https://github.com/asavinov/prosto
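
    The "links instead of joins" idea in the comment above can be sketched in plain Python. This is only an illustration under stated assumptions: the tables, field names, and helper functions below are hypothetical, and the code does not use Prosto's API or its Column-SQL language. It shows a link modeled as a function from one set (orders) to another (customers), with a derived column and an aggregation built by composing functions rather than materializing a joined table.

```python
# Hypothetical data for illustration only; this is NOT Prosto's API or Column-SQL.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 30.0},
    {"order_id": 2, "customer_id": "c2", "amount": 12.5},
    {"order_id": 3, "customer_id": "c1", "amount": 7.5},
]
customers = {
    "c1": {"name": "Ada", "country": "UK"},
    "c2": {"name": "Linus", "country": "FI"},
}


def customer_of(order):
    """Link: a function from the set of orders to the set of customers."""
    return customers[order["customer_id"]]


def country_of(order):
    """Derived column: the link composed with an attribute lookup."""
    return customer_of(order)["country"]


def total_by_country(rows):
    """Aggregate order amounts per country by following the link for each row."""
    totals = {}
    for row in rows:
        key = country_of(row)
        totals[key] = totals.get(key, 0.0) + row["amount"]
    return totals


totals = total_by_country(orders)
print(totals)
```

    No joined orders-customers table is ever built here; each order simply follows its link when a customer attribute is needed.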

  • fondant

    Sweet data-centric foundation model fine-tuning

    Project mention: Fondant: Easily build and share datasets for foundation model fine-tuning | news.ycombinator.com | 2023-05-23

  • SmartPipeline

    A framework for rapid development of robust data pipelines following a simple design pattern

    Project mention: Show HN: SmartPipeline, robust and light data pipelines in Python | news.ycombinator.com | 2023-05-03

  • pipe21

    simple functional pipes

  • mongorefine

    Experimental headless data wrangling / refining tool over MongoDB, inspired by OpenRefine

    Project mention: Mongorefine: Headless data wrangling/refining with NoSQL data using MongoDB | news.ycombinator.com | 2022-09-20

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-05-23.

Index

What are some of the best open-source Data processing projects in Python? This list will help you:

#   Project                 Stars
1   deeplake                5,991
2   pandera                 2,339
3   DialoGPT                2,159
4   GODEL                     684
5   lithops                   272
6   forte                     218
7   convtools-ita             181
8   Skytrax-Data-Warehouse    120
9   prosto                     81
10  fondant                    44
11  SmartPipeline              13
12  pipe21                      7
13  mongorefine                 1