Top 13 Python Data processing Projects
AI Vector Database for LLMs/LangChain. Doubles as a Data Lake for Deep Learning. Store, query, version, & visualize any data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Project mention: [P] I built a Chatbot to talk with any Github Repo. 🪄 | reddit.com/r/MachineLearning | 2023-04-29
This repository contains two Python scripts that demonstrate how to create a chatbot using Streamlit, OpenAI GPT-3.5-turbo, and Activeloop's Deep Lake. The chatbot searches a dataset stored in Deep Lake to find relevant information and generates responses based on the user's input.
A light-weight, flexible, and expressive statistical data testing library
Project mention: Unit testing functions that input/output dataframes? | reddit.com/r/datascience | 2023-03-05
I use Pandera, so I just need to define the expected input/output schemas (i.e. column names, types, and constraints on them), and Pandera automatically generates fake data for the unit tests, and validates the result: https://github.com/unionai-oss/pandera
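The workflow described above (declare the expected output schema, then validate a function's result in a unit test) can be sketched in plain Python. This is an illustration of the idea only and is not Pandera's API: `ColumnSpec`, `validate`, `OUTPUT_SCHEMA`, and `add_totals` are hypothetical names, rows are plain dicts rather than dataframes, and the automatic fake-data generation mentioned in the comment is not reproduced.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical column rule: an expected type plus an optional constraint."""
    dtype: type
    check: Callable[[Any], bool] = lambda value: True


def validate(rows: list[dict], schema: dict[str, ColumnSpec]) -> list[str]:
    """Return human-readable schema violations; an empty list means valid."""
    errors = []
    for i, row in enumerate(rows):
        # Report columns the schema requires but the row lacks.
        for name in sorted(schema.keys() - row.keys()):
            errors.append(f"row {i}: missing column {name!r}")
        # Check type first, then the constraint, for every present column.
        for name, spec in schema.items():
            if name not in row:
                continue
            value = row[name]
            if not isinstance(value, spec.dtype):
                errors.append(f"row {i}: {name!r} expected {spec.dtype.__name__}")
            elif not spec.check(value):
                errors.append(f"row {i}: {name!r} failed its constraint")
    return errors


# Expected output schema: column names, types, and constraints.
OUTPUT_SCHEMA = {
    "customer": ColumnSpec(str),
    "total": ColumnSpec(float, lambda v: v >= 0),
}


def add_totals(rows: list[dict]) -> list[dict]:
    """Function under test: computes a total per input row."""
    return [{"customer": r["customer"], "total": r["price"] * r["qty"]} for r in rows]


result = add_totals([{"customer": "a", "price": 2.5, "qty": 2}])
assert validate(result, OUTPUT_SCHEMA) == []
```

With Pandera itself, the schema would be declared through the library's own schema classes and checked against a real dataframe; the sketch only shows why a declared schema keeps the unit test short.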
Large-scale pretraining for dialogue
Project mention: Just a thought | reddit.com/r/replika | 2023-02-08
Large-scale pretrained models for goal-directed dialog
Project mention: Fine-tuning on Sales data? | reddit.com/r/OpenAI | 2023-04-19
I would use something like GODEL for something like this. https://github.com/microsoft/GODEL
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
Project mention: Lithops: A multi-cloud framework for embarrassingly parallel jobs | news.ycombinator.com | 2023-01-14
Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
convtools is a Python library for declaratively defining conversions that process collections and perform complex aggregations and joins.
A full data warehouse infrastructure with ETL pipelines running inside Docker: Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Project mention: Show HN: PRQL 0.2 – Releasing a better SQL | news.ycombinator.com | 2022-06-27
> Joins are what makes relational modeling interesting!
It is the central part of the relational model (RM), one that is difficult to model using other methods and that requires high expertise in non-trivial use cases. One way to analyze multiple tables without joins is proposed in the concept-oriented model [1], which relies on two equal modeling constructs: sets (as in RM) and functions. In particular, it is implemented in the Prosto data processing toolkit [2] and its Column-SQL language. The idea is that links between tables are used instead of joins. A link is formally a function from one set to another set.
[1] Joins vs. Links or Relational Join Considered Harmful https://www.researchgate.net/publication/301764816_Joins_vs_...
[2] https://github.com/asavinov/prosto data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
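To make the "link is a function from one set to another" idea concrete, here is a minimal plain-Python sketch. It does not use Prosto or Column-SQL, and all names and data (`customers`, `orders`, `customer_of`, `country_of`, `total_by_country`) are hypothetical. Instead of materializing a joined table, each order is mapped to its customer by a function, and derived columns and aggregates are defined by composing functions.

```python
# Two "sets" (tables). Customers are keyed by id so a link can resolve them.
customers = {
    "c1": {"name": "Ada", "country": "UK"},
    "c2": {"name": "Linus", "country": "FI"},
}
orders = [
    {"id": 1, "customer_id": "c1", "amount": 30.0},
    {"id": 2, "customer_id": "c2", "amount": 12.5},
    {"id": 3, "customer_id": "c1", "amount": 7.5},
]


def customer_of(order: dict) -> dict:
    """Link: a function from the set of orders to the set of customers."""
    return customers[order["customer_id"]]


def country_of(order: dict) -> str:
    """Derived column: the link composed with an attribute lookup."""
    return customer_of(order)["country"]


def total_by_country(order_rows: list[dict]) -> dict[str, float]:
    """Aggregate order amounts per country without building a joined table."""
    totals: dict[str, float] = {}
    for order in order_rows:
        key = country_of(order)
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return totals


totals = total_by_country(orders)
```

The relational equivalent would join orders to customers on `customer_id` and then group by `country`; here the join step is replaced by following the link per row.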
Sweet data-centric foundation model fine-tuning
Project mention: Fondant: Easily build and share datasets for foundation model fine-tuning | news.ycombinator.com | 2023-05-23
A framework for rapid development of robust data pipelines following a simple design pattern
Project mention: Show HN: SmartPipeline, robust and light data pipelines in Python | news.ycombinator.com | 2023-05-03
simple functional pipes
Experimental headless data wrangling / refining tool over MongoDB, inspired by OpenRefine
Project mention: Mongorefine: Headless data wrangling/refining with NoSQL data using MongoDB | news.ycombinator.com | 2022-09-20
Python Data processing related posts
Fondant: Easily build and share datasets for foundation model fine-tuning
1 project | news.ycombinator.com | 23 May 2023
Fine-tuning on Sales data?
1 project | reddit.com/r/OpenAI | 19 Apr 2023
Lithops: A multi-cloud framework for embarrassingly parallel jobs
1 project | news.ycombinator.com | 14 Jan 2023
Godel: Large-Scale Pre-Training for Goal-Directed Dialog
1 project | news.ycombinator.com | 27 Jun 2022
Microsoft AI Researchers Open-Source 'GODEL': A Large Scale Pre-Trained Language Model For Dialog
1 project | reddit.com/r/singularity | 26 Jun 2022
No-Code Self-Service BI/Data Analytics Tool
1 project | news.ycombinator.com | 13 Nov 2021
Show HN: Hamilton, a Microframework for Creating Dataframes
6 projects | news.ycombinator.com | 8 Nov 2021
What are some of the best open-source Data processing projects in Python? This list will help you: