Python Data

Open-source Python projects categorized as Data

Top 23 Python Data Projects

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: self hosted Alternative to | /r/selfhosted | 2022-12-30
  • airbyte

    Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.

    Project mention: airbyte VS cloudquery - a user suggested alternative | | 2023-06-02
  • chinese-xinhua

    Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! Open-source financial data interface library (by akfamily)

  • knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How do you document a ML research? | /r/mlops | 2022-08-30

    While a start, I feel that just being a markdown editor is not enough; GitHub and GitLab already have this sort of wiki. I feel something like provides a better experience, since it gives an incentive for Data Scientists to make their source notebook well documented, and to be an SSoT. With a Wiki like, if you change something on the original project, you need to remind yourself to update your reports. If your notebook is in itself your report, that's not necessary. Plus, it would benefit from the Semantic Diffs that DagsHub has already implemented.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data.

    Project mention: Data sources episode 2: AWS S3 to Postgres Data Sync using Singer | | 2023-05-04

  • Mimesis

    Mimesis is a robust data generator for Python, capable of rapidly producing large volumes of synthetic data for various use cases.

    Project mention: Mimesis allows you to easily generate detailed dummy datasets. | /r/datascience | 2023-04-12

    Mimesis has well-structured and comprehensive documentation.

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

    Project mention: TensorFlow Datasets (TFDS): a collection of ready-to-use datasets | | 2022-12-21

    I tried Librispeech, a very common dataset for speech recognition, in both HF and TFDS.

    TFDS performed extremely badly.

    First it failed because the official hosting server only allows 5 simultaneous connections, and TFDS totally ignores that and makes up to 50 simultaneous downloads, which breaks. I wonder if anyone actually tested this?

    Then you need to have some computer with 30GB to do the preparation, which might fail on your computer. This is where I stopped. It might be fixed now but it took them 8 months to respond to my issue.

    On HF, it just worked. There was a smaller issue in how the dataset was split up but that is fixed now, and their response was very fast and great.
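
The connection limit mentioned in the comment above (5 simultaneous connections against up to 50 parallel downloads) is the kind of problem a bounded worker pool avoids. The following is a minimal sketch using only the Python standard library, not how TFDS or HF actually download data. The names `fetch_all` and `fake_fetch` are hypothetical, and `fake_fetch` only simulates a download so the cap can be observed.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

MAX_CONNECTIONS = 5  # the server-side limit quoted in the comment above


def fetch_all(shards, fetch, max_connections=MAX_CONNECTIONS):
    """Call fetch(shard) for every shard, with at most max_connections running at once."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        # pool.map returns results in the same order as the input shards.
        return list(pool.map(fetch, shards))


# Hypothetical stand-in for a real download; it records how many calls overlap.
_lock = threading.Lock()
_active = 0
peak_active = 0


def fake_fetch(shard):
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # pretend to transfer data
    with _lock:
        _active -= 1
    return f"{shard}: ok"


results = fetch_all([f"shard-{i}" for i in range(50)], fake_fetch)
print(len(results), "shards fetched, peak concurrency:", peak_active)
```

Sizing the pool to the server's limit keeps all 50 shards queued while never opening more than 5 connections at a time.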

  • CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data, and it powers many sites.

    Project mention: Metadata Store - Which one to Choose ? OpenMetadata vs Datahub ? | /r/dataengineering | 2022-11-17

    We use Kubernetes as our deployment platform. Any feedback on one of these open source data catalogs?

  • flyte

    Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

    Project mention: Orchestration: Thoughts on Dagster, Airflow and Prefect? | /r/dataengineering | 2023-06-01

    Anyone tried Flyte?

  • TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  • pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: Seeking recommendations for forex economic data API | /r/algotrading | 2023-05-03

    I've looked at and it looks good, does anyone have experience?

  • PyFunctional

    Python library for creating data pipelines with chain functional programming

    Project mention: Kotlin/Java/Javascript/Scala users do you miss the ability to chain functional operators in Python? | /r/learnpython | 2022-06-05

    I often end up using pyfunctional, which gives the ability to use chained functional operators, but I am still kind of sad this is not a builtin solution in Python (please note I am not involved in the pyfunctional project and I do not know the author)
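
To illustrate what "chained functional operators" means in the comment above, here is a minimal standard-library sketch. The `Chain` class is hypothetical and written only for this example; it is not the PyFunctional API, which offers a much richer set of operations.

```python
from functools import reduce


class Chain:
    """Hypothetical minimal wrapper that lets map/filter/reduce be chained left to right."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, fn):
        # Apply fn to every element and return a new Chain.
        return Chain(fn(x) for x in self._items)

    def filter(self, pred):
        # Keep only the elements for which pred returns True.
        return Chain(x for x in self._items if pred(x))

    def reduce(self, fn, initial):
        # Fold the elements into a single value.
        return reduce(fn, self._items, initial)

    def to_list(self):
        return list(self._items)


# Reads in the order the steps happen: keep evens, then square them.
evens_squared = Chain(range(1, 7)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = evens_squared.reduce(lambda acc, x: acc + x, 0)
print(evens_squared.to_list(), total)
```

With built-ins alone, the same pipeline would be written inside-out, as `map(..., filter(..., range(1, 7)))`, which is the readability gap the comment is describing.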

  • PyPika

    PyPika is a python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.

  • mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

  • sketch

    AI code-writing assistant that understands data content

    Project mention: Pandas AI – The Future of Data Analysis | | 2023-05-17

    This morning I added a "Related Projects" [3] Section to the Buckaroo docs. If Buckaroo doesn't solve your problem, look at one of the other linked projects (like Mito).




  • monorepo

    The mitosheet package and other public Mito code.

    Project mention: Pandas AI – The Future of Data Analysis | | 2023-05-17

    I think the biggest area for growth for LLM based tools for data analysis is around helping users _understand what edits they actually made_.

    I'm a co-founder of a non-AI data code-gen tool for data analysis -- but we also have a basic version of an LLM integration. The problem we see with tooling like Pandas AI (in practice! with real users at enterprises!) is that users make an edit like "remove NaN values" and then get a new dataframe -- but they have no way of checking if the edited dataframe is actually what they want. Maybe the LLM removed NaN values. Maybe it just deleted some random rows!

    The key here: how can users build an understanding of how their data changed, and confirm that the changes made by the LLM are the changes they wanted. In other words, recon!

    We've been experimenting more with this recon step in the AI flow. It takes a similar approach to the top comment (passing a subset of the data to the LLM), and then really focuses in the UI around "what changes were made." There's a lot of opportunity for growth here, I think!

    Any/all feedback appreciated :)
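
The "recon" idea in the comment above, checking whether an edit such as "remove NaN values" removed only what it should, can be sketched with plain Python. This is an illustrative sketch under assumptions, not Mito's implementation: the `recon` and `is_missing` helpers, the row-dict data layout, and the `id` key column are all hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def recon(before, after, key="id"):
    """Compare two lists of row dicts and summarize what the edit removed."""
    kept = {row[key] for row in after}
    removed = [row for row in before if row[key] not in kept]
    # Rows that were deleted even though they had no missing value.
    unexpected = [r for r in removed if not any(is_missing(v) for v in r.values())]
    # Rows that still contain a missing value after the edit.
    leftover = [r for r in after if any(is_missing(v) for v in r.values())]
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "removed_ids": [r[key] for r in removed],
        "removed_without_nan": [r[key] for r in unexpected],
        "remaining_with_nan": [r[key] for r in leftover],
    }


before = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": float("nan")},
    {"id": 3, "price": 4.0},
    {"id": 4, "price": None},
]
# A faulty edit: it dropped row 2 (NaN) but also row 3 (clean), and kept row 4 (missing).
after = [{"id": 1, "price": 9.5}, {"id": 4, "price": None}]

report = recon(before, after)
print(report)
```

A report like this gives the user something concrete to confirm: which rows disappeared, which of those were clean, and which missing values survived the edit.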

  • Colour

    Colour Science for Python

    Project mention: Colorists work in Resolve? | /r/colorists | 2023-04-30

    People that work on resolve also work on which is most REFERENCE code and complete support there is.

  • diffgram

    Integrate Human Supervision into your Platform. For all Training Data Types, Image, Video, 3D, Text, Geo, Audio, Compound, Grid, LLM, GPT, Conversational, and more.

    Project mention: Open source tool scrapes git commit email addresses to send spam to. | /r/programming | 2022-07-29
  • glom

    ☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️ (by mahmoud)

  • Cubes

    Light-weight Python OLAP framework for multi-dimensional data analysis

  • pycm

    Multi-class confusion matrix library in Python

    Project mention: PyCM 3.9 Released: Log-loss Support | /r/coolgithubprojects | 2023-05-04
  • pyjanitor

    Clean APIs for data cleaning. Python implementation of R package Janitor

    Project mention: Sub library with useful code | /r/learnpython | 2023-05-19

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-06-02.

What are some of the best open-source Data projects in Python? This list will help you:

Rank  Project                        Stars
   1  Prefect                       12,036
   2  airbyte                       10,796
   3  chinese-xinhua                10,079
   4  akshare                        6,578
   5  knowledge-repo                 5,328
   6  Mage                           4,715
   7  Mimesis                        3,972
   8  datasets                       3,885
   9  CKAN                           3,823
  10  flyte                          3,420
  11  TextRecognitionDataGenerator   2,682
  12  pandas-datareader              2,656
  13  PyFunctional                   2,176
  14  PyPika                         2,023
  15  mara-pipelines                 2,003
  16  sketch                         1,785
  17  monorepo                       1,730
  18  Colour                         1,708
  19  diffgram                       1,706
  20  glom                           1,697
  21  Cubes                          1,489
  22  pycm                           1,382
  23  pyjanitor                      1,147