Python Data

Open-source Python projects categorized as Data

Top 23 Python Data Projects

  1. llama_index

    LlamaIndex is the leading framework for building LLM-powered agents over your data.

    Project mention: Complete Large Language Model (LLM) Learning Roadmap | dev.to | 2025-04-11

    Resource: LlamaIndex Documentation

  2. InfluxDB

    InfluxDB – Built for High-Performance Time Series Workloads. InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.

  3. pandas-ai

    Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

    Project mention: PandaAI: Talk to Your Data, Not to Your Code! | dev.to | 2025-05-06

  4. Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02

    - https://github.com/PrefectHQ/prefect

  5. airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Personal Picks: Data Product News (April 16, 2025) | dev.to | 2025-04-15

  6. akshare

    AKShare is an elegant and simple open-source financial data interface library for Python, built for human beings! (by akfamily)

  7. chinese-xinhua

    📙 Chinese Xinhua Dictionary database. Includes xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.

  8. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  9. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  10. knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

  11. superduper

    Superduper: End-to-end framework for building custom AI applications and agents.

    Project mention: Build fully portable AI applications on top of Snowflake with SuperDuperDB | dev.to | 2024-06-26

    Customize how AI and databases work together. Scale your AI projects to handle more data and users. Move AI projects between different environments easily. Extend the system with new AI features and database functionality. Check it out: Blog: https://blog.superduperdb.com/version-02 Github: https://github.com/SuperDuperDB/superduperdb (leave us a star ⭐️🥳)

  12. CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

    Project mention: CKAN – The open source data management system | news.ycombinator.com | 2024-12-04

  13. Mimesis

    Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.

    Project mention: Mimesis: The Fake Data Generator That Will Blow Your Mind! | dev.to | 2025-05-08

  14. datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  15. cognita

    RAG (Retrieval Augmented Generation) Framework for building modular, open source applications for production by TrueFoundry

    Project mention: Lists of open-source frameworks for building RAG applications | dev.to | 2025-01-02

    Ideal For: Enterprises seeking a robust framework for large-scale AI applications. GitHub Repository

  16. dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    Project mention: Data Loading Tool | news.ycombinator.com | 2024-12-14

  17. TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  18. preswald

    Preswald is a WASM packager for Python-based interactive data apps: bundle full complex data workflows, particularly visualizations, into single files, runnable completely in-browser, using Pyodide, DuckDB, Pandas, and Plotly, Matplotlib, etc. Build dashboards, reports, and notebooks that run offline, load fast, and share like a document.

    Project mention: Revolutionizing Data Apps: Build Interactive Dashboards with Just Python! | dev.to | 2025-03-19

  19. pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

  20. PyPika

    PyPika is a python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.

    Project mention: FastAPI, Pydantic, Psycopg3: the holy trinity for Python web APIs | dev.to | 2024-10-24

    PyPika: I don't know much about this one.

  21. PyFunctional

    Python library for creating data pipelines with chain functional programming

  22. Colour

    Colour Science for Python

    Project mention: How much oranger do red orange bags make oranges look? | news.ycombinator.com | 2025-04-14

    There are also color science packages like this one that let you do conversions to various spaces - https://www.colour-science.org/

  23. sketch

    AI code-writing assistant that understands data content

  24. mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

  25. meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Data related posts

  • Mimesis: The Fake Data Generator That Will Blow Your Mind!

    1 project | dev.to | 8 May 2025
  • Personal Picks: Data Product News (April 16, 2025)

    1 project | dev.to | 15 Apr 2025
  • airbyte VS cocoindex - a user suggested alternative

    2 projects | 1 Apr 2025
  • Automate structured data extraction from PDF / Word by OpenAI and CocoIndex

    4 projects | dev.to | 28 Mar 2025
  • Revolutionizing Data Apps: Build Interactive Dashboards with Just Python!

    1 project | dev.to | 19 Mar 2025
  • Glom – Restructuring data, the Python way

    1 project | news.ycombinator.com | 25 Feb 2025
  • Quick tip: Replace MongoDB® Atlas with SingleStore Kai in LlamaIndex

    1 project | dev.to | 21 Jan 2025

Index

What are some of the best open-source Data projects in Python? This list will help you:

#   Project                       Stars
1   llama_index                   41,591
2   pandas-ai                     20,042
3   Prefect                       19,241
4   airbyte                       18,103
5   akshare                       11,722
6   chinese-xinhua                11,112
7   Mage                          8,312
8   knowledge-repo                5,518
9   superduper                    5,058
10  CKAN                          4,703
11  Mimesis                       4,566
12  datasets                      4,409
13  cognita                       4,053
14  dlt                           3,615
15  TextRecognitionDataGenerator  3,469
16  preswald                      3,648
17  pandas-datareader             3,031
18  PyPika                        2,673
19  PyFunctional                  2,447
20  Colour                        2,269
21  sketch                        2,257
22  mara-pipelines                2,082
23  meltano                       2,054
