Python Data

Open-source Python projects categorized as Data

Top 23 Python Data Projects

  • llama_index

    LlamaIndex is a data framework for your LLM applications

    Project mention: Show HN: Route your prompts to the best LLM | 2024-05-22
  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Prefect: A workflow orchestration tool for data pipelines | 2024-03-13
  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | 2024-05-13

    AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.

  • pandas-ai

    Chat with your database (SQL, CSV, pandas, polars, MongoDB, NoSQL, etc.). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

    Project mention: Using RAG to Build Your IDE Agents | 2024-06-18

    In this blog, we will build a powerful IDE agent for PandasAI using Dash Agent. Then later on, we'll understand how using RAG can significantly improve LLM responses.

  • chinese-xinhua

    Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data.

    Project mention: 25 Open Source AI Tools to Cut Your Development Time in Half | 2024-07-11

    Mage AI is a data transforming and integrating framework that allows data scientists and ML engineers to build and automate data pipelines without extensive coding. Data scientists can easily connect to their data sources, ingest data, and build production-ready data pipelines within Mage notebooks.

  • knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

  • Mimesis

    Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.

  • CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers many data portal sites.

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  • pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

  • PyPika

    PyPika is a Python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.

  • PyFunctional

    Python library for creating data pipelines with chain functional programming

    Project mention: Python: Uncovering the Overlooked Core Functionalities | 2023-07-24

    If you actually think this code is better there's a real library that does this:

  • mito

    The mitosheet package and other public Mito code.

    Project mention: Show HN: Excel to Python Compiler | 2024-05-23

    3. Tables that translate as pandas DataFrames. We support at most one table per sheet, and the tables must be contiguous. If the formulas in a column are consistent, then we will try to translate this as a single pandas statement.

    We do not support: pivot tables or complex formulas. When we fail to translate these, we generate TODO statements. We also don’t support graphs or macros - and you won’t see these reflected in the output at all currently.

    *Why we built this:*

    We did YC S20 and built an open source tool called Mito. It’s been a good journey since then - we’ve scaled revenue and grown to over 2k GitHub stars. But fundamentally, Mito is a tool that’s useful for Excel users who wanted to start writing Python code more effectively.

    We wanted to take another stab at the Excel -> Python pain point that was more developer focused - that helped developers that have to translate Excel files into Python do this much more quickly. Hence, Pyoneer!

    I’ll be in the comments today if you’ve got feedback, criticism, questions, or comments.

  • sketch

    AI code-writing assistant that understands data content

    Project mention: Ask HN: What have you built with LLMs? | 2024-02-05

    We've made a lot of data tooling things based on LLMs, and are in the process of rebranding and launching our main product.

    1. sketch (in notebook, ai for pandas)

    2. datadm (open source, "chat with data", with support for the open source LLMs)

    3. Our main product: julyp. (currently under very active rebrand and cleanup) -- but a "chat with data" style app, with a lot of specialized features. I'm also streaming myself using it (and sometimes building it) every weekday on Twitch to solve misc data problems.

    For your next question, about the stack and deploy:

  • mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

  • Colour

    Colour Science for Python

    Project mention: Tailwind Color Palette Generator | 2024-02-02

    Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others.

    How would you define what the perfect color tool is? I would guess like most tools that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management and perceptual measurements and conversions between standardized color spaces, but not the right tool for a web developer looking for a quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it’s too big and heavy of a hammer).

  • dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    Project mention: Show HN: Automatically extract data from APIs with dlt and OpenAPI | 2024-05-29

    - You always have the last say. The generated code is declarative and ready to hack in case we pick the wrong paginator or response entity.

    The tool and dlt are open source.

  • glom

    ☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️

    Project mention: Ask HN: How can I get better at writing production-level Python? | 2023-07-18
  • diffgram

    The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.

  • meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Data related posts

  • This Week In Python

    5 projects | 12 Jul 2024
  • Show HN: Automatically extract data from APIs with dlt and OpenAPI

    2 projects | 29 May 2024
  • How to Build a Chat App with Your Postgres Data using Agent Cloud

    3 projects | 13 May 2024
  • Cold-(Brew) Outreach: Landing my first big client at a coffee shop

    1 project | 30 Apr 2024
  • LlamaIndex: A data framework for your LLM applications

    1 project | 7 Apr 2024
  • LlamaIndex is a data framework for your LLM applications

    1 project | 28 Mar 2024
  • Ask HN: What have you built with LLMs?

    43 projects | 5 Feb 2024


What are some of the best open-source Data projects in Python? This list will help you:

Rank  Project                        Stars
   1  llama_index                   33,618
   2  Prefect                       15,413
   3  airbyte                       14,837
   4  pandas-ai                     11,983
   5  chinese-xinhua                10,768
   6  akshare                        8,724
   7  Mage                           7,469
   8  knowledge-repo                 5,452
   9  Mimesis                        4,341
  10  CKAN                           4,315
  11  datasets                       4,234
  12  TextRecognitionDataGenerator   3,149
  13  pandas-datareader              2,871
  14  PyPika                         2,438
  15  PyFunctional                   2,372
  16  mito                           2,249
  17  sketch                         2,210
  18  mara-pipelines                 2,059
  19  Colour                         2,029
  20  dlt                            2,009
  21  glom                           1,853
  22  diffgram                       1,818
  23  meltano                        1,697
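
The ranking above is ordered by GitHub stars. As a rough sketch in Python, here is one way a reader might parse rows in this "rank, project, stars" shape and check that ordering. The `parse_row` and `is_ordered_by_stars` helpers are hypothetical names introduced for illustration, and only a handful of rows copied from the table are included.

```python
# A few rows copied from the ranking table: "<rank> <project> <stars>".
ROWS = """\
1 llama_index 33,618
2 Prefect 15,413
3 airbyte 14,837
22 diffgram 1,818
23 meltano 1,697
"""


def parse_row(line):
    """Split one row into (rank, project, stars). Hypothetical helper."""
    rank, rest = line.split(maxsplit=1)        # leading rank number
    project, stars = rest.rsplit(maxsplit=1)   # trailing star count
    return int(rank), project, int(stars.replace(",", ""))


def is_ordered_by_stars(rows):
    """Return True if star counts never increase from one row to the next."""
    stars = [s for _, _, s in rows]
    return all(a >= b for a, b in zip(stars, stars[1:]))


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
print(is_ordered_by_stars(rows))
```

Splitting from both ends keeps the parser independent of how many spaces separate the columns, so it works on either a flattened or an aligned copy of the table.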
