Top 23 Data Open-Source Projects

  • TanStack Query

    🤖 Powerful asynchronous state management, server-state utilities and data fetching for the web. TS/JS, React Query, Solid Query, Svelte Query and Vue Query.

    Project mention: Is React Having An Angular.js Moment? | /r/javascript | 2023-06-05

    Yeah, the post links to this issue. While it does look like there are some small issues around server components, the bottom of the thread seems to be resolved, and it looks like this won't be an issue going forwards.

  • SheetJS js-xlsx

    📗 SheetJS Spreadsheet Data Toolkit -- New home

    Project mention: What kind of Programmer / language should I be looking for? | /r/AskProgramming | 2023-05-04

    Sure. I manipulate Excel files programmatically in the browser all the time. I don't really understand your exact workflow, but I use JavaScript with xlsx and React.

  • Metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:

    Project mention: Ask HN: Open-Source Self-Hosted No-Code Platforms? | | 2023-05-13

    The solution really depends on what sort of problems you are trying to solve and who your customers are.

    There are a fair few low-code solutions out there for reporting and data visualisation that are great for finance and marketing teams for example. e.g. ,

    For multipurpose SMB workflows and organisational processes, I have used n8n in the recent past and found it was quite good and incredibly easy to maintain.

    For enterprise processes I'd go with Camunda (solely based on recommendations and not first-hand experience), although only parts of their platform are OSS.

    Bear in mind that some of these are not suitable if you want to build something that competes with them while taking their OSS code, but they are perfectly fine otherwise.

  • SWR

    React Hooks for Data Fetching

    Project mention: Best options right now for fetching data in next/react ? | /r/reactjs | 2023-06-04

  • data

    Data and code behind the articles and graphics at FiveThirtyEight

    Project mention: [Effortpost] Advanced stats on which players are contributing the most to the Heat's playoff run. | /r/heat | 2023-05-24

    To answer these questions I decided to look at 538’s RAPTOR ratings. RAPTOR uses player tracking data to estimate how much each player contributes on the offensive and defensive ends. The total RAPTOR score should be something like the “number of points a player contributes to his team’s offense and defense per 100 possessions, relative to a league-average player.” Higher is better; the best during the regular season has been Nikola Jokic at +14. You can read more about it here or play with an interactive tool on their website here. I don’t really care about the details of why it’s a good statistic, but it seems pretty helpful and, most importantly for my purposes, you can download the data here for free.

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: self hosted Alternative to | /r/selfhosted | 2022-12-30

  • awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

    Project mention: GIS data for a project. I apologize for the banality of my request and for my English. | /r/datasets | 2023-03-15

  • airbyte

    Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.

    Project mention: airbyte VS cloudquery - a user suggested alternative | | 2023-06-02

  • chinese-xinhua

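    :orange_book: Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.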

  • faker

    Generate massive amounts of fake data in the browser and node.js (by faker-js)

    Project mention: Jsonformer: A bulletproof way to generate structured output from LLMs | | 2023-05-02

    Something like this should be integrated with library like

  • semantic-source

    Parsing, analyzing, and comparing source code across many languages

    Project mention: How to Get Started with Tree-Sitter | | 2023-05-28

    ah, easy. it's because support has not been added into which is the tech that powers the GitHub UI. Adding support is pretty easy/mainly glue code [1] that imports the tree sitter API.


  • Bogus

    :card_index: A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

    Project mention: Best practices for organising Mock Data & Repositories in Testing | /r/dotnet | 2023-06-02

    To do all this you need test data though, and that's where Bogus and Auto Bogus come in handy. They both generate semi-random test data that you can use to populate whatever method you've decided to use. You can set up rules for specific fields, and they have built-in generators for common things like names and addresses. Auto Bogus can be used to populate large/complicated objects with data automatically (it can be slow if you don't use .WithRecursiveDepth() or .WithTreeDepth()).
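
The rule-per-field idea described in the comment above can be sketched in a few lines of Python. This is only an illustration of the concept, not Bogus's C# API: `FakeRecordBuilder`, `rule_for`, `build_customers`, and the sample name and street lists are all hypothetical names made up for this example. A fixed seed is assumed so that the "semi-random" data is reproducible across test runs.

```python
import random
from typing import Callable, Dict, List

# Hypothetical rule type: takes a seeded RNG and returns one field value.
Rule = Callable[[random.Random], object]

# Made-up sample values standing in for built-in name/address generators.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
STREETS = ["Main St", "Oak Ave", "Pine Rd"]


class FakeRecordBuilder:
    """Collects one rule per field and generates records from those rules."""

    def __init__(self, seed: int) -> None:
        # A dedicated RNG keeps the output reproducible for a given seed.
        self._rng = random.Random(seed)
        self._rules: Dict[str, Rule] = {}

    def rule_for(self, field: str, rule: Rule) -> "FakeRecordBuilder":
        # Register (or overwrite) the rule for a field; return self for chaining.
        self._rules[field] = rule
        return self

    def generate(self, count: int) -> List[Dict[str, object]]:
        # Apply every registered rule once per record.
        return [
            {field: rule(self._rng) for field, rule in self._rules.items()}
            for _ in range(count)
        ]


def build_customers(seed: int, count: int) -> List[Dict[str, object]]:
    """Build `count` fake customer records using three field rules."""
    builder = (
        FakeRecordBuilder(seed)
        .rule_for("name", lambda rng: rng.choice(FIRST_NAMES))
        .rule_for("address", lambda rng: f"{rng.randint(1, 999)} {rng.choice(STREETS)}")
        .rule_for("age", lambda rng: rng.randint(18, 90))
    )
    return builder.generate(count)
```

Calling `build_customers(seed=42, count=5)` twice returns identical lists, which is usually what you want when fake data feeds repeatable tests.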

  • prql

    PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

    Project mention: Machine Learning, Linear Algebra, and More: Is SQL All You Need? | /r/ProgrammingLanguages | 2023-05-20

    Compared to PRQL, which someone mentioned in the comments, the syntax on that one is a bit hairy.

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

  • Snowplow

    The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP

    Project mention: Open-source data collection & modeling platform for product analytics | | 2022-08-30

    We’ve also thought about Ops :-). There’s a backend 'Collector' that stores data in Postgres, for instance to use while developing locally, or if you want to get set up quickly. But there’s also full integration with Snowplow, which works seamlessly with an existing Snowplow setup as well.

  • machine-learning-roadmap

    A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.

    Project mention: 100+ Must Know Github Repositories For Any Programmer | | 2022-11-17

    7. Machine Learning Roadmap

  • Tabulator

    Interactive Tables and Data Grids for JavaScript

    Project mention: Can you recommend a React table components with virtualization and array indices (instead of key-value pairs)? | /r/reactjs | 2023-01-22

    We use Tabulator for a similar use case at my job; it has a virtual DOM to support thousands of rows pretty well. It's not React-based, it's a vanilla JS library. There is a React wrapper here but I don't have experience with that.

  • knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How do you document a ML research? | /r/mlops | 2022-08-30

    While it's a start, I feel that just being a Markdown editor is not enough; GitHub and GitLab already have this sort of wiki. I feel something like provides a better experience, since it gives Data Scientists an incentive to make their source notebook well documented and be an SSoT. With a Wiki like, if you change something in the original project, you need to remind yourself to update your reports. If your notebook is itself your report, that's not necessary. Plus, it would benefit from the Semantic Diffs that DagsHub has already implemented.

  • Parsr

    Transforms PDF, Documents and Images into Enriched Structured Data

    Project mention: PDF GPT allows you to chat with the contents of your PDF file | /r/ChatGPTPro | 2023-04-12

    I would check out (what lang chain uses) or (probably what unstructured copied to get their startup off the ground lol)

  • Countly

    Countly is a product analytics platform that helps teams track, analyze and act on their user actions and behaviour on mobile, web and desktop applications.

    Project mention: Solution for logging and viewing the logs remotely | /r/FlutterDev | 2023-05-08

    I suggest you use Countly. They have a free edition (Community), and it's self-hosted.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data.

    Project mention: Data sources episode 2: AWS S3 to Postgres Data Sync using Singer | | 2023-05-04

    Link to original blog:

  • altair

    ✨⚡️ A beautiful feature-rich GraphQL Client for all platforms. (by altair-graphql)

  • browser-compat-data

    This repository contains compatibility data for Web technologies as displayed on MDN

    Project mention: A new home for the Project Fugu API Showcase | /r/webdev | 2023-03-06

    Yes, Mozilla’s browser-compat-data is the authoritative source.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-06-05.


What are some of the best open-source Data projects? This list will help you:

Rank  Project                   Stars
1     TanStack Query            34,808
2     SheetJS js-xlsx           32,942
3     Metabase                  32,584
4     SWR                       26,864
5     data                      16,226
6     Prefect                   12,078
7     awesome-bigdata           12,040
8     airbyte                   10,796
9     chinese-xinhua            10,089
10    faker                     9,551
11    semantic-source           8,680
12    Bogus                     7,287
13    prql                      7,172
14    akshare                   6,578
15    Snowplow                  6,473
16    machine-learning-roadmap  6,434
17    Tabulator                 5,458
18    knowledge-repo            5,332
19    Parsr                     5,301
20    Countly                   5,167
21    Mage                      4,778
22    altair                    4,711
23    browser-compat-data       4,503