Top 23 Data Open-Source Projects

  • GitHub repo SheetJS js-xlsx

    SheetJS Community Edition -- Spreadsheet Data Toolkit

    Project mention: Is there way to create hierarchical CSV file in JavaScript | reddit.com/r/learnprogramming | 2021-10-13

    That said, if Excel is your goal, you can create Excel files directly from JavaScript using xlsx.js. There's a lot of information there, but I've used this a lot to create and/or read from Excel files in JavaScript.

  • GitHub repo Metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company

    Project mention: Ask HN: Who is hiring? (October 2021) | news.ycombinator.com | 2021-10-01

    Metabase | https://metabase.com | REMOTE | Full-time | All Roles

  • GitHub repo react-query

    ⚛️ Hooks for fetching, caching and updating asynchronous data in React

    Project mention: Redux data fetching solution for a small app | reddit.com/r/reactjs | 2021-10-13

    Perhaps: https://react-query.tanstack.com/

  • GitHub repo data

    Data and code behind the articles and graphics at FiveThirtyEight

    Project mention: What are some examples of late bloomer players that aren't Kawhi? | reddit.com/r/nba | 2021-10-10

    #39 in RAPTOR (source)

  • GitHub repo awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

  • GitHub repo OpenRefine

    OpenRefine is a free, open source power tool for working with messy data and improving it

    Project mention: Reduce words in python which have repeating characters at the end? | reddit.com/r/learnpython | 2021-08-28

    Honestly, this sort of thing is really hard to do programmatically for reasons mentioned below ("bumblebee", "loooove"). You can do it, but if this is the sort of thing you need to do often on large datasets there is a great tool for it: https://openrefine.org/

  • GitHub repo semantic-source

    Parsing, analyzing, and comparing source code across many languages

    Project mention: I just published an experimental `tree-sitter` grammar for Swift! | reddit.com/r/swift | 2021-08-28

    Does anyone here have experience with tree-sitter? If you aren't familiar, tree-sitter is a parser generator tool that builds parsers to use with an incremental parsing library. It's what's responsible for AST parsing on GitHub, for instance.

  • GitHub repo knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How does everyone share their models etc. across teams for re-use effectively? | reddit.com/r/datascience | 2021-05-22

  • GitHub repo Bogus

    A simple and sane fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

    Project mention: Some useful Libraries for .NET projects | dev.to | 2021-10-02

    Bogus (GitHub, NuGet): Install-Package Bogus -Version 33.1.1

  • GitHub repo machine-learning-roadmap

    A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.

    Project mention: Machine Learning Roadmap | reddit.com/r/learnmachinelearning | 2021-06-15

    Original article here: https://github.com/mrdbourke/machine-learning-roadmap

  • GitHub repo Countly

    Countly helps you get insights from your application. Available self-hosted or on private cloud.

    Project mention: Azure Notification Hub for On Premises (Recommendations) | reddit.com/r/dotnet | 2021-04-07

    You could use something like https://count.ly/ for push notifications to iOS/Android

  • GitHub repo airbyte

    Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

    Project mention: Help with designing data pipeline | reddit.com/r/dataengineering | 2021-10-20

    airbyte https://github.com/airbytehq/airbyte sounds like a convenient, well-supported and most importantly open-source solution for your use case of extracting and loading between different DBs

  • GitHub repo Tabulator

    Interactive Tables and Data Grids for JavaScript

    Project mention: How can I add a trend section to my table? | reddit.com/r/learnjavascript | 2021-07-23

    For short tests I use the console.table() command and see the table in the debug console of the software Visual Studio Code. But in the end I would like to see it in a browser, so I am using Tabulator to put it into a

  • GitHub repo akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! (Open-source financial data interface library)

  • GitHub repo altair

    ✨⚡️ A beautiful feature-rich GraphQL Client for all platforms. (by altair-graphql)

    Project mention: GraphQL concepts and querying an endpoint - please help | reddit.com/r/graphql | 2021-09-28

    If you just want to test out queries against your server without having to write them in a script, you could use something like GraphiQL (another commenter shared the link for that), Postman, or my personal favorite, Altair.

  • GitHub repo react-refetch

    A simple, declarative, and composable way to fetch data for React components

  • GitHub repo Mimesis

    Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.

    Project mention: Mimesis is a fake data generator that can be used in Data Science for generating dummy datasets. | reddit.com/r/datascience | 2021-04-03

  • GitHub repo data-transfer-project

    The Data Transfer Project makes it easy for people to transfer their data between online service providers. We are establishing a common framework, including data models and protocols, to enable direct transfer of data both into and out of participating online service providers.

    Project mention: De-monopolizing the internet through interoperability | reddit.com/r/france | 2021-10-12

  • GitHub repo CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

    Project mention: We are digitisers at the Natural History Museum in London, on a mission to digitise 80 million specimens and free their data to the world. Ask us anything! | reddit.com/r/datasets | 2021-03-08

    We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data.

  • GitHub repo datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

    Project mention: We built a pi controlled hydroponics box that grows your plants 1.5x faster using ML | reddit.com/r/raspberry_pi | 2021-04-26

    but it looks like none of your plants are supported by the plantvillage model, or do I understand something wrong? https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/plant_village.py#L57

  • GitHub repo Parsr

    Transforms PDF, Documents and Images into Enriched Structured Data

    Project mention: [D] What pdf parser do you use for paragraph parsing for huggingface models | reddit.com/r/MachineLearning | 2021-07-13

    Parsing PDFs is a very non-trivial process. Google's and Amazon's parsers are largely based on OCR. There are some advanced state-of-the-art NN-based OCR approaches, but they are not very stable. A stable industry standard is Tesseract, and a nice all-in-one open-source tool that brings a ton of tools together is https://github.com/axa-group/Parsr . Hope this helps.

  • GitHub repo awesome-json-datasets

    A curated list of awesome JSON datasets that don't require authentication.

  • GitHub repo Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

    Project mention: ETL Pipelines with Airflow: The Good, the Bad and the Ugly | news.ycombinator.com | 2021-10-08

    If you prefer Scala, then you can try Scio: https://github.com/spotify/scio.
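
The OpenRefine mention in the list above says that reducing words with repeated trailing characters is hard to do programmatically, citing "bumblebee" and "loooove". A minimal Python sketch of the naive approach shows why. The function name `collapse_trailing_repeats` is hypothetical, and the regex is only one possible reading of the question, not the method used by the original poster or by OpenRefine.

```python
import re

# Matches one character repeated two or more times at the very end of a word.
_TRAILING_RUN = re.compile(r"(.)\1+$")


def collapse_trailing_repeats(word: str) -> str:
    """Reduce a run of one repeated character at the end of `word` to a single character."""
    return _TRAILING_RUN.sub(r"\1", word)


if __name__ == "__main__":
    for word in ["yesss", "bumblebee", "loooove"]:
        print(word, "->", collapse_trailing_repeats(word))
    # yesss -> yes            (intended behaviour)
    # bumblebee -> bumblebe   (a legitimate double letter is damaged)
    # loooove -> loooove      (the repeated run is not at the end, so it is missed)
```

The rule cannot tell an elongated word from a correctly spelled one, which is the difficulty the commenter points at. Handling such cases generally needs a dictionary or a clustering step rather than a single pattern.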

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-10-20.

Index

What are some of the best open-source Data projects? This list will help you:

#   Project                    Stars
1   SheetJS js-xlsx            27,581
2   Metabase                   26,301
3   react-query                23,174
4   data                       15,118
5   awesome-bigdata            10,315
6   OpenRefine                 8,409
7   semantic-source            8,154
8   knowledge-repo             4,905
9   Bogus                      4,893
10  machine-learning-roadmap   4,853
11  Countly                    4,824
12  airbyte                    4,281
13  Tabulator                  4,157
14  akshare                    4,108
15  altair                     3,703
16  react-refetch              3,411
17  Mimesis                    3,377
18  data-transfer-project      3,249
19  CKAN                       3,156
20  datasets                   3,010
21  Parsr                      2,708
22  awesome-json-datasets      2,382
23  Scio                       2,206