Top 23 Data Open-Source Projects

  • GitHub repo SheetJS js-xlsx

    :green_book: SheetJS Community Edition -- Spreadsheet Data Toolkit

    Project mention: [TypeScript] Read spreadsheets by SheetJS | dev.to | 2021-07-21

    SheetJS - Home

  • GitHub repo Metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:

    Project mention: The Open Source Directory (Open Source Alternatives to Popular B2B Tools) | news.ycombinator.com | 2021-07-21

    Do you know the differences with https://github.com/metabase/metabase ?

  • GitHub repo react-query

    ⚛️ Hooks for fetching, caching and updating asynchronous data in React

    Project mention: React Native app built with zustand and tailwind | dev.to | 2021-07-18

    I took a look at react-query, zustand, jotai and recoil, eventually settling on zustand.

  • GitHub repo data

    Data and code behind the articles and graphics at FiveThirtyEight

    Project mention: HASH or RANGE on distributed databases | dev.to | 2021-07-13

    python -c "from sqlalchemy import create_engine;import pandas;import io;import urllib.request; pandas.read_csv(io.StringIO(urllib.request.urlopen('https://github.com/fivethirtyeight/data/blob/master/avengers/avengers.csv?raw=True').read().decode('utf-8', errors='ignore'))).to_sql('avengers', create_engine('postgresql+psycopg2://franck:[email protected]:5433/yugabyte'), if_exists='replace', method='multi')"

  • GitHub repo awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

  • GitHub repo OpenRefine

    OpenRefine is a free, open source power tool for working with messy data and improving it

    Project mention: Great tools to show ETL, Data Visualisation and maybe some Analytics to managers? | reddit.com/r/dataengineering | 2021-05-03

    Interesting, how does that compare to OpenRefine (free software)?

  • GitHub repo semantic-source

    Parsing, analyzing, and comparing source code across many languages

    Project mention: Diffsitter: A tree-sitter based AST difftool to get meaningful semantic diffs | news.ycombinator.com | 2021-07-18

  • GitHub repo machine-learning-roadmap

    A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.

    Project mention: Machine Learning Roadmap | reddit.com/r/learnmachinelearning | 2021-06-15

    Original article here: https://github.com/mrdbourke/machine-learning-roadmap

  • GitHub repo knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How does everyone share their models etc. across teams for re-use effectively? | reddit.com/r/datascience | 2021-05-22

  • GitHub repo Countly

    Countly helps you get insights from your application. Available self-hosted or on private cloud.

    Project mention: Azure Notification Hub for On Premises (Recommendations) | reddit.com/r/dotnet | 2021-04-07

    You could use something like https://count.ly/ for push notifications to ios/android

  • GitHub repo Bogus

    :card_index: A simple and sane fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

    Project mention: Fake data generator for .NET with Bogus | dev.to | 2021-06-28

    Searching the internet led me to Bogus, a good library with a lot of great features:

  • GitHub repo Tabulator

    Interactive Tables and Data Grids for JavaScript

    Project mention: How can I add a trend section to my table? | reddit.com/r/learnjavascript | 2021-07-23

    For short tests I use the console.table() command and see the table in the debug console of the software Visual Studio Code. But in the end I would like to see it in a browser, so I am using Tabulator to put it into a

  • GitHub repo akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! 开源财经数据接口库

  • GitHub repo airbyte

    Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

    Project mention: Compression while syncing data using ELT tools like fivetran | reddit.com/r/dataengineering | 2021-07-01

    Airbyte has the S3 destination connector which can write data in Parquet, Avro, CSV, JSON, and compress it using one of many codecs. You can use it with any of the supported data sources

  • GitHub repo react-refetch

    A simple, declarative, and composable way to fetch data for React components

  • GitHub repo Mimesis

    Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.

    Project mention: Mimesis is a fake data generator that can be used in Data Science for generating dummy datasets. | reddit.com/r/datascience | 2021-04-03

  • GitHub repo data-transfer-project

    The Data Transfer Project makes it easy for people to transfer their data between online service providers. We are establishing a common framework, including data models and protocols, to enable direct transfer of data both into and out of participating online service providers.

    Project mention: Kopia vs Duplicacy | reddit.com/r/Backup | 2021-07-24

    Once/If Data Transfer Project takes off, I’m hoping I will be able to schedule cloud to cloud backups without transferring data home first.

  • GitHub repo CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

    Project mention: We are digitisers at the Natural History Museum in London, on a mission to digitise 80 million specimens and free their data to the world. Ask us anything! | reddit.com/r/datasets | 2021-03-08

    We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data.

  • GitHub repo datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

    Project mention: We built a pi controlled hydroponics box that grows your plants 1.5x faster using ML | reddit.com/r/raspberry_pi | 2021-04-26

    but it looks like none of your plants are supported by the plantvillage model, or do I understand something wrong? https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/plant_village.py#L57

  • GitHub repo Parsr

    Transforms PDF, Documents and Images into Enriched Structured Data

    Project mention: [D] What pdf parser do you use for paragraph parsing for huggingface models | reddit.com/r/MachineLearning | 2021-07-13

    Parsing PDFs is a very non-trivial process. Google's and Amazon's parsers are largely based on OCR. There are some advanced state-of-the-art NN-based OCR approaches, but they are not very stable. A stable industry standard is Tesseract, and a nice all-in-one open source tool that brings a ton of tools together is https://github.com/axa-group/Parsr . Hope this helps.

  • GitHub repo awesome-json-datasets

    A curated list of awesome JSON datasets that don't require authentication.

  • GitHub repo Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

    Project mention: ELT, Data Pipeline | dev.to | 2021-01-01

    To counter the problem mentioned above, we decided to move our data to a Pub/Sub-based stream model, where we would continue to push data as it arrives. As fluentd is the primary tool used on all our servers to gather data, rather than replacing it we leveraged its plugin architecture to stream data into a sink of our choosing. Initially our inclination was towards Google PubSub and Google Dataflow, as our data scientists and engineers use BigQuery extensively and keeping the data in the same cloud made sense. The inspiration for using these tools came from Spotify’s Event Delivery – The Road to the Cloud.

    We did the setup on one of our staging servers with Google PubSub and Dataflow. Neither really worked out for us. The PubSub model requires a Subscriber to be available for the Topic a Publisher streams messages to, otherwise the messages are not stored. On top of that, there was no way to see which messages were arriving. The weirdest thing we encountered was that the Topic would be orphaned, losing its subscribers, when working with Dataflow.

    PubSub we might have managed to live with; the wall in our path was Dataflow. We started off using SCIO from Spotify to work with Dataflow. There is a considerable lack of documentation for it, and we found the community to be very reserved on GitHub, something quite evident in the world of Scala, for which they came up with a Code of Conduct for its user base to follow. We also needed Dataflow to support a batch write option to GCS. After we tried and failed to achieve that with Dataflow, Google's staff on StackOverflow were quite responsive, and their response confirmed that this was not available with Dataflow and that streaming data to BigQuery, Datastore or Bigtable as a datastore was the option to use. The reason we didn't do that was to avoid the high streaming cost of storing data in these services, as the majority of our data team's jobs are based on batched hourly data. The initial proposal for the updated pipeline is shown below.

  • GitHub repo pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: Best quantitative tools/repos/apis for Sentiment & Social Media analysis of individual Stock/Crypto tickers | reddit.com/r/algotrading | 2021-07-03

    Also Yahoo continually takes steps to discourage programmatic access (the most recent attempt is happening right now: https://github.com/pydata/pandas-datareader/issues/868).

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-07-24.

Index

What are some of the best open-source Data projects? This list will help you:

#   Project                    Stars
1   SheetJS js-xlsx            26,226
2   Metabase                   25,388
3   react-query                21,411
4   data                       14,909
5   awesome-bigdata            10,108
6   OpenRefine                  8,263
7   semantic-source             8,089
8   machine-learning-roadmap    4,843
9   knowledge-repo              4,835
10  Countly                     4,790
11  Bogus                       4,662
12  Tabulator                   4,055
13  akshare                     3,767
14  airbyte                     3,542
15  react-refetch               3,407
16  Mimesis                     3,321
17  data-transfer-project       3,213
18  CKAN                        3,064
19  datasets                    2,911
20  Parsr                       2,670
21  awesome-json-datasets       2,347
22  Scio                        2,169
23  pandas-datareader           2,069