Top 23 Data Open-Source Projects

  • sheetjs

    SheetJS Community Edition -- Spreadsheet Data Toolkit

  • metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company

  • react-query

    ⚛️ Hooks for fetching, caching and updating asynchronous data in React

    Latest mention: What does this error mean and how do I resolve it? | reddit.com/r/typescript | 2021-01-14

    The error seems to come from the function overloads in the useMutation() source code. TypeScript matches your call against the overload whose first argument is a MutationKey, but you are passing a function that returns a promise (a MutationFunction), so that specific error message is arguably misleading. Setting the MutationFunction generics explicitly may help TypeScript figure out your types.
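
    The overload confusion described in this mention can be reproduced in a small sketch. Everything below is a hypothetical, simplified stand-in: makeMutation, MutationSketch and Todo are illustrative names, and the two overloads are an assumption about the general shape of the problem, not react-query's actual signatures. It shows how passing the generics explicitly tells the compiler which types are meant.

```typescript
// Hypothetical stand-ins for the library's types (assumed for illustration).
type MutationKey = string | readonly unknown[];
type MutationFunction<TData, TVariables> = (variables: TVariables) => Promise<TData>;

interface MutationSketch<TData, TVariables> {
  key?: MutationKey;
  run: MutationFunction<TData, TVariables>;
}

// Overload 1: the mutation function only.
function makeMutation<TData, TVariables>(
  fn: MutationFunction<TData, TVariables>
): MutationSketch<TData, TVariables>;
// Overload 2: a key followed by the mutation function.
function makeMutation<TData, TVariables>(
  key: MutationKey,
  fn: MutationFunction<TData, TVariables>
): MutationSketch<TData, TVariables>;
// Single implementation that tells the two call shapes apart at runtime.
function makeMutation<TData, TVariables>(
  a: MutationKey | MutationFunction<TData, TVariables>,
  b?: MutationFunction<TData, TVariables>
): MutationSketch<TData, TVariables> {
  if (typeof a === "function") {
    return { run: a };
  }
  if (!b) {
    throw new Error("a mutation function is required when a key is given");
  }
  return { key: a, run: b };
}

interface Todo {
  id: number;
  title: string;
}

// Explicit generics: TData = Todo, TVariables = string.
// The compiler now checks the callback against MutationFunction<Todo, string>.
const addTodo = makeMutation<Todo, string>(async (title) => ({ id: 1, title }));

// The keyed form resolves to the second overload.
const keyedAddTodo = makeMutation<Todo, string>("todos", async (title) => ({ id: 2, title }));
```

    With the generics spelled out, a type mismatch is reported against the callback itself rather than against an overload the caller never meant to use.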

  • data

    Data and code behind the articles and graphics at FiveThirtyEight

    Latest mention: Sometimes you really do get what you deserve | reddit.com/r/instantkarma | 2021-01-12

    First let's take a look at criminals indicted in each administration. Data here.

  • awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

  • semantic

    Parsing, analyzing, and comparing source code across many languages

    Latest mention: New Haskell Foundation to Foster Haskell Adoption, Raises 200k USD | news.ycombinator.com | 2020-12-27

    PostgREST is written in Haskell and quite popular: https://postgrest.org/en/v7.0.0/

    Shellcheck has become a very important tool to many people writing shell scripts: https://github.com/koalaman/shellcheck

    GitHub's semantic analysis is written in Haskell: https://github.com/github/semantic

    I'm not sure why Pandoc doesn't count.

    I think it is definitely the case that most examples of successful Haskell are technical programs of interest to hackers. I don't think that's much of a problem myself.

  • OpenRefine

    OpenRefine is a free, open source power tool for working with messy data and improving it

  • countly-server

    Countly helps you get insights from your application. Available self-hosted or on private cloud.

  • tabulator

    Interactive Tables and Data Grids for JavaScript

  • Bogus

    A simple and sane fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

  • react-refetch

    A simple, declarative, and composable way to fetch data for React components

  • mimesis

    Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.

  • ckan

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, europeandataportal.eu/data, data.humdata.org among many other sites.

  • akshare

    AkShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library.

  • awesome-json-datasets

    A curated list of awesome JSON datasets that don't require authentication.

  • scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

    Latest mention: ELT, Data Pipeline | dev.to | 2021-01-01

    To counter the problem mentioned above, we decided to move our data to a Pub/Sub-based stream model, where we would continue to push data as it arrives. Since fluentd is the primary tool used on all our servers to gather data, we did not replace it; instead we used its plugin architecture to stream data into a sink of our choosing. Initially we leaned towards Google PubSub and Google Dataflow, because our data scientists and engineers use BigQuery extensively and keeping the data in the same cloud made sense. The inspiration for using these tools came from Spotify’s Event Delivery – The Road to the Cloud.

    We set up Google PubSub and Dataflow on one of our staging servers. Neither really worked out for us. The PubSub model requires a Subscriber to be available for the Topic a Publisher streams messages to; otherwise the messages are not stored. On top of that, there was no way to see which messages were arriving. The weirdest thing we encountered was that the Topic would be orphaned, losing its subscribers, when working with Dataflow.

    We might have managed to live with PubSub; the wall in our path was Dataflow. We started off using SCIO from Spotify to work with Dataflow. There is a considerable lack of documentation for it, and we found the community on GitHub to be very reserved, something quite evident in the world of Scala, for which they came up with a Code of Conduct for its user base to follow. We needed Dataflow to support a batch write option to GCS. After we tried and failed to achieve that with Dataflow, Google's staff on StackOverflow were quite responsive; they confirmed that this was not available with Dataflow, and that streaming data to BigQuery, Datastore or Bigtable as a datastore was the available option. We did not do that because we wanted to avoid the high streaming cost of storing data in these services, as the majority of our data team's jobs are based on batched hourly data. The initial proposal for the updated pipeline is shown below.

  • stats

    A well tested and comprehensive Golang statistics library package with no dependencies.

  • lens

    Lenses, Folds, and Traversals - Join us on freenode #haskell-lens

  • kea

    Production Ready State Management for React

  • kiba

    Data processing & ETL framework for Ruby

    Latest mention: Ruby 3.0.0 Released | news.ycombinator.com | 2020-12-24

    In Ruby you have the async gems from https://github.com/socketry/async

    I don't know about data manipulation, but I guess Ruby has that too. Maybe this: https://github.com/thbar/kiba

  • gofakeit

    Random fake data generator written in Go

    Latest mention: Gofakeit v6. Now supports concurrency and crypto/rand | news.ycombinator.com | 2021-01-18

    The new v6 release supports a localized rand, which allows random data to be generated concurrently. crypto/rand has been integrated as well, and if you have a custom rand that fulfills the math/rand Source64 interface, you can use it with gofakeit. Let me know your thoughts.

    https://github.com/brianvoe/gofakeit

  • cubes

    Light-weight Python OLAP framework for multi-dimensional data analysis

  • CodeSearchNet

    Datasets, tools, and benchmarks for representation learning of code.

    Latest mention: Speedtyper.dev: Type racing for programmers | reddit.com/r/webdev | 2021-01-16

    https://github.com/github/CodeSearchNet#downloading-data-from-s3

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Data projects? This list will help you:

Rank  Project                Stars
1     sheetjs                24,178
2     metabase               23,468
3     react-query            16,815
4     data                   14,407
5     awesome-bigdata        9,615
6     semantic               7,889
7     OpenRefine             7,794
8     countly-server         4,660
9     tabulator              3,754
10    Bogus                  3,634
11    react-refetch          3,386
12    mimesis                3,192
13    ckan                   2,850
14    akshare                2,813
15    awesome-json-datasets  2,186
16    scio                   2,047
17    stats                  1,884
18    lens                   1,716
19    kea                    1,690
20    kiba                   1,517
21    gofakeit               1,388
22    cubes                  1,379
23    CodeSearchNet          1,346