Open-source projects categorized as Data
Related topics: #Network #Haskell #JSON #Comonads #Hw

Top 23 Data Open-Source Projects

  • GitHub repo SheetJS js-xlsx

    :green_book: SheetJS Community Edition -- Spreadsheet Data Toolkit

    Project mention: How can I map data from an excel sheet through react? | reddit.com/r/reactjs | 2021-02-19

    React is a library that only cares about rendering JavaScript data to the DOM and related concerns. If you need to read an Excel file and transform it into a JavaScript object, you need to use the HTML file picker and, once you have loaded the blob, a library to parse it. Last time I used https://github.com/SheetJS/sheetjs
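The file-picker-then-parse flow described in this mention can be sketched roughly as follows. This is a minimal sketch, not the poster's code: `rowsFromSpreadsheet`, `handleFileChange`, and `setRows` are hypothetical names, and the SheetJS module is passed in as the `xlsx` argument (for example the result of `require('xlsx')`) instead of being imported here. The calls used on it are `read`, `SheetNames`, `Sheets`, and `utils.sheet_to_json`.

```javascript
// Sketch: turn the bytes of a spreadsheet into an array of row objects.
// `xlsx` is the SheetJS module, supplied by the caller (assumption: it was
// loaded elsewhere, e.g. with require('xlsx')).
function rowsFromSpreadsheet(xlsx, bytes) {
  // Parse the raw bytes (ArrayBuffer or Uint8Array) into a workbook.
  const workbook = xlsx.read(bytes, { type: 'array' });
  // Use the first worksheet; a real app might let the user choose one.
  const firstSheetName = workbook.SheetNames[0];
  const sheet = workbook.Sheets[firstSheetName];
  // One object per data row, keyed by the values in the header row.
  return xlsx.utils.sheet_to_json(sheet);
}

// Hypothetical change handler for an <input type="file"> element.
// `setRows` stands in for a React state setter.
async function handleFileChange(xlsx, event, setRows) {
  const file = event.target.files[0];
  if (!file) return;
  // A File is a Blob; arrayBuffer() resolves with its contents.
  const bytes = await file.arrayBuffer();
  setRows(rowsFromSpreadsheet(xlsx, bytes));
}
```

Once the rows are in component state, React can render them like any other array, for instance by mapping each row object to a table row.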

  • GitHub repo Metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:

    Project mention: Ask HN: Who is hiring? (March 2021) | news.ycombinator.com | 2021-03-01

    Metabase | https://metabase.com | REMOTE | Full-time | Backend, Frontend, Full Stack, and DevOps engineers

    We're hiring for multiple positions across the Engineering team (and in many other non-engineering roles).

    Metabase is open source analytics software that makes it easy for anyone in your company to ask questions of your data. It interfaces with a number of databases / data warehouses (BigQuery, Redshift, Snowflake, Postgres, MySQL, etc) and has a simple and powerful UI and UX that sits on top of it.

  • GitHub repo react-query

    ⚛️ Hooks for fetching, caching and updating asynchronous data in React

    Project mention: Make a Progressive Web App with React | dev.to | 2021-03-01

    // /src/MadLib.js
    import { useQuery } from 'react-query';
    import { useParams, Link } from 'react-router-dom';
    import { useState, useEffect } from 'react';
    import BlockContent from '@sanity/block-content-to-react';
    import { sanity, imageUrlBuilder } from './sanity';
    import styles from './MadLib.module.css';

    const query = `
      *[ _type == 'madLib' && slug.current == $slug ]
    `;

    function MadLib() {
      // this variable is populated from `react-router` which pulls it from the URL
      const { slug } = useParams();

      // data is fetched from sanity via the sanity client and stored into
      // application state via react-query. note that the slug is used as the
      // "query key": https://react-query.tanstack.com/guides/query-keys
      const { data = [] } = useQuery(slug, () => sanity.fetch(query, { slug }));

      // we'll use destructuring assignment to return the first mad lib
      const [madLib] = data;

      // this will store the state of the answers of this mad lib
      const [answers, setAnswers] = useState(
        // if the items exist in localStorage, then
        localStorage.getItem(slug)
          ? // then set the initial state to that value
            JSON.parse(localStorage.getItem(slug))
          : // otherwise, set the initial state to an empty object
            {},
      );

      // this is a react "effect" hook: https://reactjs.org/docs/hooks-effect.html
      // we use this to watch for changes in the `slug` or `answers` variables and
      // update local storage when those change.
      useEffect(() => {
        localStorage.setItem(slug, JSON.stringify(answers));
      }, [slug, answers]);

      if (!madLib) {
        return <h1>Loading…</h1>;
      }

      // once the mad lib is loaded, we can map through the structured content to
      // find our placeholder shape. the end result is an array of these placeholders
      const placeholders = madLib?.story
        .map((block) => block.children.filter((n) => n._type === 'placeholder'))
        .flat();

      // using the above placeholders array, we calculate whether or not all the
      // blanks are filled in by checking whether every placeholder has a value
      // in the `answers` state variable.
      const allBlanksFilledIn = placeholders?.every(
        (placeholder) => answers[placeholder._key],
      );

      return (
        <>
          <h2>{madLib.title}</h2>
          {!allBlanksFilledIn ? (
            // if all the blanks are _not_ filled in, then we can show the form
            <>
              <p>Fill in the blank!</p>
              <p>When you're done, the finished mad lib will appear.</p>
              <form
                onSubmit={(e) => {
                  e.preventDefault();
                  const answerEntries = Array.from(
                    // find all the inputs
                    e.currentTarget.querySelectorAll('input'),
                  )
                    // then get the name and values in a tuple
                    .map((inputEl) => [inputEl.name, inputEl.value]);
                  // use `Object.fromEntries` to transform them back to an object
                  const nextAnswers = Object.fromEntries(answerEntries);
                  setAnswers(nextAnswers);
                }}
              >
                <ul>
                  {/* for each placeholder… */}
                  {placeholders.map(({ _key, type }) => (
                    <li key={_key}>
                      {/* …render an input and a label. */}
                      <input name={_key} id={_key} />
                      <label htmlFor={_key}>{type}</label>
                    </li>
                  ))}
                </ul>
                <button>Submit!</button>
              </form>
            </>
          ) : (
            // if all the blanks are filled in, then we can show the rendered
            // story with a custom serializer for the type `placeholder`
            <>
              <BlockContent
                blocks={madLib.story}
                serializers={{
                  types: { placeholder: ({ node: { _key } }) => answers[_key] },
                }}
              />
              <button
                onClick={() => {
                  // we reset the state on click after the user confirms it's okay.
                  if (window.confirm('Are you sure you want to reset?')) {
                    setAnswers({});
                  }
                }}
              >
                Reset
              </button>
              {/* this is a simple link back to the main mad libs index */}
              <Link to="/">← More Mad Libs</Link>
            </>
          )}
        </>
      );
    }

    export default MadLib;

  • GitHub repo data

    Data and code behind the articles and graphics at FiveThirtyEight

    Project mention: Our Data | fivethirtyeight | reddit.com/r/datascience | 2021-02-26

  • GitHub repo awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

    Project mention: Foundational Distributed Systems Papers | news.ycombinator.com | 2021-03-01

    From "Ask HN: Learning about distributed systems?" https://news.ycombinator.com/item?id=23932271 :

    > Papers-we-love > Distributed Systems: https://github.com/papers-we-love/papers-we-love/tree/master...

    > awesome-distributed-systems also has many links to theory: https://github.com/theanalyst/awesome-distributed-systems

    > awesome-bigdata lists a number of tools: https://github.com/onurakpolat/awesome-bigdata

  • GitHub repo semantic-source

    Parsing, analyzing, and comparing source code across many languages

    Project mention: Saw a Tweet about Haskell+Servant being replaced with NodeJS in a project due to compile times - will compile times ever get better? | reddit.com/r/haskell | 2021-02-23

    Most languages’ compilers aren’t powerful enough for tradeoffs to enter the equation. GHC is an exception: the more you ask of the compiler, the more you have to pay in compile times. An example of a PR that gives up concision for compile time speed is here.

  • GitHub repo OpenRefine

    OpenRefine is a free, open source power tool for working with messy data and improving it

    Project mention: had the toughest IF statement of my life yesterday | reddit.com/r/ProgrammerHumor | 2021-03-02

  • GitHub repo Countly

    Countly helps you get insights from your application. Available self-hosted or on private cloud.

    Project mention: Firebase Analytics Alternative for Kid's Category iOS App | reddit.com/r/iosdev | 2021-02-16

    Depending on the server resources at your disposal, you could roll out your own installation of the Countly server together with its iOS SDK. That way you are not using a third party solution for analytics so you are not breaking the rules while still obtaining analytics data.

  • GitHub repo Bogus

    :card_index: A simple and sane fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

    Project mention: Easily Create Mock Data for Unit Tests | dev.to | 2021-02-25

    In the previous example, we hard-coded values for the mock values. There is an excellent nuget package that can generate fake data that will resemble real data: Bogus. You can easily integrate Bogus, and use it inside the builder constructor for initial values.
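The builder-with-generated-defaults idea from this mention can be sketched as follows. The sketch is written in JavaScript to match the other code on this page, not in C# with Bogus itself, and everything in it is hypothetical: `makeRandom`, `pick`, `UserBuilder`, and the name lists are illustration-only stand-ins for what a fake-data library would provide.

```javascript
// Small seeded pseudo-random generator so that test data is reproducible.
// Returns a function producing numbers in [0, 1).
function makeRandom(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical sample data; a fake-data library would supply far more.
const FIRST_NAMES = ['Ada', 'Grace', 'Alan', 'Edsger'];
const LAST_NAMES = ['Lovelace', 'Hopper', 'Turing', 'Dijkstra'];

// Choose one element of `list` using the supplied generator.
function pick(random, list) {
  return list[Math.floor(random() * list.length)];
}

class UserBuilder {
  // The constructor fills every field with a generated default,
  // so a test only overrides the fields it actually cares about.
  constructor(random) {
    const first = pick(random, FIRST_NAMES);
    const last = pick(random, LAST_NAMES);
    this.user = {
      name: `${first} ${last}`,
      email: `${first}.${last}@example.com`.toLowerCase(),
      age: 18 + Math.floor(random() * 60),
    };
  }

  // Override a single field and keep the builder chainable.
  withEmail(email) {
    this.user.email = email;
    return this;
  }

  // Return a copy so later builder calls cannot mutate built objects.
  build() {
    return { ...this.user };
  }
}
```

A test would then write something like `new UserBuilder(makeRandom(42)).withEmail('fixed@example.com').build()` and get realistic-looking values for every field it did not set explicitly.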

  • GitHub repo Tabulator

    Interactive Tables and Data Grids for JavaScript

  • GitHub repo react-refetch

    A simple, declarative, and composable way to fetch data for React components

  • GitHub repo Mimesis

    Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.

  • GitHub repo akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library.

  • GitHub repo CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

  • GitHub repo awesome-json-datasets

    A curated list of awesome JSON datasets that don't require authentication.

  • GitHub repo Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

    Project mention: ELT, Data Pipeline | dev.to | 2021-01-01

    To counter the above-mentioned problem, we decided to move our data to a Pub/Sub-based stream model, where we would continue to push data as it arrives. As fluentd is the primary tool used on all our servers to gather data, rather than replacing it we leveraged its plugin architecture to stream data into a sink of our choosing. Initially our inclination was towards Google PubSub and Google Dataflow, as our data scientists and engineers use BigQuery extensively and keeping the data in the same cloud made sense. The inspiration for using these tools came from Spotify's Event Delivery – The Road to the Cloud.

    We did the setup on one of our staging servers with Google PubSub and Dataflow. Neither really worked out for us: the PubSub model requires a subscriber to be available for the topic a publisher streams messages to, otherwise the messages are not stored. On top of that, there was no way to see which messages were arriving. The weirdest thing we encountered was that the topic would be orphaned, losing its subscribers, when working with Dataflow. PubSub we might have managed to live with; the wall in our path was Dataflow.

    We started off using SCIO from Spotify to work with Dataflow. There is a considerable lack of documentation for it, and we found the community to be very reserved on GitHub, something quite evident in the world of Scala, for which they came up with a Code of Conduct for its user base to follow. We also needed Dataflow to support a batch write option to GCS. After we tried and failed to achieve that with Dataflow, Google's staff on Stack Overflow were quite responsive, and their response confirmed that it was not available with Dataflow and that streaming data to BigQuery, Datastore or Bigtable as a datastore was the option to use. The reason we didn't do that was to avoid the high cost of streaming to these services to store data, as the majority of our data team's jobs are based on batched hourly data. The initial proposal for the updated pipeline is shown below.
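The subscriber requirement described in this mention can be illustrated with a toy in-memory model. This is only a sketch of the behaviour the post describes, not the Google Cloud Pub/Sub client API: `ToyTopic` and its methods are hypothetical names, and the model simply assumes that a message published while no subscription exists is discarded.

```javascript
// Toy model of a topic that only retains messages for existing subscriptions.
class ToyTopic {
  constructor() {
    // subscription name -> messages waiting to be pulled
    this.queues = new Map();
    // count of messages published while nobody was subscribed
    this.dropped = 0;
  }

  // Register a subscription; it only sees messages published afterwards.
  subscribe(name) {
    if (!this.queues.has(name)) this.queues.set(name, []);
  }

  // Deliver a copy of the message to every subscription.
  // Returns false when the message was dropped for lack of subscribers.
  publish(message) {
    if (this.queues.size === 0) {
      this.dropped += 1;
      return false;
    }
    for (const queue of this.queues.values()) queue.push(message);
    return true;
  }

  // Drain and return everything pending for one subscription.
  pull(name) {
    const queue = this.queues.get(name) || [];
    return queue.splice(0, queue.length);
  }
}
```

In this model, an event published before the hourly batch job has created its subscription never reaches that job, which is the kind of loss the post describes.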

  • GitHub repo Stats

    A well tested and comprehensive Golang statistics library package with no dependencies.

  • GitHub repo PyFunctional

    Python library for creating data pipelines with chain functional programming

    Project mention: The pipe-operator to python |> | reddit.com/r/Python | 2021-02-21

  • GitHub repo airbyte

    Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

    Project mention: Meltano ELT: Open-Source DataOps for the DevOps Era | news.ycombinator.com | 2021-02-28

    They have the right idea: it's a real need and a major opportunity. Airbyte[1] is many steps ahead. I also expect that the Airflow community will wade into these waters at some point.

    1. https://airbyte.io/

  • GitHub repo lens

    Lenses, Folds, and Traversals - Join us on freenode #haskell-lens (by ekmett)

  • GitHub repo gofakeit

    Random fake data generator written in go

    Project mention: Gofakeit v6. Now supports concurrency and crypto/rand | news.ycombinator.com | 2021-01-18

    The new v6 release adds support for a localized rand, which allows random data to be generated concurrently. crypto/rand has also been integrated, and if you have a custom rand that fulfills the math/rand Source64 interface, you can use it with gofakeit. Let me know your thoughts.


  • GitHub repo kea

    Production Ready State Management for React

  • GitHub repo Kiba

    Data processing & ETL framework for Ruby

    Project mention: Ruby 3.0.0 Released | news.ycombinator.com | 2020-12-24

    In Ruby you have the async gems from https://github.com/socketry/async

    I don't know about data manipulation, but I guess Ruby has that too. Maybe this: https://github.com/thbar/kiba

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-03-02.


What are some of the best open-source Data projects? This list will help you:

Rank  Project                 Stars
   1  SheetJS js-xlsx        24,750
   2  Metabase               23,964
   3  react-query            17,949
   4  data                   14,562
   5  awesome-bigdata         9,731
   6  semantic-source         7,943
   7  OpenRefine              7,905
   8  Countly                 4,689
   9  Bogus                   4,169
  10  Tabulator               3,830
  11  react-refetch           3,387
  12  Mimesis                 3,234
  13  akshare                 3,120
  14  CKAN                    2,902
  15  awesome-json-datasets   2,232
  16  Scio                    2,092
  17  Stats                   1,922
  18  PyFunctional            1,790
  19  airbyte                 1,755
  20  lens                    1,736
  21  gofakeit                1,699
  22  kea                     1,698
  23  Kiba                    1,544