Top 23 Data Open-Source Projects
:green_book: SheetJS Community Edition -- Spreadsheet Data Toolkit
Project mention: Export to Excel | reddit.com/r/webdev | 2021-04-27
The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:
Project mention: Open source contributions for a Data Engineer? | reddit.com/r/dataengineering | 2021-04-16
If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.
⚛️ Hooks for fetching, caching and updating asynchronous data in React
Project mention: My go-to React libraries for 2021 | dev.to | 2021-05-14
For those who already know it, you might say that it’s not like Redux. It doesn’t do the same thing. React Query is a data fetching library, it makes fetching, caching, synchronizing and updating server state easy.
Data and code behind the articles and graphics at FiveThirtyEight
Project mention: Bundesliga CL Qualification Scenarios | reddit.com/r/soccer | 2021-05-12
These scenarios were automatically generated from the following data (from FiveThirtyEight), please DM me if you find any errors:
A curated list of awesome big data frameworks, resources and other awesomeness.
OpenRefine is a free, open source power tool for working with messy data and improving it
Project mention: Great tools to show ETL, Data Visualisation and maybe some Analytics to managers? | reddit.com/r/dataengineering | 2021-05-03
Interesting, how does that compare to OpenRefine (free software)?
Parsing, analyzing, and comparing source code across many languages
Project mention: Saw a Tweet about Haskell+Servant being replaced with NodeJS in a project due to compile times - will compile times ever get better? | reddit.com/r/haskell | 2021-02-23
Most languages’ compilers aren’t powerful enough for tradeoffs to enter the equation. GHC is an exception: the more you ask of the compiler, the more you have to pay in compile times. An example of a PR that gives up concision for compile time speed is here.
Countly helps you get insights from your application. Available self-hosted or on private cloud.
Project mention: Azure Notification Hub for On Premises (Recommendations) | reddit.com/r/dotnet | 2021-04-07
You could use something like https://count.ly/ for push notifications to ios/android
:card_index: A simple and sane fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.
Project mention: String similarity search and fast LIKE operator using pg_trgm | dev.to | 2021-05-11
I inserted 10M rows of fake data generated by Bogus into the table. You can download the dump here.
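The trigram matching that pg_trgm is named for can be illustrated with a short Python sketch. This is an approximation written for illustration, not the extension's actual code: it assumes words are lowercased, split on non-alphanumeric characters, and padded with two leading spaces and one trailing space before three-character windows are taken. The function names `trigrams` and `similarity` are chosen here for readability.

```python
import re


def trigrams(text):
    """Return the set of padded, lowercased trigrams for every word in text."""
    grams = set()
    # Assumption: a "word" is a run of ASCII letters or digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With these definitions, `similarity("word", "two words")` shares 4 of 11 distinct trigrams, which gives roughly 0.36.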
Even though I found Tabulator, I am having problems with it. Am I reading it right that showing results with Tabulator is only meant for browsers and not for NodeJS in Visual Studio Code (which is how I would do it)?
A simple, declarative, and composable way to fetch data for React components
Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.
Project mention: Mimesis is a fake data generator that can be used in Data Science for generating dummy datasets. | reddit.com/r/datascience | 2021-04-03
The Data Transfer Project makes it easy for people to transfer their data between online service providers. We are establishing a common framework, including data models and protocols, to enable direct transfer of data both into and out of participating online service providers.
Project mention: Apple now lets you transfer your iCloud Photos to Google Photos | news.ycombinator.com | 2021-03-04
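To give a feel for what a locale-aware fake data generator does, here is a minimal standard-library sketch. It does not use Mimesis or reproduce its API; the `NAMES` table, the `fake_people` function and the record fields are all hypothetical, and a real library ships far larger datasets per locale.

```python
import random

# Hypothetical per-locale name pools (first names, last names).
NAMES = {
    "en": (["Alice", "Bob", "Carol"], ["Smith", "Jones", "Brown"]),
    "de": (["Anna", "Lukas", "Mia"], ["Schmidt", "Weber", "Fischer"]),
}


def fake_people(count, locale="en", seed=None):
    """Generate `count` fake person records for the given locale."""
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    first_names, last_names = NAMES[locale]
    people = []
    for _ in range(count):
        first = rng.choice(first_names)
        last = rng.choice(last_names)
        people.append({
            "name": f"{first} {last}",
            "email": f"{first}.{last}@example.com".lower(),
            "age": rng.randint(18, 90),
        })
    return people
```

Passing the same `seed` twice yields the same records, which is handy when a dummy dataset has to be regenerated for a test or a notebook.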
That's interesting, and it's the first I've heard of the project. Unfortunately, it doesn't look like Apple has yet released the adaptors to the open source project. As much as I'm not interested in having Apple copy my photos to Google, I am very interested in scripting my own offline backups without having to make space for Photos.app to store all my photos on my laptop's SSD. Hopefully the adaptors are added to the project soon.
CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
Project mention: We are digitisers at the Natural History Museum in London, on a mission to digitise 80 million specimens and free their data to the world. Ask us anything! | reddit.com/r/datasets | 2021-03-08
We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data.
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)
Project mention: We built a pi controlled hydroponics box that grows your plants 1.5x faster using ML | reddit.com/r/raspberry_pi | 2021-04-26
but it looks like none of your plants are supported by the plantvillage model, or do I understand something wrong? https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/plant_village.py#L57
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Project mention: Anyone created a ETL pipeline from Facebook Graph API before ? | reddit.com/r/dataengineering | 2021-05-02
Thanks for the hevo.io spot, noticed it actually has page insights! I'm actually aiming to use airbyte.io in the end and help contribute a source connector, but I guess I'll stick with hevo.io at first.
A curated list of awesome JSON datasets that don't require authentication.
A Scala API for Apache Beam and Google Cloud Dataflow.
Project mention: ELT, Data Pipeline | dev.to | 2021-01-01
To counter the above-mentioned problem, we decided to move our data to a Pub/Sub-based stream model, where we would continue to push data as it arrives. As fluentd is the primary tool used on all our servers to gather data, rather than replacing it we leveraged its plugin architecture to stream data into a sink of our choosing.

Initially our inclination was towards Google PubSub and Google Dataflow, as our Data Scientists/Engineers use BigQuery extensively and keeping the data in the same cloud made sense. The inspiration for using these tools came from Spotify’s Event Delivery – The Road to the Cloud. We did the setup on one of our staging servers with Google PubSub and Dataflow. Neither really worked out for us: the PubSub model requires a Subscriber to be available for the Topic a Publisher streams messages to, otherwise the messages are not stored. On top of that, there was no way to see which messages were arriving. The weirdest thing we encountered during this was that the Topic would be orphaned, losing its subscribers, when working with Dataflow.

PubSub we might have managed to live with; the wall in our path was Dataflow. We started off using SCIO from Spotify to work with Dataflow. There is a considerable lack of documentation for it, and we found the community to be very reserved on GitHub, something quite evident in the world of Scala, for which they came up with a Code of Conduct for its user base to follow. One thing we required from Dataflow was support for batch writes to GCS. After we tried our hand at Dataflow with no success, Google's staff on Stack Overflow were quite responsive, and their response confirmed that this was not available with Dataflow, and that streaming data to BigQuery, Datastore or Bigtable as a datastore was the option to use.

The reason we didn't do that was to avoid the high streaming cost of storing data in these services, as the majority of the data team's jobs are based on batched hourly data. The initial proposal to the updated pipeline is shown below.
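Separately from the authors' pipeline, the hourly batching they mention can be sketched in a few lines of Python. This is a minimal illustration, assuming events arrive as `(timestamp, payload)` pairs; the function name `batch_by_hour` and the event shape are hypothetical and not taken from the original article.

```python
from collections import defaultdict
from datetime import datetime


def batch_by_hour(events):
    """Group (timestamp, payload) pairs into buckets keyed by the start of the hour."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Truncate the timestamp to the hour so all events in that hour share a key.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        batches[hour].append(payload)
    return dict(batches)


events = [
    (datetime(2021, 1, 1, 9, 5), "login"),
    (datetime(2021, 1, 1, 9, 40), "click"),
    (datetime(2021, 1, 1, 10, 2), "logout"),
]
hourly = batch_by_hour(events)
```

Each bucket could then be written out as one file per hour, which avoids paying per-record streaming costs when downstream jobs only run hourly.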
A well tested and comprehensive Golang statistics library package with no dependencies. (by montanaflynn)
Random fake data generator written in go
Project mention: Creating a PDF With Go, Maroto & Gofakeit | dev.to | 2021-05-01
Using mock data is a great way to speed up the prototyping process. We will use the GoFakeIt package to create a little dummy data generator to insert into our PDF.
Python library for creating data pipelines with chain functional programming
Project mention: PyFunctional makes creating data pipelines easy by using chained functional operators | reddit.com/r/Python | 2021-03-31
Random data generator.
Project mention: Looking for help setting up the statistics for an assignment | reddit.com/r/AskStatistics | 2021-03-29
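The idea of chaining functional operators into a pipeline can be shown with a small standard-library sketch. This is not PyFunctional's implementation or API; the `Pipeline` class and its methods are hypothetical stand-ins that only illustrate the chaining style.

```python
from functools import reduce


class Pipeline:
    """A tiny chainable wrapper around an iterable (illustrative only)."""

    def __init__(self, items):
        self._items = items

    def map(self, fn):
        # Each step returns a new Pipeline, so calls can be chained.
        return Pipeline(map(fn, self._items))

    def filter(self, predicate):
        return Pipeline(filter(predicate, self._items))

    def reduce(self, fn, initial):
        return reduce(fn, self._items, initial)

    def to_list(self):
        return list(self._items)


odd_squares = (
    Pipeline(range(1, 6))
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 1)
    .to_list()
)
```

Here `odd_squares` is `[1, 9, 25]`. Because the steps wrap lazy iterators, a given pipeline in this sketch can only be consumed once.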
I used generatedata.com to make up the stats, but am unsure what to do with them.
What are some of the best open-source Data projects? This list will help you: