Top 23 Data Open-Source Projects
-
TanStack Query
🤖 Powerful asynchronous state management, server-state utilities and data fetching for the web. TS/JS, React Query, Solid Query, Svelte Query and Vue Query.
Yeah the post links to this issue: https://github.com/TanStack/query/issues/4933. While it does look like there are some small issues around server components, the bottom of the thread seems to be resolved, and it looks like this won't be an issue going forwards.
-
SheetJS js-xlsx
📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs
Project mention: What kind of Programmer / language should I be looking for? | /r/AskProgramming | 2023-05-04
Sure. I manipulate Excel files programmatically in the browser all the time. I don't really understand your exact workflow, but I use JavaScript with xlsx and React.
-
Metabase
The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:
Project mention: Ask HN: Open-Source Self-Hosted No-Code Platforms? | news.ycombinator.com | 2023-05-13
The solution really depends on what sort of problems you are trying to solve and who your customers are.
There are a fair few low-code solutions out there for reporting and data visualisation that are great for finance and marketing teams for example. e.g. https://metabase.com/ , https://evidence.dev/
For multipurpose SMB workflows and organisational processes, I have used n8n in the recent past and found it was quite good and incredibly easy to maintain. https://n8n.io/engineering-resources/
For enterprise processes I'd go with Camunda (solely based on recommendations and not first hand experience). Although only parts of their platform are OSS https://github.com/camunda
Bear in mind that some of these are not suitable if you want to build something that competes with them while taking their OSS code. But are perfectly fine otherwise.
-
Project mention: [Effortpost] Advanced stats on which players are contributing the most to the Heat's playoff run. | /r/heat | 2023-05-24
To answer these questions I decided to look at 538’s RAPTOR ratings. RAPTOR uses player tracking data to estimate how much each player contributes on the offensive and defensive ends. The total RAPTOR score should be something like the “number of points a player contributes to his team’s offense and defense per 100 possessions, relative to a league-average player.” Higher is better, best during the regular season has been Nikola Jokic at +14. You can read more about it here or play with an interactive tool on their website here. I don’t really care about the details of why it’s a good statistic, but it seems pretty helpful and most importantly for my purposes you can download the data here for free.
-
Project mention: GIS data for a project. I apologize for the banality of my request and for my English. | /r/datasets | 2023-03-15
-
airbyte
Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Project mention: airbyte VS cloudquery - a user suggested alternative | libhunt.com/r/airbyte | 2023-06-02
-
Project mention: Jsonformer: A bulletproof way to generate structured output from LLMs | news.ycombinator.com | 2023-05-02
Something like this should be integrated with library like https://fakerjs.dev/
-
ah, easy. it's because support has not been added into https://github.com/github/semantic which is the tech that powers the GitHub UI. Adding support is pretty easy/mainly glue code [1] that imports the tree sitter API.
[1] https://github.com/github/semantic/blob/793a876ae45d38a6bd17...
-
Bogus
:card_index: A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.
Project mention: Best practices for organising Mock Data & Repositories in Testing | /r/dotnet | 2023-06-02
To do all this you need test data though... and that's where [Bogus](https://github.com/bchavez/Bogus) and Auto Bogus come in handy. They both generate semi-random test data that you can use to populate whatever method you've decided to use. You can set up rules for specific fields, and they have built-in generators for common things like names and addresses. Auto Bogus can be used to populate large/complicated objects with data automatically (it can be slow if you don't use .WithRecursiveDepth() or .WithTreeDepth()).
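The idea of per-field rules producing semi-random test data can be sketched in a few lines. This is a minimal illustration in plain Python, not Bogus's API: `make_faker`, `customer_rules`, and the sample name and street lists are all hypothetical names invented for this example.

```python
import random

# Hypothetical sample values; a real generator ships much larger datasets.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
STREETS = ["Main St", "Oak Ave", "Elm Rd"]


def make_faker(rules, seed=None):
    """Return a function that builds one fake record per call.

    `rules` maps a field name to a function that takes a random.Random
    instance and returns a value for that field.
    """
    rng = random.Random(seed)  # a fixed seed makes the output reproducible

    def generate():
        return {field: rule(rng) for field, rule in rules.items()}

    return generate


# One rule per field, in the spirit of "set up rules for specific fields".
customer_rules = {
    "name": lambda rng: rng.choice(FIRST_NAMES),
    "address": lambda rng: f"{rng.randint(1, 999)} {rng.choice(STREETS)}",
    "age": lambda rng: rng.randint(18, 90),
}

make_customer = make_faker(customer_rules, seed=42)
customers = [make_customer() for _ in range(3)]
```

Seeding the generator is a design choice worth keeping in tests: the data still looks random, but a failing test can be rerun against exactly the same records.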
-
prql
PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
Project mention: Machine Learning, Linear Algebra, and More: Is SQL All You Need? | /r/ProgrammingLanguages | 2023-05-20
Compared to PRQL that someone commented on in the comments, the syntax on that one is a bit hairy
-
Snowplow
The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
Project mention: Open-source data collection & modeling platform for product analytics | dev.to | 2022-08-30
We’ve also thought about Ops :-). There’s a backend 'Collector' that stores data in Postgres, for instance to use while developing locally, or if you want to get set up quickly. But there’s also full integration with Snowplow, which works seamlessly with an existing Snowplow setup as well.
-
machine-learning-roadmap
A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.
7. Machine Learning Roadmap
-
Project mention: Can you recommend a React table components with virtualization and array indices (instead of key-value pairs)? | /r/reactjs | 2023-01-22
We use Tabulator for a similar use case at my job, it has a virtual DOM to support thousands of rows pretty well. It's not React based, it's a vanilla JS library. There is a React wrapper here but I don't have experience with that.
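Row virtualization in general works by rendering only the rows that fall inside the visible viewport, plus a small buffer. The sketch below shows that windowing arithmetic under the assumption of fixed-height rows; it is not Tabulator's implementation, and `visible_range` and its parameters are hypothetical names for illustration.

```python
def visible_range(scroll_top, viewport_height, row_height, total_rows, overscan=2):
    """Return (first, last) row indices to render, with `last` exclusive.

    Assumes every row has the same pixel height. `overscan` adds a few
    extra rows above and below the viewport to reduce flicker on scroll.
    """
    first_visible = scroll_top // row_height
    # Ceiling division: the row containing the bottom edge must be included.
    last_visible = -(-(scroll_top + viewport_height) // row_height)
    first = max(0, first_visible - overscan)
    last = min(total_rows, last_visible + overscan)
    return first, last


# A 300 px viewport over 10,000 rows of 30 px each, scrolled 3,000 px down:
# only a dozen or so rows need to exist in the DOM at any time.
window = visible_range(scroll_top=3000, viewport_height=300,
                       row_height=30, total_rows=10_000)
```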
-
knowledge-repo
A next-generation curated knowledge sharing platform for data scientists and other technical professions.
While it's a start, I feel that just being a markdown editor is not enough; GitHub and GitLab already have this sort of wiki. I feel something like https://github.com/airbnb/knowledge-repo provides a better experience, since it gives an incentive for Data Scientists to make their source notebook well documented, and be a SSoT. With a wiki-like tool, if you change something on the original project, you need to remind yourself to update your reports. If your notebook is in itself your report, that's not necessary. Plus, it would benefit from the Semantic Diffs that DagsHub has already implemented.
-
Project mention: PDF GPT allows you to chat with the contents of your PDF file | /r/ChatGPTPro | 2023-04-12
I would check out https://github.com/Unstructured-IO/unstructured (what LangChain uses) or https://github.com/axa-group/Parsr (probably what unstructured copied to get their startup off the ground lol)
-
Countly
Countly is a product analytics platform that helps teams track, analyze and act on their user actions and behaviour on mobile, web and desktop applications.
I suggest you use Countly; they have a free edition (Community), and it's self-hosted.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Project mention: Data sources episode 2: AWS S3 to Postgres Data Sync using Singer | dev.to | 2023-05-04
Link to original blog: https://www.mage.ai/blog/data-sources-ep-2-aws-s3-to-postgres-data-sync-using-singer
-
browser-compat-data
This repository contains compatibility data for Web technologies as displayed on MDN
Yes, Mozilla’s browser-compat-data (https://github.com/mdn/browser-compat-data) is the authoritative source.
Data related posts
- dg - a fast relational data generator
- Release v0.3.0 · ArroyoSystems/arroyo - Stream Processing Engine
- Best practices for organising Mock Data & Repositories in Testing
- Tigris Standalone Search in beta with demo of email search with Resend
- Will RedditExtractoR be impacted by API changes?
- Building a Fast, Low-Latency Cache with Dozer and PostgreSQL
- Postgres to Postgres incremental (timestamp) sync etl tool
Index
What are some of the best open-source Data projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | TanStack Query | 34,808 |
2 | SheetJS js-xlsx | 32,942 |
3 | Metabase | 32,584 |
4 | SWR | 26,864 |
5 | data | 16,226 |
6 | Prefect | 12,078 |
7 | awesome-bigdata | 12,040 |
8 | airbyte | 10,796 |
9 | chinese-xinhua | 10,089 |
10 | faker | 9,551 |
11 | semantic-source | 8,680 |
12 | Bogus | 7,287 |
13 | prql | 7,172 |
14 | akshare | 6,578 |
15 | Snowplow | 6,473 |
16 | machine-learning-roadmap | 6,434 |
17 | Tabulator | 5,458 |
18 | knowledge-repo | 5,332 |
19 | Parsr | 5,301 |
20 | Countly | 5,167 |
21 | Mage | 4,778 |
22 | altair | 4,711 |
23 | browser-compat-data | 4,503 |