Top 23 Distributed Open-Source Projects

  • tensorflow

    An Open Source Machine Learning Framework for Everyone

    Project mention: Non-determinism in GPT-4 is caused by Sparse MoE | news.ycombinator.com | 2023-08-04

    Right but that's not an inherent GPU determinism issue. It's a software issue.

    https://github.com/tensorflow/tensorflow/issues/3103#issueco... is correct that it's not necessary, it's a choice.

    Your line of reasoning appears to be "GPUs are inherently non-deterministic don't be quick to judge someone's code" which as far as I can tell is dead wrong.

    Admittedly there are some cases and instructions that may result in non-determinism but they are inherently necessary. The author should think carefully before introducing non-determinism. There are many scenarios where it is irrelevant, but ultimately the issue we are discussing here isn't the GPU's fault.

  • Ray

    Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

    Project mention: Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Custom Models | news.ycombinator.com | 2023-08-11

    Training times for GSM8k are mentioned here: https://github.com/ray-project/ray/tree/master/doc/source/te...

  • handson-ml

    ⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

  • Nextcloud

    ☁️ Nextcloud server, a safe home for all your data

    Project mention: Your privacy is optional | dev.to | 2023-09-19

    NextCloud - Once I have my Unraid NAS up and running I will be setting up NextCloud for the whole family. This way I can get my unencrypted files and photos off of services such as Dropbox and iCloud.

  • surrealdb

    A scalable, distributed, collaborative, document-graph database, for the realtime web

    Project mention: How to Design a SurrealDB schema and create a basic client for TypeScript | dev.to | 2023-09-17

    In the midst of a dynamic landscape of exciting new projects, one name shines bright — SurrealDB.

  • Redisson

    Redisson - Easy Redis Java client with features of In-Memory Data Grid. Sync/Async/RxJava/Reactive API. Over 50 Redis based Java objects and services: Set, Multimap, SortedSet, Map, List, Queue, Deque, Semaphore, Lock, AtomicLong, Map Reduce, Bloom filter, Spring Cache, Tomcat, Scheduler, JCache API, Hibernate, RPC, local cache ...

    Project mention: Kotlin Spring WebFlux, R2DBC and Redisson microservice in k8s 👋✨💫 | dev.to | 2022-10-17

    You can find the source code in the GitHub repository. The main idea of this project is to implement a microservice using Kotlin, Spring WebFlux, PostgreSQL, and Redis, with metrics and monitoring, and deploy it to k8s. To interact with PostgreSQL we will use reactive Spring Data R2DBC, and for Redis caching we will use Redisson.

  • TDengine

    TDengine is an open source, high-performance, cloud native time-series database optimized for Internet of Things (IoT), Connected Cars, Industrial IoT and DevOps.

    Project mention: TDengine: NEW Data - star count:21596.0 | /r/algoprojects | 2023-08-06

  • Phoenix

    Peace of mind from prototype to production

    Project mention: Emoji Generator with AI | news.ycombinator.com | 2023-09-08

    Yes! I love Elixir :) [Phoenix LiveView](https://www.phoenixframework.org/) is really amazing. I feel so fast working in it. I got hooked after watching Chris McCord's ['Build a real-time Twitter clone in 15 minutes'](https://www.youtube.com/watch?v=MZvmYaFkNJI&embeds_referring...), and things have improved a lot since then.

  • dgraph

    The high-performance database for modern applications

    Project mention: Is Dgraph dead? (should I continue using it) | news.ycombinator.com | 2023-09-18
  • CNTK

    Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

  • Bit

    A tool for composable software development.

    Project mention: React monorepo with open-source apps and proprietary libs | /r/react | 2023-07-19

    Oh can I address these issues. I already looked at tools like Nx or Bit, but they don't match our needs with closed-source libs.

  • LightGBM

    A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

    Project mention: SIRUS.jl: Interpretable Machine Learning via Rule Extraction | /r/Julia | 2023-06-29

    SIRUS.jl is a pure Julia implementation of the SIRUS algorithm by Bénard et al. (2021). The algorithm is a rule-based machine learning model, meaning that it is fully interpretable. It works by first fitting a random forest and then converting this forest to rules. Furthermore, the algorithm is stable and achieves a predictive performance that is comparable to LightGBM, a state-of-the-art gradient boosting model created by Microsoft. Interpretability, stability, and predictive performance are described in more detail below.

  • diaspora*

    A privacy-aware, distributed, open source social network.

    Project mention: We need a Facebook groups style decentralized alternative. Does one exist? | /r/selfhosted | 2023-07-06
  • nni

    An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

    Project mention: Filter Pruning for PyTorch | /r/deeplearning | 2023-04-13
  • NebulaGraph Database

    A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)

    Project mention: What is a NoSQL Graph Database? | dev.to | 2023-01-09

    A NoSQL graph database is a type of non-relational, distributed database which employs a graph model. NoSQL stands for “Not only SQL” and refers to a new breed of databases that differ from traditional relational databases in their data model and performance. Graph databases are especially useful for data associated with relationships, from friendships on social networks to equipment supply chains or business processes. They can quickly traverse vast amounts of linked data points to discover insights and hidden connections between entities, making them ideal for network analysis, such as financial fraud detection, recommendation engines and many other use cases, all while performing at scale.

  • modin

    Modin: Scale your Pandas workflows by changing a single line of code

    Project mention: The Distributed Tensor Algebra Compiler (2022) | news.ycombinator.com | 2023-06-15
  • optuna

    A hyperparameter optimization framework

    Project mention: FOSS hyperparameter optimization framework to automate hyperparameter search | news.ycombinator.com | 2023-08-10
  • orbitdb

    Peer-to-Peer Databases for the Decentralized Web

    Project mention: OrbitDB reaches version 1.0 after 8 years of development | news.ycombinator.com | 2023-09-19
  • H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

    Project mention: Really struggling with open source models | /r/LocalLLaMA | 2023-07-12

    I would use H2O if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/

  • oceanbase

    OceanBase is an enterprise distributed relational database with high availability, high performance, horizontal scalability, and compatibility with SQL standards.

    Project mention: Show HN: OceanBase – An open-source distributed SQL database written in C++ | news.ycombinator.com | 2023-05-23
  • PowerJob

    Enterprise job scheduling middleware with distributed computing ability.

  • Hazelcast

    Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

    Project mention: Does anyone know any good java implementations for distributed key-value store? | /r/ExperiencedDevs | 2023-06-08

    You're probably looking for Hazelcast here. Note that it does much more than just a distributed k/v, but it will get you where you need to go.

  • scrapy-redis

    Redis-based components for Scrapy.

    Project mention: How to make scrapy run multiple times on the same URLs? | /r/scrapy | 2023-06-26

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-19.

What are some of the best open-source Distributed projects? This list will help you:

#   Project                Stars
1   tensorflow             177,728
2   Ray                    27,697
3   handson-ml             25,038
4   Nextcloud              23,799
5   surrealdb              22,480
6   Redisson               21,760
7   TDengine               21,751
8   Phoenix                19,935
9   dgraph                 19,611
10  CNTK                   17,405
11  Bit                    16,997
12  LightGBM               15,464
13  diaspora*              13,288
14  nni                    13,270
15  NebulaGraph Database   9,474
16  modin                  8,967
17  optuna                 8,639
18  orbitdb                7,811
19  H2O                    6,484
20  oceanbase              6,162
21  PowerJob               5,811
22  Hazelcast              5,556
23  scrapy-redis           5,338