Top 23 Big Data Open-Source Projects

awesome-scalability

6 53,036 6.3

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

Apache Spark

101 38,320 10.0 Scala

Apache Spark - A unified analytics engine for large-scale data processing

Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11

ClickHouse

208 34,153 10.0 C++

ClickHouse® is a free analytics DBMS for big data

Project mention: We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions | news.ycombinator.com | 2024-04-02

Yes, we are working on it! :) Taking some of the learnings from current experimental JSON Object datatype, we are now working on what will become the production-ready implementation. Details here: https://github.com/ClickHouse/ClickHouse/issues/54864
Variant datatype is already available as experimental in 24.1, Dynamic datatype is WIP (PR almost ready), and JSON datatype is next up. Check out the latest comment on that issue with how the Dynamic datatype will work: https://github.com/ClickHouse/ClickHouse/issues/54864#issuec...

data-science-ipython-notebooks

1 26,459 0.0 Python

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Apache Flink

9 23,158 9.9 Java

Apache Flink

Project mention: First 15 Open Source Advent projects | dev.to | 2023-12-15

7. Apache Flink | Github | tutorial

gun

247 17,784 7.2 JavaScript

An open source cybersecurity protocol for syncing decentralized graph data.

Project mention: gun: NEW Data - star count:17470.0 | /r/algoprojects | 2023-10-28

Presto

14 15,591 9.9 Java

The official home of the Presto distributed SQL query engine for big data

Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.

QuestDB

311 13,448 9.7 Java

An open source time-series database for fast ingest and SQL queries

Project mention: How to Forecast Air Temperatures with AI + IoT Sensor Data | dev.to | 2024-03-24

If your data lacks uniform time intervals between consecutive entries, QuestDB offers a solution by allowing you to sample your data. After that, MindsDB facilitates creating, training, and deploying your time-series models.
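
The sampling idea mentioned above can be sketched in plain Python. This is only an illustration of bucketing irregular readings into uniform intervals, not QuestDB's own API; the `sample_by` helper and the sample readings are hypothetical names and values introduced for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sample_by(readings, interval):
    """Average irregularly spaced (timestamp, value) readings into fixed-width buckets.

    Hypothetical helper for illustration; it does not call QuestDB.
    """
    if not readings:
        return []
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(list)
    for ts, value in readings:
        # Floor each timestamp to the start of the bucket it falls in.
        offset = (ts - epoch) // interval
        buckets[epoch + offset * interval].append(value)
    # One averaged row per non-empty bucket, in time order.
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Made-up temperature readings at uneven intervals.
readings = [
    (datetime(2024, 3, 24, 10, 2), 20.0),
    (datetime(2024, 3, 24, 10, 41), 22.0),
    (datetime(2024, 3, 24, 11, 15), 23.0),
    (datetime(2024, 3, 24, 13, 5), 21.0),
]
hourly = sample_by(readings, timedelta(hours=1))
for start, avg in hourly:
    print(start.isoformat(), avg)
```

Buckets with no readings (12:00 in this data) are simply absent from the result; a forecasting pipeline would usually need to decide how to fill such gaps before training.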

Cookbook

21 12,923 7.8

The Data Engineering Cookbook

Project mention: Transition to data engineering | /r/programare | 2023-07-12

Take a look at https://github.com/andkret/Cookbook. The author also has a YouTube channel: https://www.youtube.com/@andreaskayy

kafka-manager

13 11,670 0.0 Scala

CMAK is a tool for managing Apache Kafka clusters

Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17

NebulaGraph Database

8 10,114 8.1 C++

A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)

Trino

44 9,552 10.0 Java

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Project mention: Trino: Fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-03-19

Cython

79 8,912 9.8 Python

The most widely used Python to C compiler

Project mention: Ask HN: C/C++ developer wanting to learn efficient Python | news.ycombinator.com | 2024-04-10

kafka-ui

47 8,458 8.5 Java

Open-Source Web UI for Apache Kafka Management

Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17

starrocks

12 7,764 10.0 Java

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.

Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

catboost

8 7,744 9.9 Python

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05

beam

30 7,508 10.0 Java

Apache Beam is a unified programming model for Batch and Streaming data processing.

Project mention: Ask HN: Does (or why does) anyone use MapReduce anymore? | news.ycombinator.com | 2024-01-24

The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.

delta

69 6,874 9.8 Scala

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.
I think the website is here: https://delta.io

H2O

10 6,730 9.7 Jupyter Notebook

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Project mention: Really struggling with open source models | /r/LocalLLaMA | 2023-07-12

I would use H2O if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/

risingwave

27 6,283 10.0 Rust

Cloud-native SQL stream processing, analytics, and management. KsqlDB and Apache Flink alternative. 🚀 10x more productive. 🚀 10x more cost-efficient.

Project mention: Proton, a fast and lightweight alternative to Apache Flink | news.ycombinator.com | 2024-01-30

How does this compare to RisingWave and Materialize?
https://github.com/risingwavelabs/risingwave

Zeppelin

8 6,261 8.7 Java

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

arkime

13 6,114 9.6 JavaScript

Arkime is an open source, large scale, full packet capturing, indexing, and database system.

pachyderm

8 6,074 9.8 Go

Data-Centric Pipelines and Data Versioning

Project mention: Open Source Advent Fun Wraps Up! | dev.to | 2024-01-05

20. Pachyderm | Github | tutorial

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Big Data related posts

Top 10 Common Data Engineers and Scientists Pain Points in 2024
1 project | dev.to | 11 Apr 2024
Velox: Meta's Unified Execution Engine [pdf]
2 projects | news.ycombinator.com | 25 Mar 2024
Linkedin OpenHouse: Control Plane for Tables in Data Lakehouses
1 project | news.ycombinator.com | 11 Mar 2024
Fair Benchmarking Considered Difficult (2018) [pdf]
2 projects | news.ycombinator.com | 10 Mar 2024
Choosing Between a Streaming Database and a Stream Processing Framework in Python
10 projects | dev.to | 10 Feb 2024
ClickBench – A Benchmark for Analytical DBMS
1 project | news.ycombinator.com | 8 Feb 2024
Why Postgres RDS didn't work for us
4 projects | news.ycombinator.com | 3 Feb 2024

Index

What are some of the best open-source Big Data projects? This list will help you:

#	Project	Stars
1	awesome-scalability	53,036
2	Apache Spark	38,320
3	ClickHouse	34,153
4	data-science-ipython-notebooks	26,459
5	Apache Flink	23,158
6	gun	17,784
7	Presto	15,591
8	QuestDB	13,448
9	Cookbook	12,923
10	kafka-manager	11,670
11	NebulaGraph Database	10,114
12	Trino	9,552
13	Cython	8,912
14	kafka-ui	8,458
15	starrocks	7,764
16	catboost	7,744
17	beam	7,508
18	delta	6,874
19	H2O	6,730
20	risingwave	6,283
21	Zeppelin	6,261
22	arkime	6,114
23	pachyderm	6,074