Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.
Top 23 Big Data Open-Source Projects
-
Project mention: The Patterns of Scalable, Reliable, and Performant Large-Scale Systems | news.ycombinator.com | 2024-12-19
-
Project mention: Unveiling the Apache License 2.0: A Deep Dive into Open Source Freedom | dev.to | 2025-03-11
One of the key attributes of Apache License 2.0 is its flexible nature. Permitting use in both proprietary and open source environments, it has become the go-to choice for innovative projects ranging from the Apache HTTP Server to large-scale initiatives like Apache Spark and Hadoop. This flexibility is not solely legal; it is also philosophical. The license is designed to encourage transparency and maintain a healthy balance between freedom and accountability, ultimately making it easier for developers to adapt and contribute without restrictive legal barriers. Another modern twist discussed in the article is the concept of dual licensing. Dual licensing can offer an attractive method for additional commercial exploitation while still upholding open source principles. However, as the article cautions, dual licensing involves legal intricacy and demands rigor in managing Contributor License Agreements (CLAs), a challenge that the open source community navigates with ongoing debates. For developers looking to understand similar innovative approaches to licensing, further information can be explored at License Token.
-
Project mention: Exposing concurrency bugs with a custom scheduler | news.ycombinator.com | 2025-02-14
It is possible to do this entirely in userspace without a custom scheduler.
See the implementation here: https://github.com/ClickHouse/ClickHouse/blob/master/src/Com...
It works and makes significant improvements for the detection of concurrency bugs, including complex logical races in distributed scenarios.
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
In conclusion, Apache Flink is more than a big data processing tool—it is a thriving ecosystem that exemplifies the power of open source collaboration. From its impressive technical capabilities to its innovative funding model, Apache Flink shows that sustainable software development is possible when community, corporate support, and transparency converge. As industries continue to demand efficient real-time data processing solutions, the future looks bright for Apache Flink. Whether you’re a developer, business analyst, or technology enthusiast, understanding the dynamics behind Apache Flink provides valuable insights into the evolving landscape of open source software. For further exploration of this subject, visit the official Apache Flink website or explore the comprehensive details hosted by the Apache Software Foundation. Stay tuned for more articles that delve into how open source models are shaping the future of technology. Happy coding!
-
Project mention: Using IRIS and Presto for high-performance and scalable SQL queries | dev.to | 2025-01-19
The rise of Big Data projects, real-time self-service analytics, online query services, and social networks, among others, has created demand for massive, high-performance data queries. MPP (massively parallel processing) database technology was created in response to this challenge, and it quickly established itself. Among the open-source MPP options, Presto (https://prestodb.io/) is the best known. It originated at Facebook, where it was used for data analytics, and was later open-sourced. Teradata has since joined the Presto community and now offers support for it.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
NebulaGraph Database
A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)
-
Project mention: I Use Nim Instead of Python for Data Processing | news.ycombinator.com | 2024-09-05
-
quickwit
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
-
starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09
TiDB has been around for a while. It is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks
-
catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Project mention: 🚀 Why Your ML Service Needs Rust + CatBoost: A Setup Guide That Actually Works | dev.to | 2025-01-19
[package]
name = "MLApp"
version = "0.1.0"
edition = "2021"

[dependencies]
catboost = { git = "https://github.com/catboost/catboost", rev = "0bfdc35"}
-
Even ASF does not use Maven to build some of its projects anymore: Beam, Groovy, Lucene, Geode, POI, and Solr are not built with Maven. Those are not the most popular ASF projects, I know, but still, it is something.
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27
When it comes to stream processing systems, Iceberg support varies across vendors. Databricks, which oversees Spark Streaming, focuses on Delta Lake. Apache Flink, heavily influenced by Alibaba’s contributions, promotes Paimon, an alternative to Iceberg. RisingWave, on the other hand, fully embraces Iceberg. Rather than focusing solely on one table format, RisingWave aims to support various catalog services, including AWS Glue Catalog, Polaris, and Unity Catalog.
-
Project mention: Simplifying SQL function implementation with Rust procedural macro | dev.to | 2025-03-13
Then, utilize declarative macros to generate various types of kernel functions, including functions with 1, 2, and 3 parameters, as well as the input/output combinations of T and Option. Common kernels like unary, binary, ternary, unary_nullable and unary_bytes are generated, partially addressing the last two issues. (For the implementation details, see RisingWave's earlier code.) Theoretically, type exercise could also be used here. For example, introducing a trait to unify (A,), (A, B) and (A, B, C), or utilizing traits of Into and AsRef to unify T, Option, and Result, etc. However, you will probably encounter some type challenges posed by rustc.
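The idea of generating kernels with declarative macros can be sketched roughly as follows. This is a minimal illustration, not RisingWave's actual code: the macro names `unary_kernel!` and `binary_kernel!`, the generated functions `neg`, `is_null`, and `add`, and the use of `&[Option<T>]` slices as a stand-in for columns are all assumptions made for the example.

```rust
// Hypothetical sketch: columns are modeled as slices of Option<T>,
// where None stands for SQL NULL. None of these names come from RisingWave.

/// Generates a one-argument kernel.
/// - `nullable` arm: the closure sees Option<T> and decides how to treat NULL.
/// - default arm: NULL inputs are propagated to NULL outputs automatically.
macro_rules! unary_kernel {
    ($name:ident, nullable $input:ty => $output:ty, $f:expr) => {
        pub fn $name(col: &[Option<$input>]) -> Vec<Option<$output>> {
            col.iter().copied().map(|v| ($f)(v)).collect()
        }
    };
    ($name:ident, $input:ty => $output:ty, $f:expr) => {
        pub fn $name(col: &[Option<$input>]) -> Vec<Option<$output>> {
            col.iter().copied().map(|v| v.map($f)).collect()
        }
    };
}

/// Generates a two-argument kernel that returns NULL if either input is NULL.
macro_rules! binary_kernel {
    ($name:ident, ($a:ty, $b:ty) => $output:ty, $f:expr) => {
        pub fn $name(lhs: &[Option<$a>], rhs: &[Option<$b>]) -> Vec<Option<$output>> {
            lhs.iter()
                .copied()
                .zip(rhs.iter().copied())
                .map(|(l, r)| match (l, r) {
                    (Some(l), Some(r)) => Some(($f)(l, r)),
                    _ => None,
                })
                .collect()
        }
    };
}

// Each invocation expands to an ordinary function over whole columns.
unary_kernel!(neg, i64 => i64, |x: i64| -x);
unary_kernel!(is_null, nullable i64 => bool, |v: Option<i64>| Some(v.is_none()));
binary_kernel!(add, (i64, i64) => i64, |a: i64, b: i64| a + b);
```

With these definitions, `neg(&[Some(1), None])` propagates the NULL, while `is_null` receives the `Option` directly and can return a non-NULL result for a NULL input. A ternary kernel would follow the same pattern with a third column argument.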
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
Remote: Yes
Willing to relocate: Yes
Technologies: Rust, Nodejs, Javascript, Typescript, Golang
Résumé/CV: https://drive.google.com/drive/folders/1ecTn700lcmt8cqlnBTtm...
Email: [email protected]
Github: https://github.com/jatin510
Info: Hi, I'm Jagdish Parihar! A Backend Engineer with 4+ years of experience building scalable systems and microservices using Rust, Node.js, and Golang. I've contributed to open-source projects like Apache DataFusion and thrive on solving complex backend challenges.
I'm exploring opportunities to work at database-focused startups and am looking for a way in as an engineer who works on databases. I currently contribute to open source and am looking for part-time or full-time work with databases.
Datafusion contributions: https://github.com/apache/datafusion/pulls?q=is%3Apr+author%...
Datafusion comet contributions: https://github.com/apache/datafusion-comet/pulls?q=is%3Apr+a...
Let’s connect!
-
Project mention: BM25 in PostgreSQL – 3x Faster Than Elasticsearch | news.ycombinator.com | 2025-03-02
Any comparison results in terms of performance vs accuracy with: https://github.com/paradedb/paradedb/tree/dev/pg_search
-
Big Data discussion
Big Data related posts
-
Show HN: Kafbat UI for Apache Kafka v1.2 is out
-
Beginner’s Guide to Contributing to GitHub Open Source Projects
-
Show HN: OpenTimes – Free travel times between U.S. Census geographies
-
Show HN: Hydra – serverless realtime analytics on Postgres
-
Exploring the Power and Community Behind Apache Flink
-
The two versions of Parquet
-
Ask HN: Going beyond Pandas for analysis, how to stay sane?
-
A note from our sponsor - CodeRabbit
coderabbit.ai | 22 Mar 2025
Index
What are some of the best open-source Big Data projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | awesome-scalability | 60,961 |
2 | Apache Spark | 40,735 |
3 | ClickHouse | 39,645 |
4 | data-science-ipython-notebooks | 27,993 |
5 | Apache Flink | 24,649 |
6 | gun | 18,336 |
7 | Presto | 16,272 |
8 | Cookbook | 14,134 |
9 | kafka-manager | 11,862 |
10 | NebulaGraph Database | 11,149 |
11 | Trino | 10,996 |
12 | kafka-ui | 10,482 |
13 | Cython | 9,861 |
14 | quickwit | 9,805 |
15 | starrocks | 9,701 |
16 | catboost | 8,300 |
17 | beam | 8,040 |
18 | delta | 7,892 |
19 | risingwave | 7,534 |
20 | H2O | 7,072 |
21 | datafusion | 6,912 |
22 | paradedb | 6,841 |
23 | arkime | 6,557 |