Big Data

Open-source projects categorized as Big Data

Top 23 Big Data Open-Source Projects

  1. awesome-scalability

    The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

    Project mention: The Patterns of Scalable, Reliable, and Performant Large-Scale Systems | news.ycombinator.com | 2024-12-19
  3. Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Unveiling the Apache License 2.0: A Deep Dive into Open Source Freedom | dev.to | 2025-03-11

    One of the key attributes of Apache License 2.0 is its flexible nature. Permitting use in both proprietary and open source environments, it has become the go-to choice for innovative projects ranging from the Apache HTTP Server to large-scale initiatives like Apache Spark and Hadoop. This flexibility is not solely legal; it is also philosophical. The license is designed to encourage transparency and maintain a healthy balance between freedom and accountability, ultimately making it easier for developers to adapt and contribute without restrictive legal barriers. Another modern twist discussed in the article is the concept of dual licensing. Dual licensing can offer an attractive method for additional commercial exploitation while still upholding open source principles. However, as the article cautions, dual licensing involves legal intricacy and demands rigor in managing Contributor License Agreements (CLAs), a challenge that the open source community navigates with ongoing debates. For developers looking to understand similar innovative approaches to licensing, further information can be explored at License Token.

  4. ClickHouse

    ClickHouse® is a real-time analytics database management system

    Project mention: Exposing concurrency bugs with a custom scheduler | news.ycombinator.com | 2025-02-14

    It is possible to do this entirely in userspace without a custom scheduler.

    See the implementation here: https://github.com/ClickHouse/ClickHouse/blob/master/src/Com...

    It works and makes significant improvements for the detection of concurrency bugs, including complex logical races in distributed scenarios.

  5. data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  6. gun

    An open source cybersecurity protocol for syncing decentralized graph data.

  7. Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Using IRIS and Presto for high-performance and scalable SQL queries | dev.to | 2025-01-19

    The rise of Big Data projects, real-time self-service analytics, online query services, and social networks, among others, has created demand for massive, high-performance data queries. In response, MPP (massively parallel processing) database technology was created, and it quickly established itself. Among the open-source MPP options, Presto (https://prestodb.io/) is the best known. It originated at Facebook, where it was used for data analytics, and was later open-sourced. Teradata has since joined the Presto community and now offers support.

  9. Cookbook

    The Data Engineering Cookbook

  10. kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

  11. NebulaGraph Database

    A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)

  12. Trino

    Official repository of Trino, the distributed SQL query engine for big data, former

    Project mention: Apache Iceberg | news.ycombinator.com | 2025-01-25
  13. kafka-ui

    Open-Source Web UI for Apache Kafka Management

    Project mention: How to Get Remote Code Execution in Kafka UI | news.ycombinator.com | 2024-07-22
  14. Cython

    The most widely used Python to C compiler

    Project mention: I Use Nim Instead of Python for Data Processing | news.ycombinator.com | 2024-09-05
  15. quickwit

    Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

    Project mention: Quickwit Joins Datadog | news.ycombinator.com | 2025-01-09
  16. starrocks

    The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

    Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

    TiDB has been around for a while; it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb

    Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

  17. catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

    Project mention: 🚀 Why Your ML Service Needs Rust + CatBoost: A Setup Guide That Actually Works | dev.to | 2025-01-19

    [package]
    name = "MLApp"
    version = "0.1.0"
    edition = "2021"

    [dependencies]
    catboost = { git = "https://github.com/catboost/catboost", rev = "0bfdc35"}

  18. beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

    Project mention: No SNAPSHOTs | dev.to | 2024-07-30

    Even ASF does not use Maven to build some of its projects anymore: Beam, Groovy, Lucene, Geode, POI, and Solr are not built with Maven. Those are not the most popular ASF projects, I know, but still, it is something.

  19. delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27

    When it comes to stream processing systems, Iceberg support varies across vendors. Databricks, which oversees Spark Streaming, focuses on Delta Lake. Apache Flink, heavily influenced by Alibaba’s contributions, promotes Paimon, an alternative to Iceberg. RisingWave, on the other hand, fully embraces Iceberg. Rather than focusing solely on one table format, RisingWave aims to support various catalog services, including AWS Glue Catalog, Polaris, and Unity Catalog.

  20. risingwave

    Stream processing and management platform.

    Project mention: Simplifying SQL function implementation with Rust procedural macro | dev.to | 2025-03-13

    Then, utilize declarative macros to generate various types of kernel functions, including functions with 1, 2, and 3 parameters, as well as the input/output combinations of T and Option<T>. Common kernels like unary, binary, ternary, unary_nullable and unary_bytes are generated, partially addressing the last two issues. (For the implementation details, see RisingWave's earlier code.) Theoretically, type exercise could also be used here. For example, introducing a trait to unify (A,), (A, B) and (A, B, C), or utilizing traits of Into and AsRef to unify T, Option<T>, and Result<T>, etc. However, you will probably encounter some type challenges posed by rustc.

  21. H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

  22. datafusion

    Apache DataFusion SQL Query Engine

    Project mention: Ask HN: Who wants to be hired? (February 2025) | news.ycombinator.com | 2025-02-03

    Remote: Yes

    Willing to relocate: Yes

    Technologies: Rust, Nodejs, Javascript, Typescript, Golang

    Résumé/CV: https://drive.google.com/drive/folders/1ecTn700lcmt8cqlnBTtm...

    Github: https://github.com/jatin510

    Info: Hi, I'm Jagdish Parihar! A Backend Engineer with 4+ years of experience building scalable systems and microservices using Rust, Node.js, and Golang. I've contributed to open-source projects like Apache DataFusion and thrive on solving complex backend challenges.

    I'm exploring opportunities at database startups and am looking for an entry point to become an engineer working on databases. I currently contribute to open source and am looking for part-time or full-time work with databases.

    Datafusion contributions: https://github.com/apache/datafusion/pulls?q=is%3Apr+author%...

    Datafusion comet contributions: https://github.com/apache/datafusion-comet/pulls?q=is%3Apr+a...

    Let’s connect!

  23. paradedb

    Postgres for Search and Analytics

    Project mention: BM25 in PostgreSQL – 3x Faster Than Elasticsearch | news.ycombinator.com | 2025-03-02

    Any comparison results in terms of performance vs accuracy with: https://github.com/paradedb/paradedb/tree/dev/pg_search

  24. arkime

    Arkime is an open source, large scale, full packet capturing, indexing, and database system.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Big Data related posts

  • Show HN: Kafbat UI for Apache Kafka v1.2 is out

    1 project | news.ycombinator.com | 21 Mar 2025
  • Beginner’s Guide to Contributing to GitHub Open Source Projects

    2 projects | dev.to | 21 Mar 2025
  • Show HN: OpenTimes – Free travel times between U.S. Census geographies

    3 projects | news.ycombinator.com | 17 Mar 2025
  • Show HN: Hydra – serverless realtime analytics on Postgres

    1 project | news.ycombinator.com | 12 Mar 2025
  • Exploring the Power and Community Behind Apache Flink

    2 projects | dev.to | 6 Mar 2025
  • The two versions of Parquet

    3 projects | dev.to | 19 Feb 2025
  • Ask HN: Going beyond Pandas for analysis, how to stay sane?

    1 project | news.ycombinator.com | 14 Feb 2025

Index

What are some of the best open-source Big Data projects? This list will help you:

#   Project                          Stars
1   awesome-scalability              60,961
2   Apache Spark                     40,735
3   ClickHouse                       39,645
4   data-science-ipython-notebooks   27,993
5   Apache Flink                     24,649
6   gun                              18,336
7   Presto                           16,272
8   Cookbook                         14,134
9   kafka-manager                    11,862
10  NebulaGraph Database             11,149
11  Trino                            10,996
12  kafka-ui                         10,482
13  Cython                            9,861
14  quickwit                          9,805
15  starrocks                         9,701
16  catboost                          8,300
17  beam                              8,040
18  delta                             7,892
19  risingwave                        7,534
20  H2O                               7,072
21  datafusion                        6,912
22  paradedb                          6,841
23  arkime                            6,557
