Java Spark

Open-source Java projects categorized as Spark

Top 23 Java Spark Projects

  1. Deeplearning4j

    Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...

    Project mention: Deeplearning4j Suite Overview | news.ycombinator.com | 2024-03-29
  3. doris

    Apache Doris is an easy-to-use, high performance and unified analytics database.

    Project mention: Apache Doris: open-source data warehouse for real time data analytics | news.ycombinator.com | 2024-10-26
  4. Alluxio (formerly Tachyon)

    Alluxio, data orchestration for analytics and machine learning in the cloud

  5. Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: Serverless Data Processing on AWS : AWS Project | dev.to | 2024-11-13

    To do so, we will use Kinesis Data Analytics to run an Apache Flink application. To enhance our development experience, we will use Studio notebooks for Kinesis Data Analytics that are powered by Apache Zeppelin.

  6. RoaringBitmap

    A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

    Project mention: Roaring Bitmap Compression | news.ycombinator.com | 2024-11-08

    There's actually a whole website about it! I found it useful when I was doing deeper research into Elasticsearch: https://roaringbitmap.org

  7. linkis

    Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

  8. paimon

    Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

    Project mention: Apache iceberg the Hadoop of the modern-data-stack? | news.ycombinator.com | 2025-03-06
  10. LakeSoul

    LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

  11. elassandra

    Elassandra = Elasticsearch + Apache Cassandra

  12. nessie

    Nessie: Transactional Catalog for Data Lakes with Git-like semantics

    Project mention: Polaris Catalog: An Open Source Catalog for Apache Iceberg | news.ycombinator.com | 2024-06-03
  13. kylo

    Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

  14. zingg

    Scalable identity resolution, entity resolution, data mastering and deduplication using ML

  15. Sparkler

    Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

  16. uniffle

    Uniffle is a high performance, general purpose Remote Shuffle Service.

    Project mention: Apache Uniffle: high performance, general purpose remote shuffle service | news.ycombinator.com | 2024-03-19
  17. spark-bigquery-connector

    BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

  18. dataCompare

    Big data comparison and data profiling platform: low-code data comparison and data profiling

  19. rumble

    ⛈️ RumbleDB 1.22.0 "Pyrenean oak" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more (by RumbleDB)

  20. incubator-wayang

    Apache Wayang (incubating) is the first cross-platform data processing system.

    Project mention: Show HN: Apache Wayang supports now Kafka | news.ycombinator.com | 2024-11-04
  21. batch-processing-gateway

    The gateway component to make Spark on K8s much easier for Spark users.

  22. big-data-pipeline-lambda-arch

    A full big data pipeline (Lambda Architecture) with Spark, Kafka, HDFS and Cassandra.

  23. hadoopcryptoledger

    Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

  24. lighter

    REST API for Apache Spark on K8S or YARN

  25. squashql

    Official repository of SquashQL, the SQL query engine for multi-dimensional and hierarchical analysis that empowers your SQL database

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Java Spark related posts

  • Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead

    7 projects | dev.to | 27 Jan 2025
  • Apache Zeppelin

    6 projects | news.ycombinator.com | 2 Sep 2024
  • Polaris Catalog: An Open Source Catalog for Apache Iceberg

    1 project | news.ycombinator.com | 3 Jun 2024
  • Apache Uniffle: high performance, general purpose remote shuffle service

    1 project | news.ycombinator.com | 19 Mar 2024
  • A deep dive into the concept and world of Apache Iceberg Catalogs

    1 project | dev.to | 1 Mar 2024
  • Five Apache projects you probably didn't know about

    8 projects | dev.to | 21 Dec 2023
  • Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog

    4 projects | dev.to | 18 Dec 2023

Index

What are some of the best open-source Spark projects in Java? This list will help you:

#   Project                          Stars
1   Deeplearning4j                   13,847
2   doris                            13,286
3   Alluxio (formerly Tachyon)       6,944
4   Zeppelin                         6,467
5   RoaringBitmap                    3,631
6   linkis                           3,343
7   paimon                           2,669
8   LakeSoul                         2,629
9   elassandra                       1,714
10  nessie                           1,154
11  kylo                             1,111
12  zingg                            984
13  Sparkler                         411
14  uniffle                          402
15  spark-bigquery-connector         388
16  dataCompare                      261
17  rumble                           222
18  incubator-wayang                 219
19  batch-processing-gateway         186
20  big-data-pipeline-lambda-arch    176
21  hadoopcryptoledger               139
22  lighter                          95
23  squashql                         53

Did you know that Java is the 8th most popular programming language based on number of references?