Top 23 Java Spark Projects
-
Deeplearning4j
Suite of tools for deploying and training deep learning models on the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a small, modular C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
-
doris
Project mention: Apache Doris: open-source data warehouse for real time data analytics | news.ycombinator.com | 2024-10-26
-
Alluxio (formerly Tachyon)
Alluxio, data orchestration for analytics and machine learning in the cloud
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
To do so, we will use Kinesis Data Analytics to run an Apache Flink application. To enhance our development experience, we will use Studio notebooks for Kinesis Data Analytics that are powered by Apache Zeppelin.
-
RoaringBitmap
A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others
There's actually a whole website about it! I found it useful when I was doing deeper research into Elasticsearch: https://roaringbitmap.org
-
linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
-
paimon
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Project mention: Apache iceberg the Hadoop of the modern-data-stack? | news.ycombinator.com | 2025-03-06
-
LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
-
-
nessie
Project mention: Polaris Catalog: An Open Source Catalog for Apache Iceberg | news.ycombinator.com | 2024-06-03
-
kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
-
-
-
uniffle
Project mention: Apache Uniffle: high performance, general purpose remote shuffle service | news.ycombinator.com | 2024-03-19
-
spark-bigquery-connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
-
dataCompare
Big data comparison and data profiling platform: low-code data comparison and data profiling
-
rumble
⛈️ RumbleDB 1.22.0 "Pyrenean oak" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more (by RumbleDB)
-
-
-
big-data-pipeline-lambda-arch
A full big data pipeline (Lambda Architecture) with Spark, Kafka, HDFS and Cassandra.
-
hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
-
-
squashql
Official repository of SquashQL, the SQL query engine for multi-dimensional and hierarchical analysis that empowers your SQL database
Java Spark related posts
-
Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead
-
Apache Zeppelin
-
Polaris Catalog: An Open Source Catalog for Apache Iceberg
-
Apache Uniffle: high performance, general purpose remote shuffle service
-
A deep dive into the concept and world of Apache Iceberg Catalogs
-
Five Apache projects you probably didn't know about
-
Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog
Index
What are some of the best open-source Spark projects in Java? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Deeplearning4j | 13,847 |
| 2 | doris | 13,286 |
| 3 | Alluxio (formerly Tachyon) | 6,944 |
| 4 | Zeppelin | 6,467 |
| 5 | RoaringBitmap | 3,631 |
| 6 | linkis | 3,343 |
| 7 | paimon | 2,669 |
| 8 | LakeSoul | 2,629 |
| 9 | elassandra | 1,714 |
| 10 | nessie | 1,154 |
| 11 | kylo | 1,111 |
| 12 | zingg | 984 |
| 13 | Sparkler | 411 |
| 14 | uniffle | 402 |
| 15 | spark-bigquery-connector | 388 |
| 16 | dataCompare | 261 |
| 17 | rumble | 222 |
| 18 | incubator-wayang | 219 |
| 19 | batch-processing-gateway | 186 |
| 20 | big-data-pipeline-lambda-arch | 176 |
| 21 | hadoopcryptoledger | 139 |
| 22 | lighter | 95 |
| 23 | squashql | 53 |