Top 23 Java Spark Projects

Deeplearning4j

13 13,424 6.5 Java

Suite of tools for deploying and training deep learning models on the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a small, modular C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.

Project mention: Deeplearning4j Suite Overview | news.ycombinator.com | 2024-03-29

doris

42 11,314 10.0 Java

Apache Doris is an easy-to-use, high performance and unified analytics database.

Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27

As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization.

Alluxio (formerly Tachyon)

0 6,631 9.7 Java

Alluxio, data orchestration for analytics and machine learning in the cloud

Zeppelin

8 6,263 8.7 Java

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

RoaringBitmap

24 3,388 8.5 Java

A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

Project mention: Iterating over Bit Sets Quickly | news.ycombinator.com | 2024-02-24

I was recently reading about Roaring https://roaringbitmap.org/ which is a highly optimized compressed bitset implementation. I recommend reading about it if you are interested in this sort of thing. The talk at https://roaringbitmap.org/talks/ is especially good.
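For readers new to bitsets, the basic operations under discussion (intersecting two sets and iterating over the set bits) can be sketched with the JDK's own `java.util.BitSet`. This is only an illustration under that assumption: `BitSet` is uncompressed, the class name `BitSetSketch` and its helpers are hypothetical, and none of this is the RoaringBitmap API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative only: uses the JDK's uncompressed BitSet, not the RoaringBitmap API.
final class BitSetSketch {

    // Builds a bitset with the given non-negative positions set.
    static BitSet of(int... positions) {
        BitSet bits = new BitSet();
        for (int p : positions) {
            bits.set(p);
        }
        return bits;
    }

    // Returns a new bitset holding the positions present in both inputs.
    static BitSet intersect(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    // Collects set positions in ascending order by jumping from one set bit to the next.
    static List<Integer> positions(BitSet bits) {
        List<Integer> out = new ArrayList<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet a = of(1, 5, 64, 1000);
        BitSet b = of(5, 64, 999);
        System.out.println(positions(intersect(a, b))); // prints [5, 64]
    }
}
```

The `nextSetBit` loop skips runs of zero bits instead of testing every position, which is the kind of iteration the linked discussion is concerned with.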

linkis

2 3,227 9.5 Java

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

LakeSoul

21 2,301 9.3 Java

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

paimon

1 1,907 9.9 Java

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

Project mention: Open Source Advent Fun Wraps Up! | dev.to | 2024-01-05

18. Apache Paimon | Github | tutorial

elassandra

1 1,708 0.0 Java

Elassandra = Elasticsearch + Apache Cassandra

kylo

1 1,091 10.0 Java

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

Project mention: GitHub – GSA/code-gov: An informative repo for all Code.gov repos | news.ycombinator.com | 2023-09-09

https://github.com/simonw/datasette-lite :
> You can use this tool to open any SQLite database file that is hosted online and served with a `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.
> [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you
> To load a Parquet file, pass a URL to `?parquet=`
> [...] https://lite.datasette.io/?parquet=https://github.com/Terada...*
There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette. E.g. Pandas with `dtype_backend='arrow'` saves to Parquet.
datasette plugins are written in Python and/or JS w/ pluggy:

zingg

23 877 9.3 Java

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

nessie

13 831 9.9 Java

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

Project mention: A deep dive into the concept and world of Apache Iceberg Catalogs | dev.to | 2024-03-01

Nessie is an innovative open-source catalog that extends beyond the traditional catalog capabilities in the Apache Iceberg ecosystem, introducing git-like features to data management. This catalog not only tracks table metadata but also allows users to capture commits at a holistic level, enabling advanced operations such as multi-table transactions, rollbacks, branching, and tagging. These features provide a new layer of flexibility and control over data changes, resembling version control systems in software development.
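The git-like ideas described above (branches, commits that span several tables, tags, and rollbacks) can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class `ToyCatalog`, its methods, and the string "metadata pointers" are hypothetical and are not the Nessie API or its storage model.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model of git-like catalog semantics; hypothetical names, not the Nessie API.
final class ToyCatalog {

    // Immutable commit: a snapshot of table name -> metadata pointer, plus its parent.
    static final class Commit {
        final Commit parent;
        final Map<String, String> tables;

        Commit(Commit parent, Map<String, String> tables) {
            this.parent = parent;
            this.tables = Collections.unmodifiableMap(new HashMap<>(tables));
        }
    }

    // Branch and tag names both map to a commit (this toy does not stop a tag from moving).
    private final Map<String, Commit> refs = new HashMap<>();

    ToyCatalog() {
        refs.put("main", new Commit(null, new HashMap<>()));
    }

    // Creates a branch or tag pointing at the same commit as an existing ref.
    void createRef(String name, String from) {
        refs.put(name, requireRef(from));
    }

    // Applies changes to several tables as a single commit on one branch.
    void commit(String branch, Map<String, String> changes) {
        Commit head = requireRef(branch);
        Map<String, String> next = new HashMap<>(head.tables);
        next.putAll(changes);
        refs.put(branch, new Commit(head, next));
    }

    // Moves a branch back to its parent commit, undoing the latest commit as a whole.
    void rollback(String branch) {
        Commit head = requireRef(branch);
        if (head.parent == null) {
            throw new IllegalStateException("nothing to roll back on " + branch);
        }
        refs.put(branch, head.parent);
    }

    // Returns the metadata pointer of a table as seen from a ref, or null if absent.
    String lookup(String ref, String table) {
        return requireRef(ref).tables.get(table);
    }

    private Commit requireRef(String name) {
        Commit c = refs.get(name);
        if (c == null) {
            throw new IllegalArgumentException("unknown ref: " + name);
        }
        return c;
    }

    public static void main(String[] args) {
        ToyCatalog catalog = new ToyCatalog();
        catalog.createRef("etl", "main");
        // One commit changes two tables together on the "etl" branch.
        catalog.commit("etl", Map.of("orders", "v1", "customers", "v1"));
        System.out.println(catalog.lookup("etl", "orders"));  // prints v1
        System.out.println(catalog.lookup("main", "orders")); // prints null
        catalog.rollback("etl");
        System.out.println(catalog.lookup("etl", "orders"));  // prints null
    }
}
```

Because each commit is an immutable snapshot of all table pointers, a rollback or a new branch is only a change to which commit a name refers to, which is what makes multi-table operations atomic in this sketch.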

Sparkler

0 409 3.0 Java

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

incubator-uniffle

3 354 9.5 Java

Uniffle is a high performance, general purpose Remote Shuffle Service.

Project mention: Apache Uniffle: high performance, general purpose remote shuffle service | news.ycombinator.com | 2024-03-19

spark-bigquery-connector

2 348 9.0 Java

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

dataCompare

1 234 3.7 Java

Big data comparison and data profiling platform: low-code data comparison and data profiling

rumble

1 207 7.8 Java

⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more (by RumbleDB)

batch-processing-gateway

1 167 5.6 Java

The gateway component to make Spark on K8s much easier for Spark users.

incubator-wayang

18 167 9.4 Java

Apache Wayang (incubating) is the first cross-platform data processing system.

Project mention: Support different jdbc platforms and multiple instances of same DBMS | /r/ApacheWayang | 2023-12-05

big-data-pipeline-lambda-arch

1 161 1.5 Java

A full big data pipeline (Lambda Architecture) with Spark, Kafka, HDFS and Cassandra.

hadoopcryptoledger

7 141 1.8 Java

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

lighter

2 79 9.7 Java

REST API for Apache Spark on K8S or YARN

squashql

3 42 9.3 Java

Official repository of SquashQL, the SQL query engine for multi-dimensional and hierarchical analysis that empowers your SQL database

Project mention: Show HN: SQL Polyglot | news.ycombinator.com | 2023-12-16

I am building a SQL-like query engine with a TypeScript query builder. It aims to provide multidimensional query capabilities (think of an OLAP cube). The query engine is compatible with several databases (DuckDB, ClickHouse, BigQuery, Snowflake...). Our users are developers building decision-making applications. For instance, they use DuckDB to develop their application locally and a cloud provider in production. Here's the link: https://github.com/squashql/squashql. Feel free to ask me any questions.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Java Spark related posts

Apache Uniffle: high performance, general purpose remote shuffle service
1 project | news.ycombinator.com | 19 Mar 2024
A deep dive into the concept and world of Apache Iceberg Catalogs
1 project | dev.to | 1 Mar 2024
Five Apache projects you probably didn't know about
8 projects | dev.to | 21 Dec 2023
Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog
4 projects | dev.to | 18 Dec 2023
Apache Uniffle: a high performance remote shuffle service for Spark
1 project | news.ycombinator.com | 2 Oct 2023
Why is Hive Metastore everywhere? (Especially Iceberg)
1 project | /r/dataengineering | 30 Jun 2023
What is the best approach to removing duplicate person records if the only identifier is person firstname middle name and last name? These names are entered in varying ways to the DB, thus they are free-formatted.
2 projects | /r/SQL | 25 Mar 2023

Index

What are some of the best open-source Spark projects in Java? This list will help you:

#	Project	Stars
1	Deeplearning4j	13,424
2	doris	11,314
3	Alluxio (formerly Tachyon)	6,631
4	Zeppelin	6,263
5	RoaringBitmap	3,388
6	linkis	3,227
7	LakeSoul	2,301
8	paimon	1,907
9	elassandra	1,708
10	kylo	1,091
11	zingg	877
12	nessie	831
13	Sparkler	409
14	incubator-uniffle	354
15	spark-bigquery-connector	348
16	dataCompare	234
17	rumble	207
18	batch-processing-gateway	167
19	incubator-wayang	167
20	big-data-pipeline-lambda-arch	161
21	hadoopcryptoledger	141
22	lighter	79
23	squashql	42