Top 23 Java Big Data Projects

Apache Flink

9 23,128 9.9 Java

Apache Flink

Project mention: First 15 Open Source Advent projects | dev.to | 2023-12-15

7. Apache Flink | Github | tutorial
Presto

14 15,582 9.9 Java

The official home of the Presto distributed SQL query engine for big data

Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes fair bit of work to implement it for all the different backends.
WorkOS

workos.com
sponsored

The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
QuestDB

311 13,420 9.7 Java

An open source time-series database for fast ingest and SQL queries

Project mention: How to Forecast Air Temperatures with AI + IoT Sensor Data | dev.to | 2024-03-24

If your data lacks uniform time intervals between consecutive entries, QuestDB offers a solution by allowing you to sample your data. After that, MindsDB facilitates creating, training, and deploying your time-series models.
Trino

44 9,519 10.0 Java

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Project mention: Trino: Fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-03-19
kafka-ui

47 8,416 8.9 Java

Open-Source Web UI for Apache Kafka Management

Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17
starrocks

12 7,726 10.0 Java

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.

Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks
beam

30 7,477 10.0 Java

Apache Beam is a unified programming model for Batch and Streaming data processing.

Project mention: Ask HN: Does (or why does) anyone use MapReduce anymore? | news.ycombinator.com | 2024-01-24

The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.
InfluxDB

www.influxdata.com
sponsored

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
Zeppelin

8 6,261 8.7 Java

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.
Hazelcast

7 5,853 9.9 Java

Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

Project mention: Does anyone know any good java implementations for distributed key-value store? | /r/ExperiencedDevs | 2023-06-08

You're probably looking for Hazelcast here. Note that it does much more than just a distributed k/v, but it will get you where you need to go.
vespa

4 5,323 10.0 Java

AI + Data, online. https://vespa.ai

Project mention: Top 10 Best Vector Databases & Libraries | dev.to | 2023-04-19

Vespa(4.3k ⭐) → A fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query. Integrated machine-learned model inference allows you to apply AI to make sense of your data in real time.
Apache Hive

14 5,320 9.6 Java

Apache Hive
Apache Ignite

3 4,675 9.6 Java

Apache Ignite (by apache)
Apache Calcite

28 4,344 9.0 Java

Apache Calcite

Project mention: Data diffs: Algorithms for explaining what changed in a dataset (2022) | news.ycombinator.com | 2023-07-26

> Make diff work on more than just SQLite.
Another way of doing this that I've been wanting to do for a while is to implement the DIFF operator in Apache Calcite[0]. Using Calcite, DIFF could be implemented as rewrite rules to generate the appropriate SQL to be directly executed against the database or the DIFF operator can be implemented outside of the database (which the original paper shows is more efficient).
[0] https://calcite.apache.org/
iotdb

3 4,249 9.9 Java

Apache IoTDB
Crate

6 3,952 9.9 Java

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

Project mention: FLaNK AI - 01 April 2024 | dev.to | 2024-04-01
fastjson2

2 3,411 9.8 Java

🚄 FASTJSON2 is a Java JSON library with excellent performance.

Project mention: FLaNK Stack Weekly for 20 June 2023 | dev.to | 2023-06-20
Flume

2 2,503 5.0 Java

Mirror of Apache Flume
Apache Parquet

4 2,398 9.2 Java

Apache Parquet
LakeSoul

21 2,294 9.3 Java

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
Apache Drill

9 1,891 8.2 Java

Apache Drill is a distributed MPP query layer for self describing data (by apache)

Project mention: Git Query Language (GQL) Aggregation Functions, Groups, Alias | /r/ProgrammingLanguages | 2023-06-30

Also are you familiar with apache drill . The idea is to put an SQL interpreter in front of any kind of database just like you are doing for git here.
bookkeeper

3 1,845 9.1 Java

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
paimon

1 1,792 9.9 Java

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

Project mention: Open Source Advent Fun Wraps Up! | dev.to | 2024-01-05

18. Apache Paimon | Github | tutorial
parquet-format

4 1,633 7.4 Java

Apache Parquet

Project mention: Summing columns in remote Parquet files using DuckDB | news.ycombinator.com | 2023-11-16

Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format
The offsets of said metadata are well-defined (i.e. in the footer) so for S3 / blob storage so long as you can efficiently request a range of bytes you can pull the metadata without having to read all the data.
SaaSHub

www.saashub.com
sponsored

SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

NOTE: The open source projects on this list are ordered by number of github stars. The number of mentions indicates repo mentiontions in the last 12 Months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-09.

Java Big Data related posts

Top 10 Common Data Engineers and Scientists Pain Points in 2024
1 project | dev.to | 11 Apr 2024
Linkedin OpenHouse: Control Plane for Tables in Data Lakehouses
1 project | news.ycombinator.com | 11 Mar 2024
Choosing Between a Streaming Database and a Stream Processing Framework in Python
10 projects | dev.to | 10 Feb 2024
Ask HN: Does (or why does) anyone use MapReduce anymore?
2 projects | news.ycombinator.com | 24 Jan 2024
StarRocks – sub-second MPP OLAP database for full analytics scenarios
1 project | news.ycombinator.com | 23 Jan 2024
Five Apache projects you probably didn't know about
8 projects | dev.to | 21 Dec 2023
Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog
4 projects | dev.to | 18 Dec 2023
A note from our sponsor - SaaSHub
www.saashub.com | 18 Apr 2024

SaaSHub helps you find the best software and product alternatives Learn more →

Index

What are some of the best open-source Big Data projects in Java? This list will help you:

	Project	Stars
1	Apache Flink	23,128
2	Presto	15,582
3	QuestDB	13,420
4	Trino	9,519
5	kafka-ui	8,416
6	starrocks	7,726
7	beam	7,477
8	Zeppelin	6,261
9	Hazelcast	5,853
10	vespa	5,323
11	Apache Hive	5,320
12	Apache Ignite	4,675
13	Apache Calcite	4,344
14	iotdb	4,249
15	Crate	3,952
16	fastjson2	3,411
17	Flume	2,503
18	Apache Parquet	2,398
19	LakeSoul	2,294
20	Apache Drill	1,891
21	bookkeeper	1,845
22	paimon	1,792
23	parquet-format	1,633