Top 23 Java Big Data Projects
-
You can find examples of usage in the org/apache/flink/contrib/streaming/state package (https://github.com/apache/flink/tree/9fe8d7bf870987bf43bad63078e2590a38e4faf6/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state).
-
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.
-
Project mention: QuestDB is an open source time-series database for fast ingest and SQL queries | news.ycombinator.com | 2024-08-31
-
Project mention: Trino: A fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-07-09
-
-
starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09
tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks
-
Even the ASF no longer uses Maven to build some of its projects: Beam, Groovy, Lucene, Geode, POI, and Solr are not built with Maven. Those are not the most popular ASF projects, I know, but still, it is something.
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
To do so, we will use Kinesis Data Analytics to run an Apache Flink application. To enhance our development experience, we will use Studio notebooks for Kinesis Data Analytics that are powered by Apache Zeppelin.
-
Hazelcast
Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
-
-
-
Project mention: Hive: An Open-Source Data Warehouse Built on Apache Hadoop | news.ycombinator.com | 2024-08-13
-
Apache Ignite — Free and open-source, Apache Ignite is a horizontally scalable key-value cache store system with a robust multi-model database that powers APIs to compute distributed data. Ignite provides a security system that can authenticate users' credentials on the server. It can also be used for system workload acceleration, real-time data processing, analytics, and as a graph-centric programming model.
-
-
Crate
CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
Great initiative making a list of possible Rockset replacements. Would it be possible to open the Notion page for guest contributions?
I would like to add CrateDB (I work there) to the list. CrateDB is a distributed SQL database purpose-built for real-time analytics across large datasets of structured and semi-structured data. Like Rockset, it indexes all data in real time (text, vector, geospatial, time-series, and JSON) for efficient search and fast ad hoc query execution at any scale. It is built on top of Apache Lucene and, unlike Rockset, is open-source (https://github.com/crate/crate).
Rockset frequently comes up among the other solutions our users were looking at before choosing CrateDB. For example, https://cratedb.com/customers/govspend.
-
-
Project mention: Introducing Promptwright: Synthetic Dataset Generation with Local LLMs | dev.to | 2024-10-28
Push the dataset to Hugging Face in Parquet format
-
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data.
-
paimon
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Apache Paimon is a new data lakehouse format that focuses on solving the challenges of streaming scenarios, but also supports batch processing. Overall, Paimon has the potential to replace the existing Iceberg as the new standard for data lakehousing.
-
LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
-
-
bookkeeper
Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
-
bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Java Big Data related posts
-
Apache Ozone: Scalable, redundant, distributed object store for Apache Hadoop
-
ClickHouse: The Key to Faster Insights
-
Serverless Data Processing on AWS : AWS Project
-
Show HN: Apache Wayang supports now Kafka
-
Introducing Promptwright: Synthetic Dataset Generation with Local LLMs
-
Apache Zeppelin
-
Streaming Data Alchemy: Apache Kafka Streams Meet Spring Boot
Index
What are some of the best open-source Big Data projects in Java? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Flink | 24,218 |
2 | Presto | 16,090 |
3 | QuestDB | 14,687 |
4 | Trino | 10,564 |
5 | kafka-ui | 9,940 |
6 | starrocks | 9,288 |
7 | beam | 7,905 |
8 | Zeppelin | 6,423 |
9 | Hazelcast | 6,176 |
10 | vespa | 5,868 |
11 | iotdb | 5,646 |
12 | Apache Hive | 5,574 |
13 | Apache Ignite | 4,827 |
14 | Apache Calcite | 4,630 |
15 | Crate | 4,135 |
16 | fastjson2 | 3,817 |
17 | Apache Parquet | 2,660 |
18 | Flume | 2,538 |
19 | paimon | 2,487 |
20 | LakeSoul | 2,392 |
21 | Apache Drill | 1,948 |
22 | bookkeeper | 1,907 |
23 | bitsail | 1,631 |