Java Big Data

Open-source Java projects categorized as Big Data

Top 23 Java Big Data Projects

  • Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

    We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.

  • QuestDB

    QuestDB is an open source time-series database for fast ingest and SQL queries

    Project mention: QuestDB is an open source time-series database for fast ingest and SQL queries | news.ycombinator.com | 2024-08-31
  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, former

    Project mention: Trino: A fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-07-09
  • kafka-ui

    Open-Source Web UI for Apache Kafka Management

    Project mention: How to Get Remote Code Execution in Kafka UI | news.ycombinator.com | 2024-07-22
  • starrocks

    The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

    Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

    tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb

    Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

  • beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

    Project mention: No SNAPSHOTs | dev.to | 2024-07-30

    Even ASF does not use Maven to build some of its projects anymore: Beam, Groovy, Lucene, Geode, POI, and Solr are not built with Maven. Those are not the most popular ASF projects, I know, but still, it is something.

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: Serverless Data Processing on AWS : AWS Project | dev.to | 2024-11-13

    To do so, we will use Kinesis Data Analytics to run an Apache Flink application. To enhance our development experience, we will use Studio notebooks for Kinesis Data Analytics that are powered by Apache Zeppelin.

  • Hazelcast

    Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

  • vespa

    AI + Data, online. https://vespa.ai

  • iotdb

    Apache IoTDB

  • Apache Hive

    Apache Hive

    Project mention: Hive: An Open-Source Data Warehouse Built on Apache Hadoop | news.ycombinator.com | 2024-08-13
  • Apache Ignite

    Apache Ignite (by apache)

    Project mention: API Caching: Techniques for Better Performance | dev.to | 2024-10-17

    Apache Ignite — Free and open-source, Apache Ignite is a horizontally scalable key-value cache store system with a robust multi-model database that powers APIs to compute distributed data. Ignite provides a security system that can authenticate users' credentials on the server. It can also be used for system workload acceleration, real-time data processing, analytics, and as a graph-centric programming model.

  • Apache Calcite

    Apache Calcite

  • Crate

    CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

    Project mention: OpenAI Acquires Rockset | news.ycombinator.com | 2024-06-21

    Great initiative making a list of possible Rockset replacements. Would it be possible to open the Notion page for guest contributions?

    I would like to add CrateDB (I work there) to the list. CrateDB is a distributed SQL database purposely built for real-time analytics across large datasets of structured and semi-structured data. Similarly to Rockset, it indexes all data in real-time (text, vector, geospatial, time-series, and JSON) for the most efficient search and fast ad hoc query execution at any scale. It is built on top of Apache Lucene and unlike Rockset is open-source (https://github.com/crate/crate).

    Rockset frequently comes up among the other solutions our users were looking at before choosing CrateDB. For example, https://cratedb.com/customers/govspend.

  • fastjson2

    🚄 FASTJSON2 is a Java JSON library with excellent performance.

  • Apache Parquet

    Apache Parquet Java

    Project mention: Introducing Promptwright: Synthetic Dataset Generation with Local LLMs | dev.to | 2024-10-28

    Push the dataset to hugging face in parquet format

  • Flume

    Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log-like data

  • paimon

    Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

    Project mention: Apache Paimon Playground ft. Flink and Trino | dev.to | 2024-11-24

    Apache Paimon is a new data lakehouse format that focuses on solving the challenges of streaming scenarios, but also supports batch processing. Overall, Paimon has the potential to replace the existing Iceberg as the new standard for data lakehousing.

  • LakeSoul

    LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

  • Apache Drill

    Apache Drill is a distributed MPP query layer for self describing data (by apache)

  • bookkeeper

    Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads

  • bitsail

    BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Java Big Data related posts

  • Apache Ozone: Scalable, redundant, distributed object store for Apache Hadoop

    1 project | news.ycombinator.com | 4 Dec 2024
  • ClickHouse: The Key to Faster Insights

    4 projects | dev.to | 2 Dec 2024
  • Serverless Data Processing on AWS : AWS Project

    5 projects | dev.to | 13 Nov 2024
  • Show HN: Apache Wayang supports now Kafka

    1 project | news.ycombinator.com | 4 Nov 2024
  • Introducing Promptwright: Synthetic Dataset Generation with Local LLMs

    2 projects | dev.to | 28 Oct 2024
  • Apache Zeppelin

    6 projects | news.ycombinator.com | 2 Sep 2024
  • Streaming Data Alchemy: Apache Kafka Streams Meet Spring Boot

    1 project | dev.to | 19 Aug 2024

Index

What are some of the best open-source Big Data projects in Java? This list will help you:

#   Project          Stars
1   Apache Flink     24,218
2   Presto           16,090
3   QuestDB          14,687
4   Trino            10,564
5   kafka-ui          9,940
6   starrocks         9,288
7   beam              7,905
8   Zeppelin          6,423
9   Hazelcast         6,176
10  vespa             5,868
11  iotdb             5,646
12  Apache Hive       5,574
13  Apache Ignite     4,827
14  Apache Calcite    4,630
15  Crate             4,135
16  fastjson2         3,817
17  Apache Parquet    2,660
18  Flume             2,538
19  paimon            2,487
20  LakeSoul          2,392
21  Apache Drill      1,948
22  bookkeeper        1,907
23  bitsail           1,631
