Java Big Data

Open-source Java projects categorized as Big Data

Top 23 Java Big Data Projects

  • Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Ask HN: What are some SQL transpilers? | news.ycombinator.com | 2023-07-14

  • QuestDB

    An open source time-series database for fast ingest and SQL queries

    Project mention: Annotations in Kubernetes Operator Design | dev.to | 2023-11-26

    In this post, I will detail a way in which I recently used annotations while writing an operator for my company's product, QuestDB. Hopefully this will give you an idea of how you can incorporate annotations into your own operators to harness their full potential.

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

    Project mention: Game analytic power: how we process more than 1 billion events per day | dev.to | 2023-11-24

    We decided not to waste time reinventing the wheel and simply installed Trino on our servers. It’s a full featured SQL query engine that works on your data. Now our analysts can use it to work with data from AppMetr and execute queries at different levels of complexity.

  • kafka-ui

    Open-Source Web UI for Apache Kafka Management

    Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17

  • beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

    Project mention: Releasing Temporian, a Python library for processing temporal data, built together with Google | /r/Python | 2023-09-17

    Flexible runtime ☁️: Temporian programs can run seamlessly in-process in Python, or on large datasets using Apache Beam.

  • Apache Storm

    Apache Storm

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

  • starrocks

    StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.

    Project mention: Ask HN: Are there any notable Chinese FLOSS projects? | news.ycombinator.com | 2023-05-11

    https://github.com/apache/doris is a great example. Same for its cousin https://github.com/StarRocks/starrocks, which was an early fork of the Doris project.

    To be fair, these are the only examples I can think of and I only learned of these as I'm standing up new data infra using starrocks.

  • Hazelcast

    Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

    Project mention: Does anyone know any good java implementations for distributed key-value store? | /r/ExperiencedDevs | 2023-06-08

    You're probably looking for Hazelcast here. Note that it does much more than just a distributed k/v, but it will get you where you need to go.

  • Apache Hive

    Apache Hive

    Project mention: Apache Iceberg as storage for on-premise data store (cluster) | /r/dataengineering | 2023-03-16

    Trino or Hive for SQL querying. Get Trino/Hive to talk to Nessie.

  • vespa

    AI + Data, online. https://vespa.ai

    Project mention: Top 10 Best Vector Databases & Libraries | dev.to | 2023-04-19

    Vespa (4.3k ⭐) → A fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query. Integrated machine-learned model inference allows you to apply AI to make sense of your data in real time.

  • Apache Ignite

    Apache Ignite (by apache)

  • Apache Calcite

    Apache Calcite

    Project mention: Data diffs: Algorithms for explaining what changed in a dataset (2022) | news.ycombinator.com | 2023-07-26

    > Make diff work on more than just SQLite.

    Another way of doing this that I've been wanting to do for a while is to implement the DIFF operator in Apache Calcite[0]. Using Calcite, DIFF could be implemented as rewrite rules to generate the appropriate SQL to be directly executed against the database or the DIFF operator can be implemented outside of the database (which the original paper shows is more efficient).

    [0] https://calcite.apache.org/

  • iotdb

    Apache IoTDB

  • Crate

    CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

    Project mention: Creating an advanced search engine with PostgreSQL | news.ycombinator.com | 2023-07-12

    I'm wondering if CrateDB [https://github.com/crate/crate] could fit your use case.

    It's a relational SQL database which aims for compatibility with PostgreSQL. Internally it uses Lucene as storage and as such can offer full-text functionality, which is exposed via MATCH.

  • fastjson2

    🚄 FASTJSON2 is a Java JSON library with excellent performance.

    Project mention: FLaNK Stack Weekly for 20 June 2023 | dev.to | 2023-06-20

  • Flume

    Mirror of Apache Flume

  • Apache Parquet

    Apache Parquet

  • LakeSoul

    LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

    Project mention: Open Source first Anniversary Star 1.2K! Review on the anniversary of LakeSoul, the unique open-source Lakehouse | dev.to | 2022-12-28

    Review code reference: https://github.com/meta-soul/LakeSoul/pull/115

  • Apache Drill

    Apache Drill is a distributed MPP query layer for self-describing data

    Project mention: Git Query Language (GQL) Aggregation Functions, Groups, Alias | /r/ProgrammingLanguages | 2023-06-30

    Also, are you familiar with Apache Drill? The idea is to put an SQL interpreter in front of any kind of database, just like you are doing for Git here.

  • bookkeeper

    Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads

    Project mention: Apache Pulsar vs Apache Kafka - How to choose a data streaming platform | dev.to | 2022-12-13

    Is it possible to store data within Kafka and Pulsar? The answer is yes, both systems offer long-term storage solutions, but their underlying implementations differ widely. While Kafka uses logs that are distributed among brokers, Pulsar uses Apache BookKeeper for storage.

  • parquet-format

    Apache Parquet

    Project mention: Summing columns in remote Parquet files using DuckDB | news.ycombinator.com | 2023-11-16

    Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format

    The offsets of said metadata are well-defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
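
The footer lookup described in that comment can be sketched roughly as follows. This is a minimal illustration, not code from parquet-format or DuckDB: it assumes an unencrypted Parquet file, whose last 8 bytes are a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The class and method names (`ParquetFooterLocator`, `footerRange`) are hypothetical, and fetching the bytes from S3 or another blob store is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: computes which byte range holds the Parquet footer
// metadata, given only the file size and the last 8 bytes of the file.
public class ParquetFooterLocator {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
    static final int TAIL_SIZE = 8; // 4-byte footer length + 4-byte magic

    /** Returns {footerStartOffset, footerLength} for an unencrypted Parquet file. */
    static long[] footerRange(long fileSize, byte[] tail) {
        if (tail.length != TAIL_SIZE) {
            throw new IllegalArgumentException("expected the last 8 bytes of the file");
        }
        // The final 4 bytes must be the magic marker.
        if (!Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC)) {
            throw new IllegalArgumentException("missing PAR1 magic; not a plain Parquet file");
        }
        // The 4 bytes before the magic are the footer length, little-endian.
        long footerLength = Integer.toUnsignedLong(
                ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt());
        long footerStart = fileSize - TAIL_SIZE - footerLength;
        // The file also starts with a 4-byte magic, so the footer cannot begin earlier.
        if (footerStart < MAGIC.length) {
            throw new IllegalArgumentException("footer length exceeds file size");
        }
        return new long[] {footerStart, footerLength};
    }

    public static void main(String[] args) {
        // Simulated tail of a 5000-byte file whose footer is 1000 bytes long.
        byte[] tail = ByteBuffer.allocate(TAIL_SIZE)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(1000)
                .put(MAGIC)
                .array();
        long[] range = footerRange(5000L, tail);
        // The value a client would send in an HTTP Range header.
        System.out.println("bytes=" + range[0] + "-" + (range[0] + range[1] - 1));
    }
}
```

With this approach a client needs two small range requests (the 8-byte tail, then the footer itself) before deciding which column chunks to fetch, which is why the full file never has to be downloaded.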

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-11-26.

Index

What are some of the best open-source Big Data projects in Java? This list will help you:

 #   Project           Stars
 1   Apache Flink      22,288
 2   Presto            15,226
 3   QuestDB           13,016
 4   Trino              8,864
 5   kafka-ui           7,355
 6   beam               7,246
 7   Apache Storm       6,504
 8   Zeppelin           6,195
 9   starrocks          5,828
10   Hazelcast          5,648
11   Apache Hive        5,149
12   vespa              4,960
13   Apache Ignite      4,588
14   Apache Calcite     4,115
15   iotdb              4,077
16   Crate              3,806
17   fastjson2          3,155
18   Flume              2,465
19   Apache Parquet     2,223
20   LakeSoul           2,206
21   Apache Drill       1,851
22   bookkeeper         1,804
23   parquet-format     1,517