Java Big Data

Open-source Java projects categorized as Big Data

Top 23 Java Big Data Projects

  • GitHub repo Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Let's write a compiler, part 5: A code generator | | 2021-08-19

  • GitHub repo Apache Storm

    Mirror of Apache Storm

  • GitHub repo Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: What libraries do you use for machine learning and data visualizing in scala? | | 2021-11-27

    Other, more widely used notebooks for Scala and Spark:

  • GitHub repo beam

    Apache Beam is a unified programming model for Batch and Streaming

    Project mention: The Data Engineer Roadmap 🗺 | | 2021-10-19

    Apache Beam

  • GitHub repo Hazelcast

    Open-source distributed computation and storage platform

    Project mention: Show HN: Hazelcast 5 BETA – streaming+storage in one | | 2021-07-16
  • GitHub repo Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL

    Project mention: Learn SQL | | 2021-08-03

    You might find interesting. It allows you to bolt on an MPP SQL execution engine on top of any data source, including pre-built connectors for Druid and Kafka.

    It's all ANSI SQL, and the best part is that you can combine data from heterogeneous sources. For example, you can join data between a topic in Kafka and a table in Druid, or even between Kafka, S3 and your RDBMS.

    Disclaimer: I'm a maintainer of the project.

  • GitHub repo datahub

    The Metadata Platform for the Modern Data Stack

    Project mention: Two Methods to Scan for PII in Data Warehouses | | 2021-11-29

    An important requirement for data privacy and protection is to find and catalog tables and columns that contain PII or PHI data in a data warehouse. Open source data catalogs like Datahub and Amundsen enable cataloging of information in data warehouses. Moreover, tables and columns can be tagged including PII and type of PII tags.

  • GitHub repo Apache Ignite

    Apache Ignite (by apache)

    Project mention: .NET and Apache Ignite: Testing Cache and SQL API features — Part I | | 2021-09-11

    Over the last few days, I started using Apache Ignite as a caching strategy for some applications. Apache Ignite is an open-source in-memory data grid, distributed database, caching, and high-performance computing platform.

  • GitHub repo Apache Hive

    Apache Hive

    Project mention: Understanding SQL Dialects | | 2021-11-17

    Apache Hive takes in a specific SQL dialect and converts it to MapReduce jobs.

  • GitHub repo vespa

    The open big data serving engine.

    Project mention: MeiliSearch: A Minimalist Full-Text Search Engine | | 2021-08-15

    After looking at various alternatives, I'm thinking of trying out [0]


  • GitHub repo Apache Calcite

    Apache Calcite

    Project mention: Anyone know of any software that can help in designing then outputting to various database | | 2021-11-21

    Abstraction Layer - You can use something like Calcite to abstract out your data storage.

  • GitHub repo Flume

    Mirror of Apache Flume

    Project mention: 12-Factor App For Dummies | | 2021-11-01


  • GitHub repo Apache Drill

    Apache Drill is a distributed MPP query layer for self describing data

    Project mention: Apache Drill: the reports of my death have been greatly exaggerated | | 2021-11-01

    >We’ve started talking about speeding up our release cadence to better reflect our recent activity.

    There's been only one release per year in the past, so you can't fault anyone for thinking the project is dead.

  • GitHub repo bookkeeper

    Apache Bookkeeper

    Project mention: Scalable, fault-tolerant, low-latency storage service for real-time workloads | | 2021-10-26
  • GitHub repo Apache Parquet

    Apache Parquet

    Project mention: Writing Apache Parquet Files | | 2021-05-30

    Hi, I've been trying to write Parquet files on Android for the past couple of days, and have really been struggling to find a solution. My original hypothesis was to just use the Java Parquet implementation, but I've since realized that not all Java libraries play well with Android. I've essentially gone through dependency hell trying to franken-fit the library into my project, and imported as much as I could before hitting walls such as this one.

  • GitHub repo DatumBox

    Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

  • GitHub repo dremio-oss

    Dremio - the missing link in modern data

    Project mention: Build your own “data lake” for reporting purposes | | 2021-03-14

    For my home projects I generate parquet (columnar and very well suited for DW like queries) files with pyarrow and use ( to query them on lake (minio or just local disk or s3) and use Apache Superset for quick charts or dashboards.

  • GitHub repo Hazelcast Jet

    Distributed Stream and Batch Processing

    Project mention: Updating data files, commits vs. pull requests | | 2021-08-15

    Hazelcast Jet

  • GitHub repo Apache Phoenix

    Mirror of Apache Phoenix (by apache)

  • GitHub repo Apache Accumulo

    Apache Accumulo

  • GitHub repo Rakam

    📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

    Project mention: Show HN: Lightdash – An open source Looker alternative | | 2021-06-03

    Well done! I was actually looking for an open source LookML a while back and found Rakam[0]. It seems they added the dbt layer after the fact while you started with that concept. Product looks slick, good luck?

    By the way, what happened with Hubble?

    0 -

  • GitHub repo Mockneat

    MockNeat - the modern faker lib.

    Project mention: The open-source projects I have been working on recently | | 2021-08-01

    A Java library for generating random data, useful for testing applications.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-11-29.

What are some of the best open-source Big Data projects in Java? This list will help you:

#    Project           Stars
1    Apache Flink      17,595
2    Presto            12,873
3    Apache Storm       6,297
4    Zeppelin           5,486
5    beam               5,110
6    Hazelcast          4,631
7    Trino              4,460
8    datahub            4,122
9    Apache Ignite      4,009
10   Apache Hive        3,993
11   vespa              3,727
12   Apache Calcite     2,783
13   Flume              2,188
14   Apache Drill       1,616
15   bookkeeper         1,430
16   Apache Parquet     1,426
17   DatumBox           1,073
18   dremio-oss           989
19   Hazelcast Jet        951
20   Apache Phoenix       907
21   Apache Accumulo      895
22   Rakam                786
23   Mockneat             440