Java Hadoop

Open-source Java projects categorized as Hadoop

Top 17 Java Hadoop Projects


  • APIJSON

    🏆 A zero-code, full-featured, highly secure ORM library and JSON transmission protocol. 🚀 It provides backend APIs and docs without writing any code, and lets the frontend (client) customize the data and structure of the returned JSON.

  • Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Multi-Database Support in DuckDB | 2024-01-28

    We have some of this functionality in Presto, but it takes a fair bit of work to implement it for all the different backends.

  • Apache Hadoop

    Apache Hadoop

    Project mention: Getting thousands of files of output back from a container | /r/docker | 2023-05-02

    Did you check out tools like ?

  • Deeplearning4j

    Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.

  • doris

    Apache Doris is an easy-to-use, high-performance, and unified analytics database.

    Project mention: Five Apache projects you probably didn't know about | 2023-12-21

    Apache Doris is a real-time data warehouse.

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL

    Project mention: Game analytic power: how we process more than 1 billion events per day | 2023-11-24

    We decided not to waste time reinventing the wheel and simply installed Trino on our servers. It’s a full-featured SQL query engine that works on your data. Now our analysts can use it to work with data from AppMetr and execute queries at different levels of complexity.

  • Alluxio (formerly Tachyon)

    Alluxio, data orchestration for analytics and machine learning in the cloud

  • Apache Hive

    Apache Hive

    Project mention: Apache Iceberg as storage for on-premise data store (cluster) | /r/dataengineering | 2023-03-16

    Trino or Hive for SQL querying. Get Trino/Hive to talk to Nessie.

  • Apache Ignite

    Apache Ignite (by apache)

  • Apache Calcite

    Apache Calcite

    Project mention: Data diffs: Algorithms for explaining what changed in a dataset (2022) | 2023-07-26

    > Make diff work on more than just SQLite.

    Another way of doing this that I've been wanting to do for a while is to implement the DIFF operator in Apache Calcite[0]. Using Calcite, DIFF could be implemented as rewrite rules to generate the appropriate SQL to be directly executed against the database or the DIFF operator can be implemented outside of the database (which the original paper shows is more efficient).


  • Apache Nutch

    Apache Nutch is an extensible and scalable web crawler

  • Apache Drill

    Apache Drill is a distributed MPP query layer for self describing data (by apache)

    Project mention: Git Query Language (GQL) Aggregation Functions, Groups, Alias | /r/ProgrammingLanguages | 2023-06-30

    Also, are you familiar with Apache Drill? The idea is to put an SQL interpreter in front of any kind of database, just like you are doing for Git here.

  • kylo

    Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

    Project mention: GitHub – GSA/code-gov: An informative repo for all repos | 2023-09-09

    > You can use this tool to open any SQLite database file that is hosted online and served with an `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.

    > [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you

    > To load a Parquet file, pass a URL to `?parquet=`

    > [...]

    There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette. E.g. Pandas with `dtype_backend='arrow'` saves to Parquet.

    datasette plugins are written in Python and/or JS w/ pluggy:

  • ozone

    Scalable, redundant, and distributed object store for Apache Hadoop

    Project mention: Ask HN: Is there any good open-source alternative to MinIO? | 2023-09-21

  • venice

    Venice, Derived Data Platform for Planet-Scale Workloads. (by linkedin)

  • incubator-wayang

    Apache Wayang (incubating) is the first cross-platform data processing system.

    Project mention: Support different jdbc platforms and multiple instances of same DBMS | /r/ApacheWayang | 2023-12-05

  • hadoopcryptoledger

    Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-01-28.

What are some of the best open-source Hadoop projects in Java? This list will help you:

#    Project                       Stars
1    APIJSON                       16,406
2    Presto                        15,458
3    Apache Hadoop                 14,182
4    Deeplearning4j                13,367
5    doris                         10,915
6    Trino                         9,300
7    Alluxio (formerly Tachyon)    6,581
8    Apache Hive                   5,262
9    Apache Ignite                 4,644
10   Apache Calcite                4,273
11   Apache Nutch                  2,773
12   Apache Drill                  1,864
13   kylo                          1,086
14   ozone                         735
15   venice                        409
16   incubator-wayang              163
17   hadoopcryptoledger            142