Java Big Data

Open-source Java projects categorized as Big Data

Top 23 Java Big Data Projects

  • Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Let's write a compiler, part 5: A code generator | news.ycombinator.com | 2021-08-19

  • Apache Storm

    Mirror of Apache Storm

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: Visualization using Pyspark Dataframe | reddit.com/r/dataengineering | 2022-05-14

    Have you tried Apache Zeppelin? I remember that you can pretty-print Spark DataFrames directly in it with z.show(df).

  • beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

    Project mention: Beginners Guide to Caching Inside an Apache Beam Dataflow Streaming Pipeline Using Python | dev.to | 2022-03-09

    will do the job, but due to a bug in versions prior to this commit, the tag parameter will be ignored. The cached object is going to be reloaded even if you provide the same identifier, which renders the whole mechanism useless: our transformation will hit our attached resources every time.

  • datahub

    The Metadata Platform for the Modern Data Stack

    Project mention: Which data lineage tool did you implement at your company | reddit.com/r/dataengineering | 2022-03-29

    I've been playing around with https://datahubproject.io which is in quite active development.

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

    Project mention: Feasibility on startup idea related to data pipelines | reddit.com/r/dataengineering | 2022-03-14

    For querying various databases, Trino is a distributed SQL query engine that could help - https://trino.io/

  • Hazelcast

    Open-source distributed computation and storage platform

    Project mention: Show HN: Hazelcast 5 BETA – streaming+storage in one | news.ycombinator.com | 2021-07-16
  • Apache Hive

    Apache Hive

    Project mention: Apache Spark, Hive, and Spring Boot — Testing Guide | dev.to | 2022-04-22

    In this article, I'm showing you how to create a Spring Boot app that loads data from Apache Hive via Apache Spark to the Aerospike Database. More than that, I'm giving you a recipe for writing integration tests for such scenarios that can be run either locally or during the CI pipeline execution. The code examples are taken from this repository.

  • Apache Ignite

    Apache Ignite (by apache)

    Project mention: Ask HN: P2P Databases? | news.ycombinator.com | 2022-03-01

    Ignite works as you describe:

    https://ignite.apache.org/

    I wouldn't really recommend this approach, I would think more in terms of subscriptions and topics and less of a 'database'.

  • vespa

    The open big data serving engine. https://vespa.ai

    Project mention: MeiliSearch: A Minimalist Full-Text Search Engine | news.ycombinator.com | 2021-08-15

    After looking at various alternatives, I'm thinking of trying out https://vespa.ai/ [0]

    [0] https://github.com/vespa-engine/vespa

  • Crate

    CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real-time.

    Project mention: Parser generators vs. handwritten parsers: surveying major languages in 2021 | news.ycombinator.com | 2021-08-21
  • Apache Calcite

    Apache Calcite

    Project mention: CITIC Industrial Cloud — Apache ShardingSphere Enterprise Applications | dev.to | 2022-04-14

    The SQL Federation engine contains processes such as SQL Parser, SQL Binder, SQL Optimizer, Data Fetcher and Operator Calculator, suitable for dealing with correlated queries and subqueries across multiple database instances. At the underlying layer, it uses Calcite to implement RBO (Rule Based Optimizer) and CBO (Cost Based Optimizer) based on relational algebra, and queries the results through the optimal execution plan.

  • Flume

    Mirror of Apache Flume

    Project mention: 12-Factor App For Dummies | dev.to | 2021-11-01

    Flume

  • iotdb

    Apache IoTDB

    Project mention: Apache IoTDB | news.ycombinator.com | 2022-02-05
  • Apache Drill

    Apache Drill is a distributed MPP query layer for self describing data

    Project mention: Apache Drill: the reports of my death have been greatly exaggerated | news.ycombinator.com | 2021-11-01

    >We’ve started talking about speeding up our release cadence to better reflect our recent activity.

    There's been only one release per year in the past, so you can't fault anyone for thinking the project is dead.

    https://github.com/apache/drill/releases

  • bookkeeper

    Apache Bookkeeper

    Project mention: Scalable, fault-tolerant, low-latency storage service for real-time workloads | news.ycombinator.com | 2021-10-26
  • Apache Parquet

    Apache Parquet

    Project mention: parquet-tools | reddit.com/r/golang | 2022-01-23

    This Go implementation, besides the common advantages of Go itself (small single executable, support for multiple platforms, speed, etc.), has some neat features compared with the Java parquet tool and the Python one, like:

  • DatumBox

    Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

  • dremio-oss

    Dremio - the missing link in modern data

    Project mention: Data Lakehouse and Delta Lake | reddit.com/r/dataengineering | 2022-05-03

    And as u/pych_phd said, it's not just Databricks, Snowflake and Azure who make these claims, even AWS, GCP, Dremio and I'm sure many others are too.

  • Hazelcast Jet

    Distributed Stream and Batch Processing

    Project mention: Updating data files, commits vs. pull requests | dev.to | 2021-08-15

    Hazelcast Jet

  • Apache Phoenix

    Mirror of Apache Phoenix (by apache)

  • Apache Accumulo

    Apache Accumulo

    Project mention: Apache Accumulo – sorted, distributed, robust, scalable key/value store | news.ycombinator.com | 2022-04-19
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-05-20.

Index

What are some of the best open-source Big Data projects in Java? This list will help you:

Rank  Project           Stars
1     Apache Flink     18,920
2     Presto           13,478
3     Apache Storm      6,351
4     Zeppelin          5,667
5     beam              5,515
6     datahub           5,496
7     Trino             5,434
8     Hazelcast         4,859
9     Apache Hive       4,281
10    Apache Ignite     4,154
11    vespa             3,937
12    Crate             3,393
13    Apache Calcite    3,088
14    Flume             2,266
15    iotdb             1,975
16    Apache Drill      1,673
17    bookkeeper        1,552
18    Apache Parquet    1,529
19    DatumBox          1,077
20    dremio-oss        1,065
21    Hazelcast Jet       987
22    Apache Phoenix      930
23    Apache Accumulo     916