Top 23 Java Big Data Projects
-
It took me some time to get a good grasp of the power of SQL; and it really kicked in when I learned about optimization rules. It's a program that you rewrite, just like an optimizing compiler would.
You state what you want; you have different ways to fetch, match, and massage data; and you can search through this space to produce a physical plan. Ideally, you use knowledge such as table statistics to weight which parts to optimize, much as Java's JIT detects hot spots.
I find it fascinating to peer through database code to see what is going on. Lately, there have been new advances towards streaming databases, which bring a whole new design space. For example, you now have the latency of individual new rows to optimize for, as opposed to batching everything to optimize the latency of a whole dataset. Batch scanning benefits from better use of your CPU caches.
And maybe you could have a hybrid system which reads history from a log and aggregates in a batched manner, and then switches to another execution plan when it reaches the end of the log.
If you want to have a peek at that, here is Flink's set of rules [1], both generic and stream-specific. The names can be cryptic, but they usually give a good sense of what is going on. For example, PushFilterIntoTableSourceScanRule makes the WHERE clause apply as early as possible, to save some CPU/network bandwidth further down. PushPartitionIntoTableSourceScanRule tries to make a fan-out/shuffle happen as early as possible, so that parallelism can be put to use.
[1] https://github.com/apache/flink/blob/5f8fb304fb5d68cdb0b3e3c...
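The idea of a rule that rewrites a plan can be sketched with a toy plan tree. This is only an illustration of filter pushdown in general, not Flink's or Calcite's actual rule API: the `Plan`, `Scan`, `Project`, and `Filter` types and the `pushDown` method are hypothetical names invented for the sketch, and it assumes a projection only selects columns, so a filter can safely move below it.

```java
import java.util.ArrayList;
import java.util.List;

public class FilterPushdownSketch {
    // Hypothetical toy plan nodes; real engines carry far more detail.
    interface Plan {}
    record Scan(String table, List<String> pushedFilters) implements Plan {}
    record Project(List<String> columns, Plan input) implements Plan {}
    record Filter(String predicate, Plan input) implements Plan {}

    /** Rewrites the plan so that each filter ends up inside the scan where possible. */
    static Plan pushDown(Plan plan) {
        if (plan instanceof Project p) {
            return new Project(p.columns(), pushDown(p.input()));
        }
        if (plan instanceof Filter f) {
            Plan input = pushDown(f.input());
            if (input instanceof Project p) {
                // Filter(Project(x)) becomes Project(Filter(x)).
                // Assumption: the projection only selects columns.
                return new Project(p.columns(),
                        pushDown(new Filter(f.predicate(), p.input())));
            }
            if (input instanceof Scan s) {
                // Filter(Scan) becomes a scan that applies the predicate itself.
                List<String> filters = new ArrayList<>(s.pushedFilters());
                filters.add(f.predicate());
                return new Scan(s.table(), List.copyOf(filters));
            }
            return new Filter(f.predicate(), input);
        }
        return plan; // a bare Scan has nothing to rewrite
    }

    public static void main(String[] args) {
        Plan logical = new Filter("country = 'FR'",
                new Project(List.of("id", "country"),
                        new Scan("users", List.of())));
        System.out.println(pushDown(logical));
    }
}
```

After the rewrite, the predicate lives in the scan node, so rows are discarded before they reach the projection. That is the same intent the comment attributes to PushFilterIntoTableSourceScanRule.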
-
Project mention: Let's write a compiler, part 5: A code generator | news.ycombinator.com | 2021-08-19
-
-
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Have you tried Apache Zeppelin? I remember that you can pretty-print Spark DataFrames directly in it with z.show(df).
-
Project mention: Beginners Guide to Caching Inside an Apache Beam Dataflow Streaming Pipeline Using Python | dev.to | 2022-03-09
will do the job, but due to a bug in versions prior to this commit the tag parameter will be ignored. The cached object is going to be reloaded even if you provide the same identifier, rendering the whole mechanism useless and our transformation will hit our attached resources every time.
-
Project mention: Which data lineage tool did you implement at your company | reddit.com/r/dataengineering | 2022-03-29
I've been playing around with https://datahubproject.io which is in quite active development.
-
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Project mention: Feasibility on startup idea related to data pipelines | reddit.com/r/dataengineering | 2022-03-14
For querying various databases, Trino is a distributed SQL query engine that could help - https://trino.io/
-
-
Project mention: Show HN: Hazelcast 5 BETA – streaming+storage in one | news.ycombinator.com | 2021-07-16
-
In this article, I'm showing you how to create a Spring Boot app that loads data from Apache Hive via Apache Spark to the Aerospike Database. More than that, I'm giving you a recipe for writing integration tests for such scenarios that can be run either locally or during the CI pipeline execution. The code examples are taken from this repository.
-
Ignite works as you describe:
I wouldn't really recommend this approach, I would think more in terms of subscriptions and topics and less of a 'database'.
-
Project mention: MeiliSearch: A Minimalist Full-Text Search Engine | news.ycombinator.com | 2021-08-15
After looking at various alternatives, I'm thinking of trying out https://vespa.ai/ [0]
-
Crate
CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real-time.
Project mention: Parser generators vs. handwritten parsers: surveying major languages in 2021 | news.ycombinator.com | 2021-08-21
-
Project mention: CITIC Industrial Cloud — Apache ShardingSphere Enterprise Applications | dev.to | 2022-04-14
The SQL Federation engine contains processes such as SQL Parser, SQL Binder, SQL Optimizer, Data Fetcher and Operator Calculator, and is suitable for dealing with correlated queries and subqueries across multiple database instances. At the underlying layer, it uses Calcite to implement RBO (Rule Based Optimizer) and CBO (Cost Based Optimizer) based on relational algebra, and queries the results through the optimal execution plan.
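To make the CBO half of that description concrete, here is a minimal sketch of cost-based selection: enumerate candidate plans, estimate a cost for each from table statistics, and keep the cheapest. It does not use Calcite's API and is not how ShardingSphere implements it. `JoinPlan`, `cost`, `cheapest`, and the cost weights (2 per build-side row, 1 per probe-side row) are all assumptions made up for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class CostBasedChoiceSketch {
    // Hypothetical candidate: a hash join that builds on one table and probes with the other.
    record JoinPlan(String buildSide, String probeSide) {}

    /** Assumed cost model: building a hash table costs 2 per row, probing costs 1 per row. */
    static double cost(JoinPlan plan, Map<String, Long> rowCounts) {
        return 2.0 * rowCounts.get(plan.buildSide())
             + 1.0 * rowCounts.get(plan.probeSide());
    }

    /** Picks the candidate with the lowest estimated cost. */
    static JoinPlan cheapest(List<JoinPlan> candidates, Map<String, Long> rowCounts) {
        return candidates.stream()
                .min(Comparator.comparingDouble((JoinPlan p) -> cost(p, rowCounts)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Table statistics, as an optimizer might read them from a catalog.
        Map<String, Long> rowCounts = Map.of("orders", 1_000L, "customers", 10L);
        List<JoinPlan> candidates = List.of(
                new JoinPlan("orders", "customers"),
                new JoinPlan("customers", "orders"));
        System.out.println(cheapest(candidates, rowCounts));
    }
}
```

A rule-based optimizer would instead apply a fixed rewrite whenever its pattern matches, without consulting statistics. The cost-based step is what lets the same query get a different plan when the row counts change.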
-
Flume
-
-
Project mention: Apache Drill: the reports of my death have been greatly exaggerated | news.ycombinator.com | 2021-11-01
>We’ve started talking about speeding up our release cadence to better reflect our recent activity.
There's been only one release per year in the past, so you can't fault anyone for thinking the project is dead.
-
Project mention: Scalable, fault-tolerant, low-latency storage service for real-time workloads | news.ycombinator.com | 2021-10-26
-
This Go implementation, other than the common advantages of Go itself (small single executable, support for multiple platforms, speed, etc.), has some neat features compared with the Java parquet tool and the Python one, like:
-
DatumBox
Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
-
And as u/pych_phd said, it's not just Databricks, Snowflake and Azure who make these claims, even AWS, GCP, Dremio and I'm sure many others are too.
-
Hazelcast Jet
-
-
Project mention: Apache Accumulo – sorted, distributed, robust, scalable key/value store | news.ycombinator.com | 2022-04-19
Java Big Data related posts
- Computation reuse via fusion in Amazon Athena
- Apache Spark, Hive, and Spring Boot — Testing Guide
- Apache Accumulo – sorted, distributed, robust, scalable key/value store
- Which data lineage tool did you implement at your company
- Beginners Guide to Caching Inside an Apache Beam Dataflow Streaming Pipeline Using Python
- Launch HN: Hydra (YC W22) – Query Any Database via Postgres
- Metadata extraction and management
Index
What are some of the best open-source Big Data projects in Java? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Flink | 18,920 |
2 | Presto | 13,478 |
3 | Apache Storm | 6,351 |
4 | Zeppelin | 5,667 |
5 | beam | 5,515 |
6 | datahub | 5,496 |
7 | Trino | 5,434 |
8 | Hazelcast | 4,859 |
9 | Apache Hive | 4,281 |
10 | Apache Ignite | 4,154 |
11 | vespa | 3,937 |
12 | Crate | 3,393 |
13 | Apache Calcite | 3,088 |
14 | Flume | 2,266 |
15 | iotdb | 1,975 |
16 | Apache Drill | 1,673 |
17 | bookkeeper | 1,552 |
18 | Apache Parquet | 1,529 |
19 | DatumBox | 1,077 |
20 | dremio-oss | 1,065 |
21 | Hazelcast Jet | 987 |
22 | Apache Phoenix | 930 |
23 | Apache Accumulo | 916 |