Top 17 Java Hadoop Projects
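- APIJSON
🏆 Zero-code, full-featured, highly secure ORM library 🚀 Backend APIs and docs with zero code; the frontend (client) customizes the data and structure of the returned JSON. 🏆 A JSON transmission protocol and ORM library 🚀 that provides APIs and docs without writing any code.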
- Deeplearning4j
Suite of tools for deploying and training deep learning models on the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a small, modular C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.
- Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
- Alluxio (formerly Tachyon)
Alluxio, data orchestration for analytics and machine learning in the cloud
- kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
- hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.
Project mention: Getting thousands of files of output back from a container | /r/docker | 2023-05-02
Did you check out tools like https://hadoop.apache.org/?
Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27
As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON files. However, parsing JSON data in real time during query execution leads to high CPU and I/O consumption as well as high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization.
Project mention: Trino: Fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-03-19
Project mention: Data diffs: Algorithms for explaining what changed in a dataset (2022) | news.ycombinator.com | 2023-07-26
> Make diff work on more than just SQLite.
Another way of doing this that I've been wanting to do for a while is to implement the DIFF operator in Apache Calcite[0]. Using Calcite, DIFF could be implemented as rewrite rules to generate the appropriate SQL to be directly executed against the database or the DIFF operator can be implemented outside of the database (which the original paper shows is more efficient).
[0] https://calcite.apache.org/
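The "generate the appropriate SQL to be directly executed against the database" idea can be illustrated with a minimal sketch. It is not Calcite's API and not the DIFF operator from the paper: it only rewrites a hypothetical row-level diff request into two `EXCEPT` queries and runs them on an in-memory SQLite database. The helper names (`diff_sql`, `run_diff`) and the `users_v1`/`users_v2` tables are assumptions made up for this example.

```python
import sqlite3


def diff_sql(old_table: str, new_table: str) -> dict:
    """Rewrite a row-level diff request into plain SQL (hypothetical helper).

    Both tables are assumed to share the same column layout.
    """
    return {
        "removed": f'SELECT * FROM "{old_table}" EXCEPT SELECT * FROM "{new_table}"',
        "added": f'SELECT * FROM "{new_table}" EXCEPT SELECT * FROM "{old_table}"',
    }


def run_diff(conn: sqlite3.Connection, old_table: str, new_table: str) -> dict:
    """Execute the generated queries inside the database and collect the rows."""
    return {
        kind: sorted(conn.execute(sql).fetchall())
        for kind, sql in diff_sql(old_table, new_table).items()
    }


# Two made-up snapshots of the same table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users_v1 (id INTEGER, name TEXT);
    CREATE TABLE users_v2 (id INTEGER, name TEXT);
    INSERT INTO users_v1 VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO users_v2 VALUES (1, 'ada'), (2, 'rob'), (3, 'cy');
    """
)
result = run_diff(conn, "users_v1", "users_v2")
print(result)
```

Because the diff is expressed as ordinary SQL text, the same generated queries could in principle be sent to any backend that supports `EXCEPT`, which is the appeal of doing the rewrite in a planner layer rather than per database.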
Project mention: Git Query Language (GQL) Aggregation Functions, Groups, Alias | /r/ProgrammingLanguages | 2023-06-30
Also, are you familiar with Apache Drill? The idea is to put an SQL interpreter in front of any kind of database, just like you are doing for git here.
Project mention: GitHub – GSA/code-gov: An informative repo for all Code.gov repos | news.ycombinator.com | 2023-09-09
https://github.com/simonw/datasette-lite :
> You can use this tool to open any SQLite database file that is hosted online and served with a `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.
> [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you
> To load a Parquet file, pass a URL to `?parquet=`
> [...] https://lite.datasette.io/?parquet=https://github.com/Terada...
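As a small illustration of the `?parquet=` shortcut quoted above, the sketch below builds such a link with the Python standard library. Only the `https://lite.datasette.io/` base and the `parquet` parameter name come from the quote; the `parquet_link` helper and the `example.com` file URL are hypothetical, and percent-encoding the value is a conservative choice rather than something the quote requires.

```python
from urllib.parse import urlencode

# Base URL and parameter name taken from the quoted README text.
DATASETTE_LITE = "https://lite.datasette.io/"


def parquet_link(parquet_url: str) -> str:
    """Return a Datasette Lite link that loads the given Parquet file (hypothetical helper)."""
    return f"{DATASETTE_LITE}?{urlencode({'parquet': parquet_url})}"


# Placeholder file location, not a real dataset.
link = parquet_link("https://example.com/data/sample.parquet")
print(link)
```

The target file still has to be served with the CORS header mentioned in the quote for the browser to be able to fetch it.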
There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette. E.g. Pandas with `dtype_backend='pyarrow'` saves to Parquet.
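To make the "*-to-sqlite" idea concrete, here is a minimal standard-library sketch of what such a loader does. It is not the code of any particular utility: `csv_to_sqlite` is a hypothetical helper, every column is stored as `TEXT` for simplicity, and the sample rows reuse a few project names from this page's index.

```python
import csv
import io
import sqlite3


def csv_to_sqlite(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Load CSV text into a new SQLite table and return the number of rows inserted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first line holds the column names
    rows = list(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


sample = "project,stars\nAPIJSON,16643\nPresto,15591\nTrino,9552\n"
conn = sqlite3.connect(":memory:")
inserted = csv_to_sqlite(conn, "projects", sample)
print(inserted, conn.execute("SELECT project FROM projects ORDER BY project").fetchall())
```

With a file path passed to `sqlite3.connect` instead of `:memory:`, the resulting database file is the kind of artifact a tool like datasette can then serve.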
datasette plugins are written in Python and/or JS w/ pluggy:
Project mention: Ask HN: Is there any good open-source alternative to MinIO? | news.ycombinator.com | 2023-09-21
Project mention: Support different jdbc platforms and multiple instances of same DBMS | /r/ApacheWayang | 2023-12-05
Java Hadoop related posts
- Log Analysis: Elasticsearch VS Apache Doris
- Ask HN: Is there any good open-source alternative to MinIO?
- Ask HN: What are some SQL transpilers?
- Getting thousands of files of output back from a container
- Trying to run hadoop using docker
- 5 Best Practices For Data Integration To Boost ROI And Efficiency
- Hadoop or Spark?
Index
What are some of the best open-source Hadoop projects in Java? This list will help you:
# | Project | Stars |
---|---|---|
1 | APIJSON | 16,643 |
2 | Presto | 15,591 |
3 | Apache Hadoop | 14,316 |
4 | Deeplearning4j | 13,424 |
5 | doris | 11,314 |
6 | Trino | 9,552 |
7 | Alluxio (formerly Tachyon) | 6,631 |
8 | Apache Hive | 5,326 |
9 | Apache Ignite | 4,678 |
10 | Apache Calcite | 4,363 |
11 | Apache Nutch | 2,809 |
12 | Apache Drill | 1,894 |
13 | kylo | 1,091 |
14 | ozone | 760 |
15 | venice | 420 |
16 | incubator-wayang | 167 |
17 | hadoopcryptoledger | 141 |