Top 23 Hadoop Open-Source Projects

data-science-ipython-notebooks

1 26,490 0.0 Python

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
luigi

14 17,327 6.3 Python

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.

Project mention: Ask HN: What is the correct way to deal with pipelines? | news.ycombinator.com | 2023-09-21

I agree there are many options in this space. Two others to consider:
- https://airflow.apache.org/
- https://github.com/spotify/luigi
There are also many Kubernetes based options out there. For the specific use case you specified, you might even consider a plain old Makefile and incrond if you expect these all to run on a single host and be triggered by a new file showing up in a directory…

APIJSON

0 16,659 8.3 Java

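🏆 A zero-code, full-featured, highly secure ORM library 🚀 Backend APIs and docs need no code, and the frontend (client) customizes the data and structure of the returned JSON. 🏆 A JSON Transmission Protocol and an ORM Library 🚀 provides APIs and Docs without writing any code.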
Presto

14 15,603 9.9 Java

The official home of the Presto distributed SQL query engine for big data

Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.

Apache Hadoop

26 14,342 9.9 Java

Apache Hadoop
Deeplearning4j

13 13,427 5.8 Java

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.

Project mention: Deeplearning4j Suite Overview | news.ycombinator.com | 2024-03-29

doris

42 11,363 10.0 Java

Apache Doris is an easy-to-use, high performance and unified analytics database.

Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27

As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization.

Trino

44 9,576 10.0 Java

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Project mention: Trino: Fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-03-19

school-of-sre

2 7,644 5.0 HTML

At LinkedIn, we are using this curriculum for onboarding our entry-level talent into the SRE role.

Project mention: School of SRE: Curriculum for onboarding non-traditional hires and new grads | /r/hypeurls | 2023-09-11

H2O

10 6,737 9.7 Jupyter Notebook

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Project mention: Really struggling with open source models | /r/LocalLLaMA | 2023-07-12

I would use H2O if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/

Alluxio (formerly Tachyon)

0 6,654 9.7 Java

Alluxio, data orchestration for analytics and machine learning in the cloud
Apache Hive

14 5,335 9.6 Java

Apache Hive
Apache Ignite

3 4,693 9.5 Java

Apache Ignite (by apache)
Apache Calcite

28 4,368 9.0 Java

Apache Calcite

Project mention: Data diffs: Algorithms for explaining what changed in a dataset (2022) | news.ycombinator.com | 2023-07-26

> Make diff work on more than just SQLite.
Another way of doing this that I've been wanting to do for a while is to implement the DIFF operator in Apache Calcite[0]. Using Calcite, DIFF could be implemented as rewrite rules to generate the appropriate SQL to be directly executed against the database or the DIFF operator can be implemented outside of the database (which the original paper shows is more efficient).
[0] https://calcite.apache.org/
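
The rewrite idea in the comment above (turning a diff request into ordinary SQL that the database executes itself) can be sketched roughly as follows. This is a minimal illustration, not Calcite's API: it uses Python's built-in sqlite3 module, reduces "diff" to a plain row-level set difference via EXCEPT, and the helper names and the users_v1/users_v2 tables are hypothetical.

```python
import sqlite3


def diff_sql(old_table: str, new_table: str) -> dict:
    """Rewrite a diff request into two plain SQL queries (hypothetical helper).

    Table names are interpolated directly, so they must be trusted identifiers.
    """
    return {
        "removed": f"SELECT * FROM {old_table} EXCEPT SELECT * FROM {new_table}",
        "added": f"SELECT * FROM {new_table} EXCEPT SELECT * FROM {old_table}",
    }


def table_diff(conn: sqlite3.Connection, old_table: str, new_table: str) -> dict:
    """Run the generated queries inside the database and collect sorted rows."""
    queries = diff_sql(old_table, new_table)
    return {kind: sorted(conn.execute(sql).fetchall()) for kind, sql in queries.items()}


# Two hypothetical snapshots of the same table, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users_v1 (id INTEGER, name TEXT);
    CREATE TABLE users_v2 (id INTEGER, name TEXT);
    INSERT INTO users_v1 VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO users_v2 VALUES (1, 'ann'), (2, 'bobby'), (4, 'dee');
    """
)

result = table_diff(conn, "users_v1", "users_v2")
print(result)
```

A changed row shows up once under "removed" (its old version) and once under "added" (its new version); summarizing or explaining such changes, as the linked article discusses, would need more than this set difference.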

Apache Nutch

3 2,812 8.0 Java

Apache Nutch is an extensible and scalable web crawler
docker-hadoop

4 2,118 0.0 Shell

Apache Hadoop docker image
kyuubi

1 1,941 9.8 Scala

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Apache Drill

9 1,895 8.1 Java

Apache Drill is a distributed MPP query layer for self describing data (by apache)

Project mention: Git Query Language (GQL) Aggregation Functions, Groups, Alias | /r/ProgrammingLanguages | 2023-06-30

Also, are you familiar with Apache Drill? The idea is to put a SQL interpreter in front of any kind of database, just like you are doing for git here.

winutils

4 1,773 2.2 Shell

winutils.exe, hadoop.dll, and hdfs.dll binaries for Hadoop on Windows (by cdarlint)

Project mention: Unable to write dataframe to files using PySpark on Pycharm | /r/apachespark | 2023-12-11

Hi guys, I am unable to write the dataframe to files in PySpark 3.5. I am using Python 3.11.6 along with JDK 11.0.21, and for the winutils file I am using winutils/hadoop-3.3.5/bin at master · cdarlint/winutils · GitHub. I've also added the code below; any help would be appreciated.

MooseFS

3 1,587 4.8 C

MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
nagios-plugins

2 1,119 8.4 Python

450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
kylo

1 1,091 10.0 Java

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

Project mention: GitHub – GSA/code-gov: An informative repo for all Code.gov repos | news.ycombinator.com | 2023-09-09

https://github.com/simonw/datasette-lite :
> You can use this tool to open any SQLite database file that is hosted online and served with a `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.
> [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you
> To load a Parquet file, pass a URL to `?parquet=`
> [...] https://lite.datasette.io/?parquet=https://github.com/Terada...
There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette. E.g. Pandas with `dtype_backend='arrow'` saves to Parquet.
datasette plugins are written in Python and/or JS w/ pluggy:

ozone

2 762 9.9 Java

Scalable, redundant, and distributed object store for Apache Hadoop

Project mention: Ask HN: Is there any good open-source alternative to MinIO? | news.ycombinator.com | 2023-09-21

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Hadoop related posts

Unable to write dataframe to files using PySpark on Pycharm
1 project | /r/apachespark | 11 Dec 2023

Log Analysis: Elasticsearch VS Apache Doris
1 project | dev.to | 16 Oct 2023

Ask HN: Is there any good open-source alternative to MinIO?
1 project | news.ycombinator.com | 21 Sep 2023

Ask HN: What are some SQL transpilers?
2 projects | news.ycombinator.com | 14 Jul 2023

Getting thousands of files of output back from a container
1 project | /r/docker | 2 May 2023

Trying to run hadoop using docker
1 project | /r/hadoop | 3 Apr 2023

Unveiling the Analytics Industry in Bangalore
3 projects | /r/u_Khushisondhi7 | 23 Mar 2023

Index

What are some of the best open-source Hadoop projects? This list will help you:

	Project	Stars
1	data-science-ipython-notebooks	26,490
2	luigi	17,327
3	APIJSON	16,659
4	Presto	15,603
5	Apache Hadoop	14,342
6	Deeplearning4j	13,427
7	doris	11,363
8	Trino	9,576
9	school-of-sre	7,644
10	H2O	6,737
11	Alluxio (formerly Tachyon)	6,654
12	Apache Hive	5,335
13	Apache Ignite	4,693
14	Apache Calcite	4,368
15	Apache Nutch	2,812
16	docker-hadoop	2,118
17	kyuubi	1,941
18	Apache Drill	1,895
19	winutils	1,773
20	MooseFS	1,587
21	nagios-plugins	1,119
22	kylo	1,091
23	ozone	762
