Hadoop

Open-source projects categorized as Hadoop

Top 23 Hadoop Open-Source Projects

  1. data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  3. luigi

    Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more. It also comes with Hadoop support built in.

  4. APIJSON

    🏆 A real-time, coding-free, powerful, and secure ORM library 🚀 The backend provides APIs and docs without writing any code, and frontend (client) users can customize the data and structure of the returned JSON.

    Project mention: Top 15 Open-Source Low-Code Projects with the Most GitHub Stars | dev.to | 2024-07-18

    GitHub: https://github.com/Tencent/APIJSON | GitHub stars: 16.9k | Most recent update on GitHub: 2 days ago | Open-source license: Apache 2.0 | Active contributors this year: 6 | Accepts external PRs: Yes | Official website: http://apijson.cn/ | Documentation: https://apijsondocs.readthedocs.io/en/latest/

  5. Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Using IRIS and Presto for high-performance and scalable SQL queries | dev.to | 2025-01-19

    The rise of Big Data projects, real-time self-service analytics, online query services, and social networks, among others, has created demand for massive, high-performance data queries. MPP (massively parallel processing) database technology was created in response to this challenge, and it quickly established itself. Among the open-source MPP options, Presto (https://prestodb.io/) is the best known. It originated at Facebook, where it was used for data analytics, and was later open-sourced. Teradata has since joined the Presto community and now offers support for it.

  6. Apache Hadoop

    Apache Hadoop

    Project mention: Unveiling the Apache License 2.0: A Deep Dive into Open Source Freedom | dev.to | 2025-03-11

    One of the key attributes of Apache License 2.0 is its flexible nature. Permitting use in both proprietary and open source environments, it has become the go-to choice for innovative projects ranging from the Apache HTTP Server to large-scale initiatives like Apache Spark and Hadoop. This flexibility is not solely legal; it is also philosophical. The license is designed to encourage transparency and maintain a healthy balance between freedom and accountability, ultimately making it easier for developers to adapt and contribute without restrictive legal barriers. Another modern twist discussed in the article is the concept of dual licensing. Dual licensing can offer an attractive method for additional commercial exploitation while still upholding open source principles. However, as the article cautions, dual licensing involves legal intricacy and demands rigor in managing Contributor License Agreements (CLAs), a challenge that the open source community navigates with ongoing debates. For developers looking to understand similar innovative approaches to licensing, further information can be explored at License Token.

  7. Deeplearning4j

    Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a small, modular C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...

    Project mention: Deeplearning4j Suite Overview | news.ycombinator.com | 2024-03-29
  8. doris

    Apache Doris is an easy-to-use, high performance and unified analytics database.

    Project mention: Apache Doris: open-source data warehouse for real time data analytics | news.ycombinator.com | 2024-10-26
  10. Trino

    Official repository of Trino, the distributed SQL query engine for big data, former

    Project mention: Apache Iceberg | news.ycombinator.com | 2025-01-25
  11. school-of-sre

    At LinkedIn, we use this curriculum to onboard our entry-level talent into the SRE role.

  12. H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

  13. Alluxio (formerly Tachyon)

    Alluxio, data orchestration for analytics and machine learning in the cloud

  14. Apache Hive

    Apache Hive

    Project mention: Hive: An Open-Source Data Warehouse Built on Apache Hadoop | news.ycombinator.com | 2024-08-13
  15. Apache Ignite

    Apache Ignite (by apache)

    Project mention: API Caching: Techniques for Better Performance | dev.to | 2024-10-17

    Apache Ignite — Free and open-source, Apache Ignite is a horizontally scalable key-value cache store system with a robust multi-model database that powers APIs to compute distributed data. Ignite provides a security system that can authenticate users' credentials on the server. It can also be used for system workload acceleration, real-time data processing, analytics, and as a graph-centric programming model.

  16. Apache Calcite

    Apache Calcite

  17. Apache Nutch

    Apache Nutch is an extensible and scalable web crawler

    Project mention: 11 best open-source web crawlers and scrapers in 2024 | dev.to | 2024-10-29

    Language: Java | GitHub: 2.9K+ stars

  18. docker-hadoop

    Apache Hadoop docker image

  19. kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

  20. winutils

    winutils.exe, hadoop.dll, and hdfs.dll binaries for Hadoop on Windows (by cdarlint)

  21. Apache Drill

    Apache Drill is a distributed MPP query layer for self-describing data (by apache)

  22. nagios-plugins

    450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...

  23. kylo

    Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

  24. ozone

    Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.

    Project mention: Apache Ozone: Scalable, redundant, distributed object store for Apache Hadoop | news.ycombinator.com | 2024-12-04
  25. WeDataSphere

    WeDataSphere is a financial-grade, one-stop big data platform suite.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Hadoop related posts

  • Unveiling the Apache License 2.0: A Deep Dive into Open Source Freedom

    3 projects | dev.to | 11 Mar 2025
  • Apache Hadoop: Pioneering Open Source Innovation in Big Data

    2 projects | dev.to | 6 Mar 2025
  • Commit to Growth: My 2024 Reflection

    1 project | dev.to | 10 Jan 2025
  • Where is Java Used in Industry?

    1 project | dev.to | 18 Dec 2024
  • How to Install PySpark on Your Local Machine

    2 projects | dev.to | 9 Dec 2024
  • Apache Ozone: Scalable, redundant, distributed object store for Apache Hadoop

    1 project | news.ycombinator.com | 4 Dec 2024
  • Apache Doris: open-source data warehouse for real time data analytics

    1 project | news.ycombinator.com | 26 Oct 2024

Index

What are some of the best open-source Hadoop projects? This list will help you:

# | Project | Stars
1 | data-science-ipython-notebooks | 27,993
2 | luigi | 18,154
3 | APIJSON | 17,514
4 | Presto | 16,247
5 | Apache Hadoop | 14,975
6 | Deeplearning4j | 13,861
7 | doris | 13,322
8 | Trino | 10,996
9 | school-of-sre | 7,922
10 | H2O | 7,066
11 | Alluxio (formerly Tachyon) | 6,953
12 | Apache Hive | 5,652
13 | Apache Ignite | 4,893
14 | Apache Calcite | 4,759
15 | Apache Nutch | 2,987
16 | docker-hadoop | 2,229
17 | kyuubi | 2,159
18 | winutils | 2,007
19 | Apache Drill | 1,960
20 | nagios-plugins | 1,140
21 | kylo | 1,111
22 | ozone | 892
23 | WeDataSphere | 662
