HDFS

Top 13 HDFS Open-Source Projects

  • seaweedfs

    SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.

  • Project mention: DwarFS – The Deduplicating Warp-Speed Advanced Read-Only File System | news.ycombinator.com | 2024-04-11

    Whoops: WebDAV:

    https://news.ycombinator.com/item?id=39417503

    SeaweedFS supports WebDAV. https://github.com/seaweedfs/seaweedfs/wiki/WebDAV

    I'm not able to find if borg/restic supports mounting backups as WebDAV, but in theory there's nothing stopping you.

    It's 100% user space (expose a rest service) and supported by a bunch of file-browsers with a bit of a network aware component to it as well.

  • Ceph

    Ceph is a distributed object, block, and file storage platform

  • Project mention: First time user sturggles | /r/ceph | 2023-06-24

    curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
    chmod a+x cephadm
    ./cephadm bootstrap --mon-ip 192.168.1.41

  • juicefs

    JuiceFS is a distributed POSIX file system built on top of Redis and S3.

  • Project mention: South Korea's No.1 Search Engine Chose JuiceFS over Alluxio for AI Storage | dev.to | 2024-01-18

    Support for Kerberos keytab files

  • smart_open

    Utils for streaming large files (S3, HDFS, gzip, bz2...)

  • TileDB

    The Universal Storage Engine

  • Project mention: Ask HN: Who is hiring? (May 2024) | news.ycombinator.com | 2024-05-01

    TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | https://tiledb.com

    TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.

    TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more (github.com/TileDB-Inc/TileDB).

    With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.

    Website: https://tiledb.com | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog

    We are actively hiring for several roles including:

    - Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)

  • hdfs

    A native go client for HDFS

  • kafka-connect-ui

    Web tool for Kafka Connect

  • rumble

    ⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more (by RumbleDB)

  • TileDB-Py

    Python interface to the TileDB storage engine

  • elbencho

    A distributed storage benchmark for file systems, object stores & block devices with support for GPUs

  • apache-spark-docker

    Dockerizing an Apache Spark Standalone Cluster

  • hdfs-rs

    libhdfs binding and wrapper APIs for Rust

  • NiFItoKafkaConnect

    NiFi -> Kafka Connect -> HDFS

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

HDFS related posts

  • DuckDB + dbt for a serverless event correlation pipeline?

    1 project | /r/dataengineering | 7 Dec 2023
  • SeaweedFS

    1 project | /r/devopspro | 9 Jun 2023
  • Experience running rook-ceph in production/large clusters

    1 project | /r/kubernetes | 21 Apr 2023
  • First Homelab as a 19yr old Software Developer

    2 projects | /r/homelab | 15 Apr 2023
  • SeaweedFS is a fast distributed storage system for blobs, objects and files

    1 project | news.ycombinator.com | 3 Apr 2023
  • pandas 2.0 and the Arrow revolution (part I)

    1 project | /r/dataengineering | 23 Feb 2023
  • SeaweedFS vs JuiceFS

    1 project | dev.to | 16 Feb 2023

Index

What are some of the best open-source HDFS projects? This list will help you:

Rank  Project              Stars
1     seaweedfs            21,123
2     Ceph                 13,259
3     juicefs              9,824
4     smart_open           3,093
5     TileDB               1,764
6     hdfs                 1,347
7     kafka-connect-ui     496
8     rumble               207
9     TileDB-Py            179
10    elbencho             147
11    apache-spark-docker  40
12    hdfs-rs              33
13    NiFItoKafkaConnect   3
