Apache Hudi - The Streaming Data Lake Platform

This page summarizes the projects mentioned and recommended in the original post on dev.to

  • RocksDB

    A library that provides an embeddable, persistent key-value store for fast storage.

  • Hudi tables can be used as sinks for Spark/Flink pipelines, and the Hudi write path provides several enhanced capabilities over the file writing done by vanilla Parquet/Avro sinks. Hudi carefully classifies write operations into incremental operations (insert, upsert, delete) and batch/bulk operations (insert_overwrite, insert_overwrite_table, delete_partition, bulk_insert), and provides the relevant functionality for each in a performant and cohesive way. Both upsert and delete automatically handle merging of records with the same key in the input stream (say, a CDC stream obtained from an upstream table), then look up the index, and finally invoke a bin-packing algorithm to pack data into files while respecting a pre-configured target file size. An insert operation, on the other hand, is intelligent enough to skip the precombining and index lookup, while retaining the benefits of the rest of the pipeline. Similarly, the bulk_insert operation provides several sort modes for controlling initial file sizes and file counts when importing data from an external table into Hudi. The other batch write operations provide MVCC-based implementations of the typical overwrite semantics used in batch data pipelines, while retaining all the transactional and incremental processing capabilities, making it seamless to switch between incremental pipelines for regular runs and batch pipelines for backfilling or dropping older partitions. The write pipeline also contains lower-layer optimizations, such as handling large merges by spilling to RocksDB or an external spillable map, and multi-threaded/concurrent I/O to improve write performance.

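To make the write operations above concrete, here is a minimal, script-style Spark (Scala) sketch of selecting an operation through the Hudi datasource. The trips table, its trip_id/ts/date fields, and the S3 paths are illustrative assumptions, not taken from the post.

```scala
// Script-style sketch (e.g. for spark-shell with the Hudi bundle on the classpath).
// Table name, field names and paths are illustrative, not taken from the post.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-write-sketch").getOrCreate()
val incoming = spark.read.json("s3://my-bucket/cdc/trips/")   // e.g. a CDC batch from an upstream table

def writeHudi(operation: String): Unit = {
  incoming.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.operation", operation)        // upsert, insert, delete, bulk_insert, insert_overwrite, ...
    .option("hoodie.datasource.write.recordkey.field", "trip_id")  // record key used for merging and index lookups
    .option("hoodie.datasource.write.precombine.field", "ts")      // latest event time wins when keys collide in a batch
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .option("hoodie.parquet.max.file.size", (128L * 1024 * 1024).toString) // target base file size for bin packing
    .mode(SaveMode.Append)
    .save("s3://my-bucket/warehouse/trips")
}

writeHudi("upsert")       // precombine + index lookup + bin-packed file writing
writeHudi("bulk_insert")  // bulk import path; sort modes are picked via hoodie.bulkinsert.sort.mode
```
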
  • iceberg

    Apache Iceberg

  • But first, we needed to tackle the basics - transactions and mutability - on the data lake. In many ways, Apache Hudi pioneered the transactional data lake movement as we know it today. Specifically, during a time when more special-purpose systems were being born, Hudi introduced a serverless transaction layer that worked over the general-purpose Hadoop FileSystem abstraction on cloud stores/HDFS. This model helped Hudi scale writers/readers to thousands of cores on day one, compared to warehouses, which offer a richer set of transactional guarantees but are often bottlenecked by the tens of servers that need to handle them. It also brought us a lot of joy to see similar systems (e.g., Delta Lake) later adopt the same serverless transaction layer model that we originally shared way back in early '17. We consciously introduced two table types, Copy On Write (with simpler operability) and Merge On Read (for greater flexibility), and these terms are now used in projects outside Hudi to refer to similar ideas borrowed from Hudi. Through open sourcing and graduating from the Apache Incubator, we have made great progress elevating these ideas across the industry, as well as bringing them to life with a cohesive software stack. Given the exciting developments over the past year or so that have propelled data lakes further into the mainstream, we thought some perspective could help users see Hudi with the right lens, appreciate what it stands for, and be a part of where it’s headed. At this time, we also wanted to shine some light on all the great work done by 180+ contributors on the project, working with more than 2,000 unique users over Slack/GitHub/JIRA, contributing all the different capabilities Hudi has gained over the years, from its humble beginnings.

  • hudi

    Upserts, Deletes And Incremental Processing on Big Data.

  • Apache Orc

    Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

  • The following stack captures the layers of software components that make up Hudi, with each layer depending on and drawing strength from the layer below. Typically, data lake users write data out once using an open file format like Apache Parquet/ORC, stored on top of extremely scalable cloud storage or distributed file systems. Hudi provides a self-managing data plane to ingest, transform and manage this data, in a way that unlocks incremental processing on it.

  • javalin

    Discontinued. A simple and modern Java and Kotlin web framework. [Moved to: https://github.com/javalin/javalin]

  • Storing and serving table metadata right on the lake storage is scalable, but can be much less performant than RPCs against a scalable metadata server. Most cloud warehouses are internally built on a metadata layer that leverages an external database (e.g., Snowflake uses FoundationDB). Hudi also provides a metadata server, called the “Timeline server”, which offers an alternative backing store for Hudi’s table metadata. Currently, the timeline server runs embedded in the Hudi writer processes, serving file listings out of a local RocksDB store over a Javalin REST API during the write process, without needing to repeatedly list the cloud storage. Given we have hardened this as the default option since our 0.6.0 release, we are considering standalone timeline server installations, with support for horizontal scaling, database/table mappings, security and all the features necessary to turn it into a highly performant next-generation lake metastore.

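As a rough illustration of the embedded timeline server described above, the sketch below shows the write-side configs that control it. The option names reflect recent Hudi releases, and the table, fields and paths are the same illustrative assumptions as before.

```scala
// Script-style sketch; option names reflect recent Hudi releases and the
// table/paths/fields are assumptions carried over from the earlier sketch.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-timeline-server-sketch").getOrCreate()
val df = spark.read.parquet("s3://my-bucket/staging/trips/")

df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Embedded timeline server (the default since 0.6.0): executors fetch file
  // listings over REST from the writer process instead of listing cloud storage.
  .option("hoodie.embed.timeline.server", "true")
  // Back the server's file-system view with a local RocksDB store.
  .option("hoodie.filesystem.view.type", "EMBEDDED_KV_STORE")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/warehouse/trips")
```
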
  • debezium

    Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.

  • True to its design goals, Hudi provides some very powerful incremental querying capabilities that tie together the meta fields added during writing and the file-group-based storage layout. While table formats that merely track files can only provide information about the files that changed in each snapshot or commit, Hudi generates the exact set of records that changed after a given point in the timeline, thanks to its tracking of record-level event and arrival times. Furthermore, this design allows large commits to be consumed in smaller chunks by an incremental query, fully decoupling the writing and incremental querying of data. Time travel is merely implemented as an incremental query that starts and stops at an older portion of the timeline. Since Hudi ensures that a key is atomically mapped to a single file group at any point in time, it is possible to support full CDC capabilities on Hudi tables, such as providing all possible values for a given record since time t, and CDC streams with both before and after images. All of these functionalities can be built local to each file group, given that each file group is a self-contained log. Much of our future work in this area will be around bringing such a powerful set of Debezium-like capabilities to life in the coming months.

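The incremental and time-travel reads described above might look roughly like the following Spark sketch; the instant times, path and query options shown are illustrative and follow recent Hudi releases.

```scala
// Script-style sketch of incremental and time-travel reads; the instant times,
// path and field names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()
val basePath = "s3://my-bucket/warehouse/trips"

// Exactly the records that changed between two instants on the timeline.
val changed = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20230901000000") // exclusive start instant
  .option("hoodie.datasource.read.end.instanttime", "20230902000000")   // optional end instant
  .load(basePath)

// Time travel, i.e. the table as of an older instant; conceptually an
// incremental query that stops earlier on the timeline.
val asOf = spark.read.format("hudi")
  .option("as.of.instant", "20230901000000")
  .load(basePath)
```
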
  • Apache Avro

    Apache Avro is a data serialization system.

  • Hudi is designed around the notion of a base file and delta log files that store updates/deltas against that base file (together called a file slice). Their formats are pluggable, with Parquet (columnar access) and HFile (indexed access) being the supported base file formats today. The delta logs encode data in the Avro (row-oriented) format for speedier logging (just like Kafka topics, for example). Going forward, we plan to inline any base file format into log blocks in the coming releases, providing columnar access to delta logs depending on block sizes. Future plans also include ORC base/log file formats, unstructured data formats (free-form JSON, images), and even tiered storage layers in event-streaming systems/OLAP engines/warehouses that work with their native file formats.

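A small, hedged sketch of how the table type and base file format can be chosen at write time is shown below; the option names are from recent releases and the table, fields and paths remain illustrative assumptions.

```scala
// Script-style sketch of picking the table type and base file format at write
// time; option names are from recent releases and the table/fields are assumed.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-format-sketch").getOrCreate()
val df = spark.read.parquet("s3://my-bucket/staging/trips/")

df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Merge On Read: updates land in Avro delta log files and are later
  // compacted into base files; Copy On Write rewrites base files directly.
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  // Base file format for each file slice: PARQUET (columnar) or HFILE (indexed).
  .option("hoodie.table.base.file.format", "PARQUET")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/warehouse/trips")
```
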