hudi
big-data-pipeline-lambda-arch
| | hudi | big-data-pipeline-lambda-arch |
|---|---|---|
| Mentions | 20 | 1 |
| Stars | 5,066 | 161 |
| Growth | 2.2% | - |
| Activity | 9.9 | 1.5 |
| Last commit | 3 days ago | about 1 year ago |
| Language | Java | Java |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
hudi
- Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog
  Apache Iceberg is one of the three types of lakehouse, the other two are Apache Hudi and Delta Lake.
- The "Big Three's" Data Storage Offerings
  Structured, Semi-structured and Unstructured can be stored in one single format, a lakehouse storage format like Delta, Iceberg or Hudi (assuming those don't require low-latency SLAs like subsecond).
- Data-eng related highlights from the latest Thoughtworks Tech Radar
  Apache Hudi
- For those of you with Lakehouse Architectures, how do you handle duplicate records?
- AWS ACID data lakehouse
  Try Apache Hudi, it is fully integrated with AWS and offers almost everything that you requested.
- Data n00b looking for guidance on how to setup data lake/warehouse
  the corresponding kafka topics have 30d retention and I intend on having s3 sink connector for long term storage (open to other ideas here too, I noticed theres a hudi connector also)
- apache/hudi: Upserts, Deletes And Incremental Processing on Big Data.
- Big Data file formats
- How-to-Guide: Contributing to Open Source
  Apache Hudi
- What do you use for Data versioning?
  You could have a look at Apache Hudi - especially if you're running your Data Pipelines using Spark or Flink.
big-data-pipeline-lambda-arch
- Lambda Architecture: How to Build a Big Data Pipeline
  GitHub project
What are some alternatives?
iceberg - Apache Iceberg
OpenMetadata - Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
kudu - Mirror of Apache Kudu
Thingsboard - Open-source IoT Platform - Device management, data collection, processing and visualization.
Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache Avro - Apache Avro is a data serialization system.
debezium - Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
Zeppelin - Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
pinot - Apache Pinot - A realtime distributed OLAP datastore
dataCompare - big data comparison and data profiling platform: low code, data comparison and data profiling
delta - An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs