Understanding Parquet, Iceberg and Data Lakehouses

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • iceberg-python

    Apache PyIceberg

  • You don't need a Spark deployment; the first reference implementations for reading and writing Iceberg tables were in Spark.

    Now, with PyIceberg, there is read support in Python, and write support should be merged very soon - https://github.com/apache/iceberg-python/pull/41
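
    A minimal sketch of that read path, assuming PyIceberg is installed, a catalog named "default" is configured, and a table "db.events" exists (all three names are placeholder assumptions):

      from pyiceberg.catalog import load_catalog

      # Load a catalog from the local PyIceberg configuration ("default" is a placeholder name)
      catalog = load_catalog("default")

      # Load an existing Iceberg table ("db.events" is a placeholder identifier)
      table = catalog.load_table("db.events")

      # Plan a filtered, projected scan and materialize it as an Arrow table
      arrow_table = table.scan(
          row_filter="event_date >= '2024-01-01'",
          selected_fields=("event_id", "event_date"),
      ).to_arrow()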

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for Scala, Java, Rust, Ruby, and Python (by delta-io)
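
    As a hedged sketch of the non-JVM path, using the deltalake Python package (the delta-rs bindings); the path and sample data below are placeholders:

      import pandas as pd
      from deltalake import DeltaTable, write_deltalake

      # Write a small DataFrame as a Delta table (the path is a placeholder)
      df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
      write_deltalake("/tmp/demo_delta", df)

      # Read it back without Spark or any JVM dependency
      dt = DeltaTable("/tmp/demo_delta")
      print(dt.version())    # current table version (0 after the first write)
      print(dt.to_pandas())  # materialize the table as a DataFrame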

  • I often hear references to Apache Iceberg and Delta Lake as if they’re two peas in the Open Table Formats pod. Yet…

    Here’s the Apache Iceberg table format specification:

    https://iceberg.apache.org/spec/

    As they like to say in patent law, anyone “skilled in the art” of database systems could use this to build and query Iceberg tables without too much difficulty (a rough sketch of the layout the spec describes appears at the end of this comment).

    This is nominally the Delta Lake equivalent:

    https://github.com/delta-io/delta/blob/master/PROTOCOL.md

    I defy anyone to even scope out what level of effort would be required to fully implement the current spec, let alone what would be involved in keeping up to date as this beast evolves.

    Frankly, the Delta Lake spec reads like a reverse engineering of whatever implementation tradeoffs Databricks is making as they race to build out a lakehouse for every Fortune 1000 company burned by Hadoop (which is to say, most of them).

    My point is that I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!
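
    For concreteness, here is a rough sketch of the metadata hierarchy the Iceberg spec describes (file names are illustrative, not literal):

      warehouse/db/events/
        metadata/
          v2.metadata.json    # table metadata: schema, partition specs, snapshot log
          snap-….avro         # manifest list: one entry per manifest in a snapshot
          …-m0.avro           # manifest: data file paths plus per-column stats
        data/
          00000-….parquet     # immutable data files

    A reader resolves the current metadata file, follows its current snapshot to a manifest list, prunes manifests using partition stats, and scans the surviving data files.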

  • lance

    Modern columnar data format for ML and LLMs, implemented in Rust. Convert from Parquet in two lines of code for 100x faster random access, a vector index, and data versioning. Compatible with pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

  • Parquet has been the lakehouse file format of choice for nearly half a decade, but we are starting to see contenders optimized for lower-latency access, such as Lance: https://github.com/lancedb/lance
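
    A sketch of that conversion and the random-access path, using the lance Python package (paths and row indices are placeholders):

      import lance
      import pyarrow.parquet as pq

      # Convert an existing Parquet file to the Lance format (paths are placeholders)
      table = pq.read_table("data.parquet")
      lance.write_dataset(table, "data.lance")

      # Point reads by row index, without scanning the whole dataset
      ds = lance.dataset("data.lance")
      rows = ds.take([10, 1_000, 100_000])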

