Understanding Parquet, Iceberg and Data Lakehouses

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • iceberg-python

    Apache PyIceberg

  • You don't need a Spark deployment. The first reference implementations for reading and writing were in Spark.

    Now, with PyIceberg, there is read support in Python. Write support should be merged very soon - https://github.com/apache/iceberg-python/pull/41
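    As a rough illustration of reading an Iceberg table from Python without Spark, here is a minimal sketch using PyIceberg's catalog and scan calls. The catalog name `default`, the table identifier `examples.events`, and the column names are hypothetical placeholders, and the sketch assumes a catalog has already been configured (for example in `~/.pyiceberg.yaml`).

```python
def read_recent_events(catalog_name="default", table_name="examples.events"):
    """Scan an Iceberg table into an Arrow table without a Spark cluster.

    catalog_name, table_name, and the column names below are hypothetical;
    substitute the values from your own catalog.
    """
    # Imported lazily so this module still loads where pyiceberg is not installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(catalog_name)    # connection settings come from config
    table = catalog.load_table(table_name)  # resolves the current table metadata

    scan = table.scan(
        row_filter="event_id > 1000",              # hypothetical integer column
        selected_fields=("event_id", "event_ts"),  # hypothetical column names
        limit=100,
    )
    return scan.to_arrow()  # materializing to Arrow requires pyarrow
```

    The scan only plans and reads the data files it needs, which is why no cluster is involved; the work happens in the calling Python process.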

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • I often hear references to Apache Iceberg and Delta Lake as if they’re two peas in the Open Table Formats pod. Yet…

    Here’s the Apache Iceberg table format specification:

    https://iceberg.apache.org/spec/

    As they like to say in patent law, anyone “skilled in the art” of database systems could use this to build and query Iceberg tables without too much difficulty.

    This is nominally the Delta Lake equivalent:

    https://github.com/delta-io/delta/blob/master/PROTOCOL.md

    I defy anyone to even scope out what level of effort would be required to fully implement the current spec, let alone what would be involved in keeping up to date as this beast evolves.

    Frankly, the Delta Lake spec reads like a reverse engineering of whatever implementation tradeoffs Databricks is making as they race to build out a lakehouse for every Fortune 1000 company burned by Hadoop (which is to say, most of them).

    My point is that I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!
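    To make the comparison above a little more concrete, here is a toy sketch of the simplest read path each format describes: replaying a Delta-style log of `add` and `remove` actions, and walking an Iceberg-style chain from table metadata to the current snapshot, its manifest list, and its manifests. Plain strings and dicts stand in for the real JSON and Avro files, the function names are made up for illustration, and the sketch deliberately ignores everything that makes a full implementation hard (checkpoints, deletion vectors, column mapping and protocol versions on the Delta side; delete files and partition pruning on the Iceberg side).

```python
import json


def delta_live_files(commits):
    """Replay Delta-style commits (oldest first) and return the live data files.

    Each commit is newline-delimited JSON with one action per line.
    Only `add` and `remove` are handled; every other action is ignored.
    """
    live = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


def iceberg_live_files(metadata, manifest_lists, manifests):
    """Walk metadata -> current snapshot -> manifest list -> manifests.

    manifest_lists and manifests are dicts standing in for the Avro files.
    """
    current = metadata["current-snapshot-id"]
    snapshot = next(s for s in metadata["snapshots"] if s["snapshot-id"] == current)
    files = []
    for entry in manifest_lists[snapshot["manifest-list"]]:
        for manifest_entry in manifests[entry["manifest_path"]]:
            if manifest_entry["status"] != 2:  # status 2 marks a deleted entry
                files.append(manifest_entry["data_file"]["file_path"])
    return sorted(files)


commits = [
    '{"add": {"path": "part-0.parquet"}}\n{"add": {"path": "part-1.parquet"}}',
    '{"remove": {"path": "part-0.parquet"}}\n{"add": {"path": "part-2.parquet"}}',
]
print(delta_live_files(commits))  # ['part-1.parquet', 'part-2.parquet']
```

    Both core loops are short. The commenter's concern is about how much optional and versioned behavior sits on top of that core, which a toy like this cannot capture.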

  • lance

    Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

  • Parquet has been the lakehouse file format of choice for nearly half a decade. But we are starting to see other contenders that are optimized for lower latency, such as Lance: https://github.com/lancedb/lance

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

