Top 12 lakehouse Open-Source Projects

Presto

14 15,591 9.9 Java

The official home of the Presto distributed SQL query engine for big data

Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.

doris

42 11,314 10.0 Java

Apache Doris is an easy-to-use, high performance and unified analytics database.

Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27

As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization.

starrocks

12 7,764 10.0 Java

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.

Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

LakeSoul

21 2,301 9.3 Java

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
ytsaurus

4 1,765 10.0 C++

YTsaurus is a scalable and fault-tolerant open-source big data platform.
cuelake

2 284 0.0 JavaScript

Use SQL to build ELT pipelines on a data lakehouse.
dataall

1 209 9.2 Python

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
terraform-databricks-examples

1 177 7.3 HCL

Examples of using Terraform to deploy Databricks resources

Project mention: I can’t terraform my company’s Databricks environment and I’m going insane. | /r/dataengineering | 2023-06-20

Use the Databricks Terraform examples; the external credentials and external locations in UC should help.

space

1 135 8.9 Python

Unified storage framework for the entire machine learning lifecycle (by google)

Project mention: Unified storage framework for the entire machine learning lifecycle | news.ycombinator.com | 2024-02-28

awesome-data-temporality

17 96 10.0

A curated list to help you manage temporal data across many modalities 🚀.

Project mention: FLaNK Stack Weekly for 14 Aug 2023 | dev.to | 2023-08-14

Local-Data-LakeHouse

1 43 4.4 Dockerfile

Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
FLiPStackWeekly

79 14 9.9

FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...

Project mention: FLaNK AI-April 22, 2024 | dev.to | 2024-04-22

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

lakehouse related posts

Yandex open-sources its exabyte-scale big data platform
2 projects | news.ycombinator.com | 22 Mar 2023
YTsaurus: Open-source big data platform for distributed storage and processing
1 project | news.ycombinator.com | 21 Mar 2023
YTsaurus – Yandex open source big data platform
1 project | /r/CKsTechNews | 20 Mar 2023
We have open-sourced Cuelake
1 project | /r/bigdata | 22 Jun 2021
Feedback for open source data engineering tool Cuelake (similar to data bricks)
1 project | /r/dataengineering | 15 Jun 2021

Index

What are some of the best open-source lakehouse projects? This list will help you:

#	Project	Stars
1	Presto	15,591
2	doris	11,314
3	starrocks	7,764
4	LakeSoul	2,301
5	ytsaurus	1,765
6	cuelake	284
7	dataall	209
8	terraform-databricks-examples	177
9	space	135
10	awesome-data-temporality	96
11	Local-Data-LakeHouse	43
12	FLiPStackWeekly	14