Top 12 lakehouse Open-Source Projects
- starrocks: StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld's 2023 BOSSIE Award for best open source software.
- LakeSoul: LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
- dataall: A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
- Local-Data-LakeHouse: Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
- FLiPStackWeekly: FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.
Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27
As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data led to high CPU and I/O consumption in addition to high query latency, especially when the dataset was huge and complicated. Moreover, the lack of a pre-defined schema meant there was no handle for query optimization.
Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09
TiDB has been around for a while; it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks
Project mention: I can’t terraform my company’s Databricks environment and I’m going insane. | /r/dataengineering | 2023-06-20
Use the Databricks Terraform examples; the external credentials and external locations in UC should help.
Project mention: Unified storage framework for the entire machine learning lifecycle | news.ycombinator.com | 2024-02-28
lakehouse related posts
- Yandex open-sources its exabyte-scale big data platform
- YTsaurus: Open-source big data platform for distributed storage and processing
- YTsaurus – Yandex open source big data platform
- We have open-sourced Cuelake
- Feedback for open source data engineering tool Cuelake (similar to data bricks)
A note from our sponsor - InfluxDB
www.influxdata.com | 27 Apr 2024
Index
What are some of the best open-source lakehouse projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Presto | 15,591 |
2 | doris | 11,314 |
3 | starrocks | 7,764 |
4 | LakeSoul | 2,301 |
5 | ytsaurus | 1,765 |
6 | cuelake | 284 |
7 | dataall | 209 |
8 | terraform-databricks-examples | 177 |
9 | space | 135 |
10 | awesome-data-temporality | 96 |
11 | Local-Data-LakeHouse | 43 |
12 | FLiPStackWeekly | 14 |