Top 23 OLAP Open-Source Projects

ClickHouse

208 34,054 10.0 C++

ClickHouse® is a free analytics DBMS for big data

Project mention: We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions | news.ycombinator.com | 2024-04-02

Yes, we are working on it! :) Taking some of the learnings from current experimental JSON Object datatype, we are now working on what will become the production-ready implementation. Details here: https://github.com/ClickHouse/ClickHouse/issues/54864
Variant datatype is already available as experimental in 24.1, Dynamic datatype is WIP (PR almost ready), and JSON datatype is next up. Check out the latest comment on that issue with how the Dynamic datatype will work: https://github.com/ClickHouse/ClickHouse/issues/54864#issuec...

duckdb

52 16,576 10.0 C++

DuckDB is an in-process SQL OLAP Database Management System

Project mention: 🪄 DuckDB sql hack : get things SORTED w/ constraint CHECK | dev.to | 2024-04-04

doris

42 11,314 10.0 Java

Apache Doris is an easy-to-use, high performance and unified analytics database.

Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27

As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means the query optimizer has nothing to work with.

starrocks

12 7,726 10.0 Java

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.

Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

databend

32 7,184 10.0 Rust

𝗗𝗮𝘁𝗮, 𝗔𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝘀 & 𝗔𝗜. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com

Project mention: Solutions to manage runaway Snowflake costs? | news.ycombinator.com | 2024-01-16

Databend vs. Snowflake: https://github.com/datafuselabs/databend/issues/13059

arrow-datafusion

55 4,924 9.9 Rust

Apache DataFusion SQL Query Engine

Project mention: Velox: Meta's Unified Execution Engine [pdf] | news.ycombinator.com | 2024-03-25

Python's Substrait seems like the biggest/most-used competitor-ish out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition, and more wants to be a language for talking about execution rather than a full on execution engine. https://github.com/substrait-io/substrait
We can also see from the DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441

Crate

6 3,955 9.9 Java

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

Project mention: FLaNK AI - 01 April 2024 | dev.to | 2024-04-01

heavydb

1 2,901 8.4 C++

HeavyDB (formerly OmniSciDB)
chdb

18 1,691 9.6 C++

chDB is an embedded OLAP SQL Engine 🚀 powered by ClickHouse

Project mention: FLaNK Stack Weekly 06 Nov 2023 | dev.to | 2023-11-06

matrixone

8 1,676 10.0 Go

Hyperconverged cloud-edge native database

Project mention: Push or Pull, is this a question? | dev.to | 2023-08-09

Source code: matrixorigin/matrixone: Hyperconverged cloud-edge native database (github.com)

risinglight

3 1,530 7.7 Rust

An educational OLAP database system.
Cubes

1 1,490 0.0 Python

[NOT MAINTAINED] Light-weight Python OLAP framework for multi-dimensional data analysis
datafusion-ballista

12 1,275 8.4 Rust

Apache Arrow Ballista Distributed Query Engine

Project mention: Polars | news.ycombinator.com | 2024-01-08

Not super on topic because this is all immature and not integrated with one another yet, but there is a scaled-out rust data-frames-on-arrow implementation called ballista that could maybe? form the backend of a polars scale out approach: https://github.com/apache/arrow-ballista

mondrian

2 1,120 5.4 Java

Mondrian is an Online Analytical Processing (OLAP) server that enables business users to analyze large quantities of data in real-time.
scratchdata

5 1,027 9.4 Go

Scratch is a swiss army knife for big data.

Project mention: Debugging a Golang Bug with Non-Blocking Reads | news.ycombinator.com | 2024-03-12

Go team does acknowledge [1] it as a bug, so there is some point here
However, that said, I wonder if OP (duckdb) could have written their solution [2] differently. Shouldn't they be able to select from a Pipe as well as an Error channel simultaneously (similar to how they are doing it inside here [3])? If not, I would have created a goroutine that does a blocking read on the Pipe and then passes the data on to another channel to select on.
[1] https://github.com/golang/go/issues/66239
[2] https://github.com/scratchdata/scratchdata/blob/7c1a0fcd0e20...
[3] https://github.com/scratchdata/scratchdata/blob/7c1a0fcd0e20...

kuzu

11 1,003 9.9 C++

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.

Project mention: Unum: Vector Search engine in a single file | news.ycombinator.com | 2023-07-31

duckdb-wasm

11 916 9.6 C++

WebAssembly version of DuckDB

Project mention: Parquet-WASM: Rust-based WebAssembly bindings to read and write Parquet data | news.ycombinator.com | 2024-04-22

i think duckdb-wasm is closer to 6MB over wire, but ~36MB once decompressed. (see net panel when loading https://shell.duckdb.org/)
the decompressed size should be okay since it's not the same as parsing and JITing 36MB of JS.

stonedb

7 850 7.8 C++

StoneDB is an open-source MySQL HTAP and MySQL-native database for OLTP and real-time analytics, a counterpart of MySQL HeatWave. (https://stonedb.io)
stanchion

3 616 8.9 C

A SQLite extension that brings column-oriented tables to SQLite

Project mention: Show HN: Stanchion – Column-oriented tables in SQLite | news.ycombinator.com | 2024-01-31

The "Data Storage Internals" section[1] of the README sounds to me like it has its own column-oriented format for these tables, at least that's how I'm reading the part about segments. Is that the case? If so, have you tried using Apache Arrow or Parquet to see how they compare?
[1] https://github.com/dgllghr/stanchion#data-storage-internals

ClickBench

68 567 9.1 HTML

ClickBench: a Benchmark For Analytical Databases

Project mention: Loading a trillion rows of weather data into TimescaleDB | news.ycombinator.com | 2024-04-16

TimescaleDB primarily serves operational use cases: Developers building products on top of live data, where you are regularly streaming in fresh data, and you often know what many queries look like a priori, because those are powering your live APIs, dashboards, and product experience.
That's different from a data warehouse or many traditional "OLAP" use cases, where you might dump a big dataset statically, and then people will occasionally do ad-hoc queries against it. This is the big weather dataset file sitting on your desktop that you occasionally query while on holidays.
So it's less about "can you store weather data", but what does that use case look like? How are the queries shaped? Are you saving a single dataset for ad-hoc queries across the entire dataset, or continuously streaming in new data, and aging out or de-prioritizing old data?
In most of the products we serve, customers are often interested in recent data in a very granular format ("shallow and wide"), or longer historical queries along a well defined axis ("deep and narrow").
For example, this is where the benefits of TimescaleDB's segmented columnar compression emerge. It optimizes for those queries which are very common in your application, e.g., an IoT application that groups by or selects by deviceID, crypto/fintech analysis based on the ticker symbol, product analytics based on tenantID, etc.
If you look at ClickBench, what most of the queries say is: Scan ALL the data in your database, and GROUP BY one of the 100 columns in the web analytics logs.
- https://github.com/ClickHouse/ClickBench/blob/main/clickhous...
There are almost no time-predicates in the benchmark that Clickhouse created, but perhaps that is not surprising given it was designed for ad-hoc weblog analytics at Yandex.
So yes, Timescale serves many products today that use weather data, but has made different choices than Clickhouse (or things like DuckDB, pg_analytics, etc) to serve those more operational use cases.

inline-sql

3 412 3.4 Python

🪄 Inline SQL in any Python program
duckdb-rs

3 361 7.3 Rust

Ergonomic bindings to duckdb for Rust
KuiBaDB

2 311 0.0 Rust

Another OLAP database

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

OLAP-related posts

🪄 DuckDB sql hack : get things SORTED w/ constraint CHECK
1 project | dev.to | 4 Apr 2024
We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions
1 project | news.ycombinator.com | 2 Apr 2024
Build time is a collective responsibility
2 projects | news.ycombinator.com | 24 Mar 2024
Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis
2 projects | dev.to | 27 Mar 2024
42.parquet – A Zip Bomb for the Big Data Age
1 project | news.ycombinator.com | 26 Mar 2024
DuckDB: Move to push-based execution model (2021)
1 project | news.ycombinator.com | 15 Mar 2024
Fair Benchmarking Considered Difficult (2018) [pdf]
2 projects | news.ycombinator.com | 10 Mar 2024

Index

What are some of the best open-source OLAP projects? This list will help you:

#	Project	Stars
1	ClickHouse	34,054
2	duckdb	16,576
3	doris	11,314
4	starrocks	7,726
5	databend	7,184
6	arrow-datafusion	4,924
7	Crate	3,955
8	heavydb	2,901
9	chdb	1,691
10	matrixone	1,676
11	risinglight	1,530
12	Cubes	1,490
13	datafusion-ballista	1,275
14	mondrian	1,120
15	scratchdata	1,027
16	kuzu	1,003
17	duckdb-wasm	916
18	stonedb	850
19	stanchion	616
20	ClickBench	567
21	inline-sql	412
22	duckdb-rs	361
23	KuiBaDB	311