octosql
OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
What I talked about was
> Query language, data types support, feature completeness, stability and testing
I said nothing about correctness.
In terms of stability, I see a couple of pretty old, and still unresolved issues about memory safety (data races, segmentation faults) in your repository, found by users.
In contrast, most of the memory safety issues in ClickHouse are found by continuous fuzzing before the release. And finding similar issues will give you a reward: https://github.com/ClickHouse/ClickHouse/issues/38986
Our testing system has successfully found issues in well-known and widely used libraries - jemalloc, rocksdb, grpc, AWS, Arrow, Avro, ZooKeeper, Linux kernel... It is kind of surprising, and it gives the impression that we are the only product that does testing for real.
I also remember an example of using SQLancer from 1.5 years ago. When SQLancer appeared, we started to use it on ClickHouse, and it found a few issues and one crash. At the same time, it found a lot of crashes in DuckDB. But this example is very old, and DuckDB has evolved a lot since then - it is a much younger technology after all.
This summer I was preparing the ClickBench: https://benchmark.clickhouse.com/
When I tried to use DuckDB on the same dataset as ClickHouse, it simply did not work due to OOM: https://github.com/duckdb/duckdb/issues/3969
I also told them about our experience with various memory allocators, and why you should never use glibc's malloc.
This issue was fixed.
Congrats on the Show HN!
It's great to see more tools in this area (querying data from various sources in-place) and the Lambda use case is a really cool idea!
I've recently done a bunch of benchmarking, including ClickHouse Local, and the usage was straightforward, with everything working as it's supposed to.
On performance, though: one area where I think ClickHouse could still improve - vs OctoSQL[0] at least - is the JSON datasource, which seems slower, especially when only a small part of each JSON object is used. If only a single field of many is used, OctoSQL lazily parses only that field and skips the others, which yields non-trivial performance gains on big JSON files with small queries.
Basically, for a query like `SELECT COUNT(*), AVG(overall) FROM books.json` with the Amazon Review Dataset, OctoSQL is twice as fast (3s vs 6s). That's a minor thing though (OctoSQL will slow down for more complicated queries, while for ClickHouse decoding the input is and remains the bottleneck).
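To make the lazy-parsing idea concrete, here is a minimal Python sketch. It is not OctoSQL's actual implementation (OctoSQL is written in Go, and its internals are not shown in this thread). It assumes newline-delimited, well-formed JSON objects, and the helper names (`extract_field`, `count_and_avg`) and the sample records are made up for illustration. Only the requested top-level field is decoded; every other value is stepped over without building a Python object.

```python
import json

_DECODER = json.JSONDecoder()
_WS = " \t\r\n"


def _skip_ws(s, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(s) and s[i] in _WS:
        i += 1
    return i


def _skip_string(s, i):
    """s[i] is an opening quote; return the index just past the closing quote."""
    i += 1
    while s[i] != '"':
        i += 2 if s[i] == "\\" else 1  # step over escaped characters
    return i + 1


def _skip_value(s, i):
    """Return the index just past the JSON value starting at s[i], without decoding it."""
    if s[i] == '"':
        return _skip_string(s, i)
    if s[i] in "{[":
        depth = 0
        while True:
            c = s[i]
            if c == '"':
                i = _skip_string(s, i)  # brackets inside strings must not count
                continue
            if c in "{[":
                depth += 1
            elif c in "}]":
                depth -= 1
                if depth == 0:
                    return i + 1
            i += 1
    # Number, true, false or null: run until the next delimiter.
    while i < len(s) and s[i] not in ",}] \t\r\n":
        i += 1
    return i


def extract_field(line, wanted):
    """Return the value of top-level key `wanted`, or None if the key is absent."""
    i = _skip_ws(line, 0)
    if line[i] != "{":
        raise ValueError("expected a JSON object")
    i = _skip_ws(line, i + 1)
    while line[i] != "}":
        key, i = _DECODER.raw_decode(line, i)  # keys are short, so decode them
        i = _skip_ws(line, i)
        if line[i] != ":":
            raise ValueError("expected ':' after key")
        i = _skip_ws(line, i + 1)
        if key == wanted:
            value, _ = _DECODER.raw_decode(line, i)  # decode only this value
            return value
        i = _skip_ws(line, _skip_value(line, i))
        if line[i] == ",":
            i = _skip_ws(line, i + 1)
    return None


def count_and_avg(lines, field):
    """Rough equivalent of SELECT COUNT(*), AVG(field) over JSON lines."""
    count, seen, total = 0, 0, 0.0
    for line in lines:
        count += 1
        value = extract_field(line, field)
        if value is not None:
            seen += 1
            total += value
    return count, (total / seen if seen else None)


# Made-up records; only the "overall" field name comes from the query above.
SAMPLE = [
    '{"reviewerID": "A1", "overall": 5.0, "reviewText": "great \\"book\\" {really}"}',
    '{"helpful": [0, {"a": "]"}], "overall": 3.0}',
    '{"asin": "x"}',
]
print(count_and_avg(SAMPLE, "overall"))  # (3, 4.0)
```

In CPython the C-accelerated `json.loads` will likely still beat this pure-Python scanner, so read it as an illustration of the technique rather than a speedup recipe. The gain described above comes from doing the same skipping in a compiled parser.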
As the author of textql ( https://github.com/dinedal/textql ) - thanks for the shoutout!
Looks great, I love more options in the space for CLI based data analysis tools! Fantastic work!
I think they're talking about https://github.com/harelba/q, which is not very fast.