sneller vs ClickHouse

sneller

World's fastest log analysis: λ + SQL + JSON + S3 (by SnellerInc)

sneller.ai

ClickHouse

ClickHouse® is a free analytics DBMS for big data (by ClickHouse)

Database Dbms Olap Analytics SQL distributed-database Big Data Mpp Clickhouse HacktoberFest

clickhouse.com

sneller	Project	ClickHouse
15	Mentions	208
969	Stars	34,269
0.7%	Growth	1.3%
9.1	Activity	10.0
4 months ago	Latest Commit	about 12 hours ago
Go	Language	C++
GNU General Public License v3.0 or later	License	Apache License 2.0

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

sneller

Posts with mentions or reviews of sneller. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-05-31.

OSS: Relicense to Apache 2 Globally
1 project | news.ycombinator.com | 23 Mar 2024
Iguana: fast SIMD-optimized decompression
7 projects | news.ycombinator.com | 31 May 2023

Looks like they switched from AGPL to Apache 2 last week?
https://github.com/SnellerInc/sneller/commit/05410a85f900e02...
Sneller: SQL for JSON at Scale
1 project | news.ycombinator.com | 18 May 2023
Sneller Regex vs Ripgrep
3 projects | news.ycombinator.com | 18 May 2023

And that is the primary reason why ripgrep doesn't bother with AVX-512. Not because of some lack of skill as this blog suggests:
> Additionally, ripgrep uses AVX2 and does not take advantage of AVX-512 instruction sets, but this can be forgiven given the specialized skills required for handcrafting for SkylakeX and Icelake/Zen4 processors.
Namely, I tried running sneller on my CPU, which is a pretty recent i9-12900K, and not even it supports AVX-512. That's because Intel has been dropping support for AVX-512 from its more recent consumer grade CPUs. ripgrep is running far more frequently on consumer grade CPUs, so supporting AVX-512 is probably not particularly advantageous. At least, it's not obvious to me that it's worth doing. And certainly, the skill argument isn't entirely wrong. I'd have to invest developer time to make it work.
I think there are two other things worth highlighting from this blog.
First is that sneller seems to do quite well with compressed data. This is definitely not ripgrep's strong suit. When you use ripgrep's -z/--search-zip flag, all it's doing is shelling out to your gzip/xz/whatever executable to do the decompression work, which is then streamed into ripgrep for searching. So if your search speed tanks when using -z/--search-zip, it's likely because your decompression tools are slow, not because of ripgrep. But it's a fair comparison from sneller's perspective, because it seems to integrate the two.
Second is the issue of multi-threaded search. In ripgrep, the fundamental unit of work is "search a file." ripgrep has no support for more granular parallelism. That is, if you give it one file, it's limited to doing a single threaded search. ripgrep could do more granular parallelism, but it hasn't been obviously worth it to me. If most searches are on a directory tree, then parallelizing at the level of each file is almost certainly good enough. Making ripgrep's parallelism more fine grained is a fair bit of work too, and there would be a lot of fiddly stuff to get right. If I could run sneller easily, I'd probably try to see how it does in a more varied workload than what is presented in this blog. :-)
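To make the "search a file" unit of work concrete, here is a minimal Python sketch of file-level parallelism. It is not ripgrep's code (ripgrep is written in Rust); `search_file` and `search_tree` are hypothetical names chosen for illustration. Because of CPython's global interpreter lock, the thread pool here shows the work granularity rather than a real speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def search_file(path, pattern):
    """Scan one file sequentially; a single file is never split across workers."""
    hits = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            if pattern.search(line):
                hits.append((str(path), line_number, line.rstrip("\n")))
    return hits


def search_tree(root, regex, workers=4):
    """Fan whole files out to a pool of workers (hypothetical helper)."""
    pattern = re.compile(regex)
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the input order, so results come back file by file.
        for hits in pool.map(lambda p: search_file(p, pattern), files):
            results.extend(hits)
    return results
```

With this layout a directory tree keeps every worker busy, while a single large file is handled by one worker only, which is the limitation described above.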
And finally some corrections:
> However, when using a single thread, ripgrep appears to be slightly faster.
Not just slightly faster, over 2x faster!
The single threaded results for Regex2 and Regex3 for Sneller are quite nice! I'd be interested in hearing more about what you're doing in the Regex2 case, since Sneller and ripgrep are about on par with the Regex3 case. Maybe a fail fast optimization?
> The reason for this is that ripgrep uses the Boyer-Moore string search algorithm, which is a pattern matching algorithm that is often used for searching for substrings within larger strings. It is particularly efficient when the pattern being searched for is relatively long and the alphabet of characters being searched over is relatively small. Sneller does not use this substring search algorithm and as a result is slower than ripgrep with substrings. However, when long substrings are not present, Sneller outperforms ripgrep.
ripgrep has never used Boyer-Moore. (Okay, some years ago, ripgrep could use Boyer-Moore in certain niche cases. But that hasn't been the case for a while and it was never the thing most commonly used). What ripgrep uses today is succinctly described here: https://github.com/BurntSushi/memchr#algorithms-used (But it has always eschewed algorithms like Boyer-Moore in favor of more heuristic-y approaches based on a background frequency distribution of bytes.)
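As a rough illustration of what a heuristic "based on a background frequency distribution of bytes" can look like, the Python sketch below scans for the byte of the needle that is assumed to be rarest and only then verifies the full match. This is not the memchr crate's implementation: the `COMMON` set is a made-up stand-in for a real frequency table, and `rarest_byte_offset` and `find_substring` are hypothetical names.

```python
# Assumption: a toy "background distribution" in which these bytes are common
# and everything else is rare. A real table would rank all 256 byte values.
COMMON = b" etaoinshrdlu\n"


def default_rank(byte):
    """Return a rank for a byte value; lower means rarer (toy heuristic)."""
    return 1 if byte in COMMON else 0


def rarest_byte_offset(needle, rank=default_rank):
    """Offset of the needle byte assumed to occur least often in haystacks."""
    return min(range(len(needle)), key=lambda i: rank(needle[i]))


def find_substring(haystack, needle, rank=default_rank):
    """Return the index of the leftmost match, or -1 if there is none."""
    if not needle:
        return 0
    offset = rarest_byte_offset(needle, rank)
    rare = bytes([needle[offset]])
    position = offset  # a match cannot place the rare byte any earlier
    while True:
        # Fast single-byte scan for the rare byte, standing in for memchr.
        position = haystack.find(rare, position)
        if position == -1:
            return -1
        start = position - offset
        # Only candidates that contain the rare byte are fully verified.
        if haystack.startswith(needle, start):
            return start
        position += 1
```

For example, `find_substring(b"Mr. Sherlock Holmes", b"Sherlock")` scans for `S` rather than for a common letter such as `e`, so far fewer candidate positions need to be verified.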
I think I would also contest the claim that "long substrings" are the key here. ripgrep is plenty fast with short substrings too. You're correct that if you have no literals then ripgrep will get slower because it has to fall back to the regex engine. But I'd like to see more robust benchmarks there. Your Regex2 and Regex3 benchmarks raise more questions than they answer. :-)
> Although the resulting .dot and .svg files may be somewhat clunky, we can still observe from the graph that the number of nodes and edges are small enough to use the branchless IceLake implementation. In this particular case, we only need 8 bits to encode the number of nodes and the number of distinct edges, enabling the tool to use (what we call) the 8-bit DFA implementation. For more details on how this works, see our post on regex implementations.
So this is talking about the DFA graph for the regex `Sherlock [A-Z]\w+`. It's important to point out that, in ripgrep, `\w` is Unicode aware by default. Which makes it absolutely enormous. So I think the state graph you linked is probably only for the ASCII version of that regex.
Indeed, reading your regex blog[1], it perhaps looks like a lot of the tricks you use won't work for Unicode, because Unicode tends to blow up finite automata.
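To get a feel for the size gap between ASCII and Unicode `\w`, the following Python sketch counts the code points each variant matches. It relies on Python's own definition of `\w`, which is not identical to the one in ripgrep's regex engine, and a code point count is not a DFA state count (a DFA over UTF-8 works on bytes), so treat the result only as an indication of scale. `count_word_code_points` is a hypothetical helper name.

```python
import re
import sys

UNICODE_WORD = re.compile(r"\w")          # Unicode-aware in Python 3 by default
ASCII_WORD = re.compile(r"\w", re.ASCII)  # restricted to [a-zA-Z0-9_]


def count_word_code_points(pattern):
    """Count the code points that the given single-character pattern matches."""
    return sum(
        1 for code_point in range(sys.maxunicode + 1)
        if pattern.match(chr(code_point))
    )
```

Calling `count_word_code_points(ASCII_WORD)` yields 63 (26 + 26 letters, 10 digits, and the underscore), while the Unicode variant matches several orders of magnitude more code points; the exact figure depends on the Unicode version bundled with the interpreter.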
If I could run Sneller, I'd probably try to poke it to see what its Unicode support looks like. From a quick glance at the source code, it also looks like you build full DFAs. So I would also try to poke it to see what happens when it is handed a not-so-small regex. (Building a DFA can take quite some time.)
Ah okay, I see, you put a max limit on the DFA: https://github.com/SnellerInc/sneller/blob/bb5adec564bf9869d...
Overall this is a very cool project!
[1]: https://sneller.io/blog/accelerating-regex-using-avx-512/
Sneller: Vectorized SQL for JSON at scale: fast, simple, schemaless
1 project | news.ycombinator.com | 4 Oct 2022

1 project | news.ycombinator.com | 17 May 2022
Lesser Known Features of ClickHouse
6 projects | news.ycombinator.com | 31 May 2022

Thanks for sharing this. It is a very interesting problem that highlights some of the technical challenges of working with modern event data, which happens to 'prefer' being semi-structured (i.e., JSON is the most natural serialization format while creating events).
It's also something we're working on! Shameless plug - I work at Sneller (sneller.io, open source at https://github.com/SnellerInc/sneller), which might be interesting to you.
A couple of key ideas - first, we bypass the need for any sort of 'semi-structured to relational' ETL/ELT overhead by running vectorized SQL on a (compressed) binary form of the JSON data which preserves its original structure. So we're schema-on-read first and foremost - you don't need to worry about adding new fields in the source JSON as long as your queries know of these new fields.
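The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Sneller executes queries (the comment describes vectorized SQL over a compressed binary form of the JSON); `select_where` is a hypothetical helper.

```python
import json


def select_where(json_lines, where_field, where_value, select_fields):
    """Filter and project JSON records without declaring a schema up front."""
    for raw in json_lines:
        record = json.loads(raw)
        if record.get(where_field) == where_value:
            # Fields missing from older records simply come back as None.
            yield {name: record.get(name) for name in select_fields}
```

If newer events start carrying an extra field such as `region`, a query that names that field picks it up immediately, and older records without it still parse and return `None` for it.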
Second, we completely separate storage from compute. Unlike CH we don't use local disk as any sort of storage tier, and use cloud object stores as our _primary_ storage tier. So all your data (including the compressed binary version of your source JSON) lives in s3 buckets in your control.
Feel free to check us out and let us know what you think!
GitHub - https://github.com/SnellerInc/sneller
Sneller: Building a SQL VM in AVX-512 Assembly
2 projects | /r/golang | 25 May 2022

You can check it out at https://github.com/SnellerInc/sneller and run it yourself.
Accelerated SQL for JSON with AVX512 (Golang)
1 project | news.ycombinator.com | 17 May 2022
Ask HN: Any project ideas for a newbie to x64 assembler?
1 project | news.ycombinator.com | 17 May 2022

Albeit not using gas, you may want to check out https://github.com/SnellerInc/sneller -- it has about 250 primitives written in AVX-512.

ClickHouse

Posts with mentions or reviews of ClickHouse. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-03-24.

We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions
1 project | news.ycombinator.com | 2 Apr 2024

Yes, we are working on it! :) Taking some of the learnings from the current experimental JSON Object datatype, we are now working on what will become the production-ready implementation. Details here: https://github.com/ClickHouse/ClickHouse/issues/54864
Variant datatype is already available as experimental in 24.1, Dynamic datatype is WIP (PR almost ready), and JSON datatype is next up. Check out the latest comment on that issue with how the Dynamic datatype will work: https://github.com/ClickHouse/ClickHouse/issues/54864#issuec...
Build time is a collective responsibility
2 projects | news.ycombinator.com | 24 Mar 2024

In our repository, I've set up a few hard limits: each translation unit cannot spend more than a certain amount of memory for compilation and a certain amount of CPU time, and the compiled binary has to be not larger than a certain size.
When these limits are reached, the CI stops working, and we have to remove the bloat: https://github.com/ClickHouse/ClickHouse/issues/61121
These limits are still too generous as of today, though: for example, the maximum CPU time to compile a translation unit is set to 1000 seconds, and the memory limit is 5 GB, which is ridiculously high.
Fair Benchmarking Considered Difficult (2018) [pdf]
2 projects | news.ycombinator.com | 10 Mar 2024

I have a project dedicated to this topic: https://github.com/ClickHouse/ClickBench
It is important to explain the limitations of a benchmark, provide a methodology, and make it reproducible. It also has to be simple enough, otherwise it will not be realistic to include a large number of participants.
I'm also collecting all database benchmarks I could find: https://github.com/ClickHouse/ClickHouse/issues/22398
How to choose the right type of database
15 projects | dev.to | 28 Feb 2024

ClickHouse: A fast open-source column-oriented database management system. ClickHouse is designed for real-time analytics on large datasets and excels in high-speed data insertion and querying, making it ideal for real-time monitoring and reporting.
Writing UDF for Clickhouse using Golang
2 projects | dev.to | 27 Feb 2024

Today we're going to create a UDF (user-defined function) in Golang that can be run inside a ClickHouse query. The function parses a UUID v1 and returns its timestamp, since ClickHouse doesn't have such a function for now. Inspired by the Python version, it uses the TabSeparated format (since it's the easiest to parse): a UDF in ClickHouse reads its input line by line, where each line is a row and each tab-separated text is a column/cell value.
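The post's own Go code is not reproduced on this page. As a rough sketch of the same line protocol, here is a Python version of the idea (the language of the variant the author cites as inspiration). The function names are hypothetical, and the sketch assumes the UUID arrives as the first tab-separated column of each line.

```python
import uuid

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid1_to_unix_seconds(text):
    """Extract the embedded timestamp of a version 1 UUID as Unix seconds."""
    parsed = uuid.UUID(text.strip())
    if parsed.version != 1:
        raise ValueError(f"not a version 1 UUID: {text!r}")
    return (parsed.time - UUID_EPOCH_OFFSET) / 10_000_000


def process_lines(lines):
    """Turn each input row (one line, tab-separated columns) into one output row."""
    for line in lines:
        first_column = line.rstrip("\n").split("\t")[0]
        yield f"{uuid1_to_unix_seconds(first_column)}\n"
```

To use it as an executable, one would feed `sys.stdin` to `process_lines` and write each result to `sys.stdout`, flushing as rows are produced; registering the executable with the server is a separate configuration step not covered here.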
The 2024 Web Hosting Report
37 projects | dev.to | 20 Feb 2024

For the third, examples here might be analytics plugins in specialized databases like Clickhouse, data-transformations in places like your ETL pipeline using Airflow or Fivetran, or special integrations in your authentication workflow with Auth0 hooks and rules.
Choosing Between a Streaming Database and a Stream Processing Framework in Python
10 projects | dev.to | 10 Feb 2024

Online analytical processing (OLAP) databases like Apache Druid, Apache Pinot, and ClickHouse shine at answering user-initiated analytical queries. For example, you might write a query over historical data to efficiently find the most-clicked products of the past month. In contrast with streaming databases, OLAP databases may not be optimized for incremental computation, which makes it harder to keep results fresh. A query in a streaming database focuses on recent data, making it suitable for continuous monitoring. With a streaming database, you can run queries like finding the top 10 sold products, where the "top 10 product list" might change in real time.
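The contrast between a one-off query over stored history and an incrementally maintained result can be sketched in Python. Both helpers are hypothetical and stand in for what an OLAP query and a streaming materialized result do conceptually; neither reflects any particular database's implementation.

```python
from collections import Counter


def top_products_batch(historical_events, k):
    """OLAP style: scan all stored events each time the question is asked."""
    return Counter(historical_events).most_common(k)


class RunningTopProducts:
    """Streaming style: update the counts per event, read the result anytime."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, product):
        # Incremental computation: constant work per incoming event.
        self._counts[product] += 1

    def top(self, k):
        return self._counts.most_common(k)
```

The batch function recomputes from scratch on every call, whereas the running version always holds an up-to-date answer, which is the freshness trade-off described above.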
Proton, a fast and lightweight alternative to Apache Flink
7 projects | news.ycombinator.com | 30 Jan 2024

Proton is a lightweight stream processing "add-on" for ClickHouse, and we are making these delta parts as standalone as possible. Meanwhile, contributing back to the ClickHouse community can also help a lot.
Please check this PR from the proton team: https://github.com/ClickHouse/ClickHouse/pull/54870
1 billion rows challenge in PostgreSQL and ClickHouse
1 project | dev.to | 18 Jan 2024

curl https://clickhouse.com/ | sh
We Executed a Critical Supply Chain Attack on PyTorch
6 projects | news.ycombinator.com | 14 Jan 2024

But I continue to find garbage in some of our CI scripts.
Here is an example: https://github.com/ClickHouse/ClickHouse/pull/58794/files
The right way is to:
- always pin versions of all packages;

What are some alternatives?

When comparing sneller and ClickHouse you can also consider the following projects:

Turbo-Base64 - Turbo Base64 - Fastest Base64 SIMD:SSE/AVX2/AVX512/Neon/Altivec - Faster than memcpy!

loki - Like Prometheus, but for logs.

aports - [MIRROR] Alpine packages build scripts

duckdb - DuckDB is an in-process SQL OLAP Database Management System

incubator-devlake - Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.

Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

blogs - blogs about sneller

VictoriaMetrics - VictoriaMetrics: fast, cost-effective monitoring solution and time series database

LZSSE - LZ77/LZSS designed for SSE based decompression

TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.

tlog - Terminal I/O logger

datafusion - Apache DataFusion SQL Query Engine
