StringZilla vs compiler-explorer

StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖 (by ashvardanian)

ashvardanian.com

compiler-explorer

Run compilers interactively from your web browser and interact with the assembly (by compiler-explorer)

Rust C++ Go Dlang Compiler CPP Assembly ispc Haskell Swift rust-lang HacktoberFest

godbolt.org

Project	StringZilla	compiler-explorer
Mentions	14	191
Stars	1,811	15,238
Growth	-	1.8%
Activity	9.8	9.9
Latest Commit	15 days ago	5 days ago
Language	C++	TypeScript
License	Apache License 2.0	BSD 2-clause "Simplified" License

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

StringZilla

Posts with mentions or reviews of StringZilla. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-12-27.

Measuring energy usage: regular code vs. SIMD code
1 project | news.ycombinator.com | 19 Feb 2024

The 3.5x energy-efficiency gap between serial and SIMD code becomes even larger when
A. you do byte-level processing instead of float words;
B. you use embedded, IoT, and other low-energy devices.
A few years ago I compared Nvidia Jetson Xavier (long before the Orin release), an Intel-based MacBook Pro with Core i9, and AVX-512 capable CPUs on substring search benchmarks.
On Xavier one can quite easily disable/enable cores and reconfigure power usage. At peak I got to 4.2 GB/J, which was an 8.3x improvement in efficiency over LibC in substring search operations. The comparison table is still available in the older README: https://github.com/ashvardanian/StringZilla/tree/v2.0.2?tab=...
Show HN: StringZilla v3 with C++, Rust, and Swift bindings, and AVX-512 and NEON
1 project | news.ycombinator.com | 7 Feb 2024
How fast is rolling Karp-Rabin hashing?
1 project | news.ycombinator.com | 4 Feb 2024

This is extremely timely! I was working on SIMD variants for collision-resistant rolling-hash variants in the last few weeks for the v3 release of the StringZilla library [1].
I have tried several 4-way and 8-way parallel variants using AVX-512 DQ instructions for 64-bit integer multiplications [2] as well as using integer FMA instructions on Arm NEON with 32-bit multiplications [3]. The latter needs a better mixing approach to be collision-resistant.
So far I couldn't exceed 1 GB/s/core [4], so more research is needed. If you have any ideas - I am all ears!
[1]: https://github.com/ashvardanian/StringZilla/blob/bc1869a8529...
[2]: https://github.com/ashvardanian/StringZilla/blob/bc1869a8529...
[3]: https://github.com/ashvardanian/StringZilla/blob/bc1869a8529...
[4]: https://github.com/ashvardanian/StringZilla/tree/main-dev?ta...
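
For readers unfamiliar with the technique discussed in the post above, here is a minimal scalar sketch of a rolling Karp-Rabin hash. It is not StringZilla's implementation and uses no SIMD; the function names, the base `257`, and the Mersenne-prime modulus are assumptions chosen for illustration only.

```python
def rolling_hashes(data: bytes, window: int, base: int = 257,
                   modulus: int = (1 << 61) - 1):
    """Yield the polynomial hash of every `window`-byte slice of `data`."""
    if window <= 0 or len(data) < window:
        return
    # Weight of the byte that is about to leave the window: base^(window-1).
    top = pow(base, window - 1, modulus)
    h = 0
    for byte in data[:window]:
        h = (h * base + byte) % modulus
    yield h
    for i in range(window, len(data)):
        # Remove the oldest byte, shift by one position, append the newest byte.
        h = ((h - data[i - window] * top) * base + data[i]) % modulus
        yield h


def direct_hash(chunk: bytes, base: int = 257,
                modulus: int = (1 << 61) - 1) -> int:
    """Recompute the same hash from scratch, for cross-checking."""
    h = 0
    for byte in chunk:
        h = (h * base + byte) % modulus
    return h
```

Each step costs a constant number of multiplications regardless of the window length, which is what makes the approach attractive for vectorization: several independent windows can, in principle, be advanced in parallel lanes.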
4B If Statements
5 projects | news.ycombinator.com | 27 Dec 2023

Jokes aside, lookup tables are a common technique to avoid costly operations. I was recently implementing one to avoid integer division. In my case I knew that the numerator and denominator were 8-bit unsigned integers, so I've replaced the division with 2 table lookups and 6 shifts and arithmetic operations [1]. The well-known `libdivide` [2] does that for arbitrary 16, 32, and 64 bit integers, and it has precomputed magic numbers and lookup tables for all 16-bit integers in the same repo.
[1]: https://github.com/ashvardanian/StringZilla/blob/9f6ca3c6d3c...
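
As a rough illustration of the idea in the post above, the sketch below replaces 8-bit unsigned division with one table lookup, one multiplication, and one shift. This is a simpler variant than the "2 table lookups and 6 shifts" scheme the author describes, and it is not taken from StringZilla or `libdivide`; the table name `RECIPROCALS` and the function `div_u8` are hypothetical.

```python
# Hypothetical table: RECIPROCALS[d] = floor(2**16 / d) + 1 for d in 1..255.
# Index 0 is a placeholder because division by zero is rejected below.
RECIPROCALS = [0] + [(1 << 16) // d + 1 for d in range(1, 256)]


def div_u8(n: int, d: int) -> int:
    """Return n // d for 0 <= n <= 255 and 1 <= d <= 255 without dividing."""
    if not (0 <= n <= 255 and 1 <= d <= 255):
        raise ValueError("operands must fit in 8 bits and d must be non-zero")
    # Multiply by the scaled reciprocal, then drop the 16 fractional bits.
    return (n * RECIPROCALS[d]) >> 16
```

The rounding error introduced by the scaled reciprocal stays below one unit because `n * d` never exceeds 255 * 255, which is smaller than 2**16, so the result matches `n // d` for every valid pair.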
Python, C, Assembly – Faster Cosine Similarity
5 projects | news.ycombinator.com | 18 Dec 2023

That matches my experience, and goes beyond GCC and Clang. Between 2018 and 2020 I was giving a lot of lectures on this topic and we did a bunch of case studies with Intel on their older ICC and what later became the OneAPI.
Short story, unless you are doing trivial data-parallel operations, like in SimSIMD, compilers are practically useless. As a proof, I wrote what is now the StringZilla library (https://github.com/ashvardanian/stringzilla) and we've spent weeks with an Intel team, tuning the compiler, no result. So if you are processing a lot of strings, or variable-length coded data, like compression/decompression, hand-written SIMD kernels are pretty much unbeatable.
Stringzilla: 10x Faster SIMD-accelerated String Class
1 project | /r/programming | 30 Aug 2023
Stringzilla: 10x faster SIMD-accelerated Python `str` class
2 projects | /r/Python | 30 Aug 2023

Blog post
Stringzilla: Fastest string sort, search, split, and shuffle using SIMD
9 projects | news.ycombinator.com | 29 Aug 2023

Copying my feedback from reddit[1], where I discussed it in the context of the `memchr` crate.[2]
I took a quick look at your library implementation and have some notes:
* It doesn't appear to query CPUID, so I imagine the only way it uses AVX2 on x86-64 is if the user compiles with that feature enabled explicitly. (Or uses something like [`x86-64-v3`](https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...).) The `memchr` crate doesn't need that. It will use AVX2 even if the program isn't compiled with AVX2 enabled so long as the current CPU supports it.
* Your substring routines have multiplicative worst case (that is, `O(m * n)`) running time. The `memchr` crate only uses SIMD for substring search for smallish needles. Otherwise it flips over to Two-Way with a SIMD prefilter. You'll be fine for short needles, but things could go very very badly for longer needles.
* It seems quite likely that your [confirmation step](https://github.com/ashvardanian/Stringzilla/blob/fab854dc4fd...) is going to absolutely kill performance for even semi-frequently occurring candidates. The [`memchr` crate utilizes information from the vector step to limit where and when it calls `memcmp`](https://github.com/BurntSushi/memchr/blob/46620054ff25b16d22...). Your code might do well in cases where matches are very rare. I took a quick peek at your benchmarks and don't see anything that obviously stresses this particular case. For substring search, the `memchr` crate uses a variant of the "[generic SIMD](http://0x80.pl/articles/simd-strfind.html#first-and-last)" algorithm. Basically, it takes two bytes from the needle, looks for positions where those occur and then attempts to check whether that position corresponds to a match. It looks like your technique uses the first 4 bytes. I suspect that might be overkill. (I did try using 3 bytes from the needle and found that it was a bit slower in some cases.) That is, two bytes is usually enough predictive power to lower the false positive rate enough. Of course, one can write pathological inputs that cause either one to do better than the other. (The `memchr` crate benchmark suite has a [collection of pathological inputs](https://github.com/BurntSushi/memchr/blob/46620054ff25b16d22...).)
It would actually be possible to hook Stringzilla up to `memchr`'s benchmark suite if you were interested. :-)
[1]: https://old.reddit.com/r/rust/comments/163ph8r/memchr_26_now...
[2]: https://github.com/BurntSushi/memchr
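
The two-byte prefilter described in the feedback above can be sketched in scalar form as follows. This is only an illustration of the "first and last byte" idea, not the `memchr` crate's or StringZilla's code: a real implementation would compare many haystack positions per vector instruction, and the function name `find_first_last` is an assumption.

```python
def find_first_last(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of `needle`, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    first, last = needle[0], needle[-1]
    for i in range(n - m + 1):
        # Cheap filter: both probe bytes must match before the full check.
        if haystack[i] == first and haystack[i + m - 1] == last:
            # Confirmation step: compare the bytes between the two probes.
            if haystack[i + 1:i + m - 1] == needle[1:-1]:
                return i
    return -1
```

Note that this sketch still has the multiplicative `O(m * n)` worst case mentioned in the feedback, for example when the haystack and needle consist of one repeated byte except at a middle position of the needle.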
Show HN: Faking SIMD to Search and Sort Strings 5x Faster
1 project | news.ycombinator.com | 26 Aug 2023

I took a look at Stringzilla (https://github.com/ashvardanian/stringzilla), and in addition to the impressive benchmarks, the API looks pretty straightforward. It's a new star in my collection!
Thanks for open-sourcing this project!

compiler-explorer

Posts with mentions or reviews of compiler-explorer. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-04-28.

What if null was an Object in Java?
3 projects | news.ycombinator.com | 28 Apr 2024

At least on android arm64, looks like a `dmb ishst` is emitted after the constructor, which allows future loads to not need an explicit barrier. Removing `final` from the field causes that barrier to not be emitted.
https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
Ask HN: Which books/resources to understand modern Assembler?
6 projects | news.ycombinator.com | 21 Apr 2024
3rd Edition of Programming: Principles and Practice Using C++ by Stroustrup
6 projects | news.ycombinator.com | 19 Apr 2024

You said You won't get "extreme performance" from C++ because it is buried under the weight of decades of compatibility hacks.
Now your whole comment is about vector behavior. You haven't talked about what 'decades of compatibility hacks' are holding back performance. Whatever behavior you want from a vector is not a language limitation.
You could write your own vector and be done with it, although I'm still not sure what you mean, since once you reserve capacity a vector still doubles capacity when you overrun it. The reason this is never a performance obstacle is that if you're going to use more memory anyway, you reserve more up front. This is what any normal programmer does and they move on.
Show what you mean here:
https://godbolt.org/
I've never used ISPC. It's somewhat interesting although since it's Intel focused of course it's not actually portable.
I guess now the goal posts are shifting. First it was that "C++ as a language has performance limitations" now it's "rust has a vector that has a function I want and also I want SIMD stuff that doesn't exist. It does exist? not like that!"
Try to stay on track. You said there were "decades of compatibility hacks" holding back C++ performance then you went down a rabbit hole that has nothing to do with supporting that.
C++ Insights – See your source code with the eyes of a compiler
5 projects | news.ycombinator.com | 5 Apr 2024

C++ Insights is available online at https://cppinsights.io/
It is also available at a touch of a button within the most excellent https://godbolt.org/
alongside the button that takes your code sample to https://quick-bench.com/
Those sites and https://cppreference.com/ are what I'm using constantly while coding.
I recently discovered https://whitebox.systems/ It's a local app with a $69 one-time charge. And, it only really works with "C With Classes" style functions. But, it looks promising as another productivity boost.
Ask HN: How can I learn about performance optimization?
6 projects | news.ycombinator.com | 2 Mar 2024

[P&H RISC] https://www.google.com/books/edition/_/e8DvDwAAQBAJ
Compiler Explorer by Matt Godbolt [Godbolt] can help better understand what code a compiler generates under different circumstances.
[Godbolt] https://godbolt.org
The official CPU architecture manuals from CPU vendors are surprisingly readable and information-rich. I only read the fragments that I need or that I am interested in and move on. Here is Intel's [Intel]. I use the Combined Volume Set, which is a huge PDF comprising all ten volumes. It is easier to search when it's all in one file. I can open several copies on different pages to make navigation easier.
Intel also has a whole optimization reference manual [Intel] (scroll down, it’s all on the same page). The manual helps understand what exactly the CPU is doing.
[Intel] https://www.intel.com/content/www/us/en/developer/articles/t...
Personally, I believe in automated benchmarks that measure end-to-end what is actually important and notify you when a change impacts performance for the worse.
Managing mutable data in Elixir with Rust
1 project | news.ycombinator.com | 16 Feb 2024
Let's compile it with https://godbolt.org/, turn on some optimisations and inspect the IR (-O2 -emit-llvm). Copying out the part that corresponds to the while loop:
```
  4:
```
Free MIT Course: Performance Engineering of Software Systems
4 projects | news.ycombinator.com | 10 Jan 2024

resources were extra useful when building deeper intuitions about GPU performance for ML models at work and in graduate school.
- CMU's "Deep Learning Systems" Course is hosted online and has YouTube lectures online. While not generally relevant to software performance, it is especially useful for engineers interested in building strong fundamentals that will serve them well when taking ML models into production environments: https://dlsyscourse.org/
- Compiler Explorer is a tool that allows you easily input some code in and check how the assembly output maps to the source. I think this is exceptionally useful for beginner/intermediate programmers who are familiar with one compiled high-level language and have not been exposed to reading lots of assembly. It is also great for testing how different compiler flags affect assembly output. Many people used to coding in C and C++ probably know about this, but I still run into people who haven't so I share it whenever performance comes up: https://godbolt.org/
Verifying Rust Zeroize with Assembly...including portable SIMD
1 project | dev.to | 10 Jan 2024

To really understand what's going on here we can look at the compiled assembly code. I'm working on a Mac and can do this using the objdump tool. Compiler Explorer is also a handy tool but doesn't seem to support Arm assembly which is what Rust will use when compiling on Apple Silicon.
4B If Statements
5 projects | news.ycombinator.com | 27 Dec 2023
Operator precedence doubt
1 project | /r/cprogramming | 11 Dec 2023

Play around with it in godbolt if you're really curious: https://godbolt.org/

What are some alternatives?

When comparing StringZilla and compiler-explorer you can also consider the following projects:

usearch - Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍

C++ Format - A modern formatting library

Simd - C++ image processing and machine learning library with using of SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX(Altivec) and VSX(Power7) for PowerPC, NEON for ARM.

rust - Empowering everyone to build reliable and efficient software.

aho-corasick - A fast implementation of Aho-Corasick in Rust.

format-benchmark - A collection of formatting benchmarks

rust-memchr - Optimized string search routines for Rust.

papers - ISO/IEC JTC1 SC22 WG21 paper scheduling and management

popular-baby-names - 1,000 most popular names for baby boys and girls in CSV and JSON formats. Generator written in Python.

rustc_codegen_gcc - libgccjit AOT codegen for rustc

rebar - A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.

firejail - Linux namespaces and seccomp-bpf sandbox
