-
regex
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
-
ripgrep
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
It does actually! But only in some cases. See https://github.com/BurntSushi/ripgrep/issues/617#issuecommen... and https://github.com/rust-lang/regex/issues/408 and https://github.com/rust-lang/regex/pull/419
Note though that it is going away soon. It's going to be replaced with a combination of Two-Way, Rabin-Karp and a fast vectorized prefilter based on this: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generi...
The "secret sauce" is that in the vectorized prefilter, we don't just choose the first and last bytes. Instead, we pick bytes that we think will be rare based on a priori assumption of relative byte frequencies.
It does actually! But only in some cases. See https://github.com/BurntSushi/ripgrep/issues/617#issuecommen... and https://github.com/rust-lang/regex/issues/408 and https://github.com/rust-lang/regex/pull/419
Note though that it is going away soon. It's going to be replaced with a combination of Two-Way, Rabin-Karp and a fast vectorized prefilter based on this: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generi...
The "secret sauce" is that in the vectorized prefilter, we don't just choose the first and last bytes. Instead, we pick bytes that we think will be rare based on a priori assumption of relative byte frequencies.
> I'm curious how you can speed up finding a single byte? Don't you have to touch all bytes at that point? Leaning heavily on simd instructions?
Yes. That's what memchr does: https://github.com/BurntSushi/rust-memchr/blob/427fdc384007d...
> And I realize newlines is only to track line counts. Am I wrong that that slows things down?
It does, but only a little if it's implemented using vector instructions. You can try with ripgrep using -N/-n. GNU grep's line counting isn't vectorized IIRC, so you might notice bigger differences there.
https://github.com/RaphaelJ/fast_strstr/tree/master/benchmar...
While the algorithm is time linear (Boyer-Moore is sub linear), it ends up being significantly faster on textual content as the per character operations are significantly simplier.
I went through the work, and re-implemented it as an assembly language routine inside a 64bit free pascal program.
It searches about 1 gigabyte of text in 2 seconds.
https://github.com/mikewarot/fastsearch/blob/master/fastsear...
In the libc's not, indeed. They are still in the stoneage of string search.
But the fastest is currently EPSM, by S. Faro and O. M. Kulekci. See https://smart-tool.github.io/smart/
"Exact Packed String Matching" optimized for SIMD SSE4.2/AVX (x86_64 and aarch64). It performs stable and best on all sizes.
The site I linked to compares 199 fast string search algorithms, with the usual ones (BM, KMP, BMH) being pretty slow. EPSM outperforms all the others being mentioned here on these platforms. It's also the latest.
I implemented it in Go for fun. I expected Go's standard lib will beat it but it perform around 10 times faster than `strings.Index`.
https://github.com/sarpdag/boyermoore
Go's strings package depends on the substring size tries different approaches like Rabin Karp algorithm, (https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm)
Related posts
-
Website Question
-
Be careful of the examples you use. They stick
-
How do I learn grep in detail. I work at support and day to day need to have to deal with heavy logs. I have seen people playing with operators and other stuff to get the desired output. Someone was using Sed also. Please help.🙏
-
Ripgrep
-
CryptoFlow: Building a secure and scalable system with Axum and SvelteKit - Part 3