The Boyer-Moore Fast String Searching Algorithm

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
WorkOS - The modern identity platform for B2B SaaS
The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
workos.com
featured
  • regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

  • It does actually! But only in some cases. See https://github.com/BurntSushi/ripgrep/issues/617#issuecommen... and https://github.com/rust-lang/regex/issues/408 and https://github.com/rust-lang/regex/pull/419

    Note though that it is going away soon. It's going to be replaced with a combination of Two-Way, Rabin-Karp and a fast vectorized prefilter based on this: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generi...

    The "secret sauce" is that in the vectorized prefilter, we don't just choose the first and last bytes. Instead, we pick bytes that we think will be rare based on a priori assumption of relative byte frequencies.

  • ripgrep

    ripgrep recursively searches directories for a regex pattern while respecting your gitignore

  • It does actually! But only in some cases. See https://github.com/BurntSushi/ripgrep/issues/617#issuecommen... and https://github.com/rust-lang/regex/issues/408 and https://github.com/rust-lang/regex/pull/419

    Note though that it is going away soon. It's going to be replaced with a combination of Two-Way, Rabin-Karp and a fast vectorized prefilter based on this: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generi...

    The "secret sauce" is that in the vectorized prefilter, we don't just choose the first and last bytes. Instead, we pick bytes that we think will be rare based on a priori assumption of relative byte frequencies.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • rust-memchr

    Optimized string search routines for Rust.

  • > I'm curious how you can speed up finding a single byte? Don't you have to touch all bytes at that point? Leaning heavily on simd instructions?

    Yes. That's what memchr does: https://github.com/BurntSushi/rust-memchr/blob/427fdc384007d...

    > And I realize newlines is only to track line counts. Am I wrong that that slows things down?

    It does, but only a little if it's implemented using vector instructions. You can try with ripgrep using -N/-n. GNU grep's line counting isn't vectorized IIRC, so you might notice bigger differences there.

  • fast_strstr

    A fast substitution to the stdlib's strstr() sub-string search function.

  • https://github.com/RaphaelJ/fast_strstr/tree/master/benchmar...

    While the algorithm is time linear (Boyer-Moore is sub linear), it ends up being significantly faster on textual content as the per character operations are significantly simplier.

  • fastsearch

    Implements a search algorithm I invented decades ago in 64 bit pascal

  • I went through the work, and re-implemented it as an assembly language routine inside a 64bit free pascal program.

    It searches about 1 gigabyte of text in 2 seconds.

    https://github.com/mikewarot/fastsearch/blob/master/fastsear...

  • smart

    String Matching Algorithms Research Tool (by smart-tool)

  • In the libc's not, indeed. They are still in the stoneage of string search.

    But the fastest is currently EPSM, by S. Faro and O. M. Kulekci. See https://smart-tool.github.io/smart/

    "Exact Packed String Matching" optimized for SIMD SSE4.2/AVX (x86_64 and aarch64). It performs stable and best on all sizes.

    The site I linked to compares 199 fast string search algorithms, with the usual ones (BM, KMP, BMH) being pretty slow. EPSM outperforms all the others being mentioned here on these platforms. It's also the latest.

  • boyermoore

    Implementation of Boyer-Moore fast string search algorithm in Go

  • I implemented it in Go for fun. I expected Go's standard lib will beat it but it perform around 10 times faster than `strings.Index`.

    https://github.com/sarpdag/boyermoore

    Go's strings package depends on the substring size tries different approaches like Rabin Karp algorithm, (https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm)

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • Website Question

    3 projects | /r/nfl | 15 Nov 2022
  • Be careful of the examples you use. They stick

    2 projects | news.ycombinator.com | 22 Aug 2023
  • How do I learn grep in detail. I work at support and day to day need to have to deal with heavy logs. I have seen people playing with operators and other stuff to get the desired output. Someone was using Sed also. Please help.🙏

    2 projects | /r/linuxquestions | 20 Mar 2022
  • Ripgrep

    1 project | news.ycombinator.com | 25 Feb 2024
  • CryptoFlow: Building a secure and scalable system with Axum and SvelteKit - Part 3

    3 projects | dev.to | 8 Jan 2024