Using AWS and Hyperscan to match regular expressions on 100GB of text

This page summarizes the projects mentioned and recommended in the original post on dev.to

  • smart_open

    Utils for streaming large files (S3, HDFS, gzip, bz2...)

  • If you didn’t follow along with the first article in this series, you should be able to follow this article with your own dataset as long as you install smart_open and Meadowrun. smart_open is an amazing library that lets you open objects in S3 (and other cloud object stores) as if they’re files on your filesystem, and Meadowrun makes it easy to run your Python code on the cloud.
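To make the "as if they're files" point concrete, here is a minimal sketch of a line-counting helper that takes the opener as a parameter. The function name `count_matching_lines` and the bucket and key in the comments are hypothetical, not taken from the original post. The sketch relies only on `smart_open.open` accepting the same `(uri, mode)` arguments as the built-in `open`.

```python
import re
from typing import IO, Callable


def count_matching_lines(
    uri: str,
    pattern: str,
    opener: Callable[..., IO[str]] = open,
) -> int:
    """Stream `uri` line by line and count the lines where `pattern` matches."""
    regex = re.compile(pattern)
    with opener(uri, "r") as lines:
        # Iterating the file object reads one line at a time, so the whole
        # file never has to fit in memory.
        return sum(1 for line in lines if regex.search(line))


# Local file, using the built-in open:
#     count_matching_lines("articles.txt", r"\bregex\b")
#
# S3 object, assuming smart_open is installed and AWS credentials are
# configured (the bucket and key are hypothetical):
#     import smart_open
#     count_matching_lines("s3://my-bucket/articles.txt", r"\bregex\b",
#                          opener=smart_open.open)
```

Because the opener is passed in, the same code path can be exercised against a small local file before pointing it at a large object in S3.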

  • RE2

    RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

  • Let’s try re2, a regular expression engine built by Google primarily with the goal of taking linear time to search a string for any regular expression. For context, Python’s built-in re library uses a backtracking approach, which can take exponential time to search a string. re2 uses a Thompson NFA approach, which guarantees linear-time search but offers fewer features.
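The difference between the two approaches can be illustrated with a toy matcher. The sketch below is not how Python's re or re2 is implemented. It handles only single characters and optional characters, all names in it are hypothetical, and it counts steps instead of measuring time. It uses the classic pathological pattern of n optional `a`s followed by n required `a`s, matched against a string of n `a`s.

```python
def make_pattern(n):
    """Tokens for n optional 'a's followed by n required 'a's."""
    return [("a", True)] * n + [("a", False)] * n


def backtracking_match(tokens, text):
    """Full match that tries one choice at a time and undoes it on failure."""
    steps = 0

    def go(i, j):
        nonlocal steps
        steps += 1
        if i == len(tokens):
            return j == len(text)
        char, optional = tokens[i]
        # Greedy: try to consume a character first...
        if j < len(text) and text[j] == char and go(i + 1, j + 1):
            return True
        # ...and only then fall back to skipping an optional token.
        return optional and go(i + 1, j)

    return go(0, 0), steps


def skip_optionals(tokens, states):
    """Add every state reachable by skipping optional tokens."""
    reachable = set()
    for state in states:
        reachable.add(state)
        while state < len(tokens) and tokens[state][1]:
            state += 1
            reachable.add(state)
    return reachable


def state_set_match(tokens, text):
    """Full match that tracks all possible pattern positions at once."""
    steps = 0
    current = skip_optionals(tokens, {0})
    for char in text:
        advanced = set()
        for state in current:
            steps += 1
            if state < len(tokens) and tokens[state][0] == char:
                advanced.add(state + 1)
        current = skip_optionals(tokens, advanced)
    return len(tokens) in current, steps


if __name__ == "__main__":
    for n in (8, 12, 16):
        tokens, text = make_pattern(n), "a" * n
        print(n, backtracking_match(tokens, text), state_set_match(tokens, text))
```

In the state-set version, each input character is checked against at most one state per pattern position, so the work is bounded by the text length times the pattern length. The backtracking version may revisit the same position many times through different choices of which optional tokens to consume.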

  • hyperscan

    High-performance regular expression matching library

  • Our last stop is Hyperscan, a regular expression engine originally built with an eye towards deep packet inspection by Sensory Networks, a startup that Intel acquired in 2013. Hyperscan has a ton of really cool parts to it: there’s a good overview by a maintainer, Geoff Langdale, and the paper goes into more depth. I’ll just highlight one of my favorites, which is its extensive use of SIMD instructions for searching strings.
