RE2
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
If you didn’t follow along with the first article in this series, you can still follow this one with your own dataset, as long as you install smart_open and Meadowrun. smart_open is an amazing library that lets you open objects in S3 (and other cloud object stores) as if they were files on your filesystem, and Meadowrun makes it easy to run your Python code in the cloud.
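To make the "as if they’re files" point concrete, here is a minimal sketch of reading an object line by line and counting regex matches. The helper name `count_matching_lines` is hypothetical and is not from the original series; the fallback to the built-in `open` is an assumption added only so the sketch also runs on local files when smart_open is not installed.

```python
import re

try:
    # smart_open.open accepts URIs such as "s3://bucket/key" as well as local paths.
    from smart_open import open as open_any
except ImportError:
    # Assumption for this sketch: fall back to the built-in open (local files only).
    open_any = open


def count_matching_lines(uri, pattern):
    """Count the lines of a text object that contain a match for `pattern`."""
    regex = re.compile(pattern)
    with open_any(uri, "r", encoding="utf-8") as f:
        # Iterating over the file object streams it one line at a time.
        return sum(1 for line in f if regex.search(line))
```

With smart_open installed and AWS credentials configured, the same call should work on an S3 URI, for example `count_matching_lines("s3://my-bucket/my-data.txt", r"foo")`, where the bucket and key are placeholders.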
Let’s try re2, a regular expression engine built by Google, primarily with the goal of searching a string in time linear in its length for any regular expression. For context, Python’s built-in re library uses a backtracking approach, which can take exponential time on some combinations of pattern and input. re2 uses an automata-based (Thompson NFA) approach, which guarantees linear-time search but offers fewer features; for example, it does not support backreferences or lookaround.
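The difference between the two approaches can be seen with a classic pathological pattern. The sketch below is illustrative only: it times Python’s `re` on `(a+)+$` and, for comparison, simulates a tiny hand-built NFA for the same pattern by tracking a set of live states. The NFA is written by hand for this one pattern and is not how re2 is implemented internally; the function names are hypothetical.

```python
import re
import time

# Nested quantifiers force a backtracking engine to try exponentially many
# ways of splitting the run of "a"s before it can report "no match".
PATTERN = re.compile(r"(a+)+$")


def time_backtracking_search(n):
    """Return (match, seconds) for searching 'a' * n + 'b' with Python's re."""
    text = "a" * n + "b"
    start = time.perf_counter()
    match = PATTERN.search(text)
    return match, time.perf_counter() - start


def nfa_step(states, ch):
    """Advance every live state by one character.

    State 0: unanchored start, stays alive on any character.
    State 1: inside a run of 'a's, accepting if alive at end of input.
    """
    next_states = set()
    for state in states:
        if state == 0:
            next_states.add(0)
            if ch == "a":
                next_states.add(1)
        elif state == 1 and ch == "a":
            next_states.add(1)
    return next_states


def nfa_search(text):
    """Simulate the NFA in one pass: work per character is bounded by the state count."""
    states = {0}
    for ch in text:
        states = nfa_step(states, ch)
    return 1 in states


if __name__ == "__main__":
    for n in (14, 16, 18, 20):
        match, seconds = time_backtracking_search(n)
        print(f"n={n}: re found {match}, took {seconds:.4f}s")
    print("NFA simulation on 100000 'a's + 'b':", nfa_search("a" * 100_000 + "b"))
```

On a typical machine the `re` timings should grow sharply with each increase in `n`, while the state-set simulation handles a far longer input in a single pass. Exact timings will vary, so none are shown here.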
Our last stop is Hyperscan, a regular expression engine originally built with an eye towards deep packet inspection by a startup called Sensory Networks, which was acquired by Intel in 2013. Hyperscan has a ton of really cool parts to it. There’s a good overview by one of its maintainers, Geoff Langdale, and the paper goes into more depth. I’ll just highlight one of my favorites, which is its extensive use of SIMD instructions for searching strings.