-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
As the author of bstr and also the regex implementation that bstr uses to implement word breaking, it is linear time.
NSFL: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...
In unicode_string if you have a look at how the segmentation is implemented (https://github.com/elixir-unicode/unicode_string/blob/master...) it will make a lot of calls to Regex.split and it does this as it moves along the string. It will be calling Regex.split on the remaining part of the string it needs to work on a lot. The problem is Regex.split as implemented in Elixir just runs the regex on the whole string looking for all the matches even if you include a limit on the number of parts so the segmentation algorithm is going to have quadratic complexity. Also, when there are unicode ranges in the regular expression it runs a lot slower. However, the erlang re engine is not validating whether the binary is valid utf8 or not before running the regular expression matching. These errors they were getting was because Regex.split() causes the erlang re engine to match the whole string.
For example:
iex(86)> v = "£AXAX" <> String.duplicate("A", 10_000) <> "\xFF"; {:ok, r} = :re.compile("[£-¨]"); :timer.tc(fn() -> :re.run(v, r) end)