Chunking strings in Elixir: how difficult can it be?

bstr

10 744 6.7 Rust

A string type for Rust that is not required to be valid UTF-8.

I am the author of bstr and of the regex implementation that bstr uses for word breaking, and I can say that it runs in linear time.
NSFL: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...

unicode_string

2 18 8.5 Elixir

String utilities based upon Unicode sets

If you look at how segmentation is implemented in unicode_string (https://github.com/elixir-unicode/unicode_string/blob/master...), it makes many calls to Regex.split as it moves along the string, each time on the part of the string that still remains to be processed. The problem is that Regex.split, as implemented in Elixir, runs the regex over the whole string looking for every match, even if you pass a limit on the number of parts, so the segmentation algorithm ends up with quadratic complexity. The regular expression also runs a lot slower when it contains Unicode ranges. However, the Erlang re engine does not validate whether the binary is valid UTF-8 before running the regular expression match. The errors they were getting occurred because Regex.split() causes the Erlang re engine to match against the whole string.
For example:
    iex> v = "£AXAX" <> String.duplicate("A", 10_000) <> "\xFF"
    iex> {:ok, r} = :re.compile("[£-¨]")
    iex> :timer.tc(fn -> :re.run(v, r) end)
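One way to avoid rescanning the whole remainder at every step is to ask the engine for the first match only and cut the binary by hand. The following is a minimal sketch of that idea, not how unicode_string or Elixir itself does it. `FirstSplit` and `split_first/2` are hypothetical names introduced here, and the sketch assumes `Regex.run/3` stops at the first match, since it does not request a global search.

```elixir
defmodule FirstSplit do
  # Hypothetical helper, not part of Elixir or unicode_string.
  # Splits `string` at the first match of `regex` and returns {head, rest}.
  # Regex.run/3 looks for a single match, so matching can stop as soon as
  # one is found instead of collecting every match in the string.
  def split_first(%Regex{} = regex, string) when is_binary(string) do
    case Regex.run(regex, string, return: :index) do
      # Offsets are byte offsets, so binary_part/3 is safe for UTF-8 input too.
      [{start, len} | _captures] ->
        rest_start = start + len
        head = binary_part(string, 0, start)
        rest = binary_part(string, rest_start, byte_size(string) - rest_start)
        {head, rest}

      # No match: the whole string is the head and nothing remains.
      nil ->
        {string, ""}
    end
  end
end

# Usage: FirstSplit.split_first(~r/,/, "a,b,c") returns {"a", "b,c"}
```

A segmentation loop built on a helper like this would only pay for the text up to each boundary, assuming the pattern can decide a match without looking far ahead. Patterns that can match the empty string would need extra care to guarantee progress.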

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

bstr 1.0 request for comments
2 projects | /r/rust | 5 Jul 2022

A byte string library for Rust
5 projects | news.ycombinator.com | 8 Sep 2022

uni-algo: Unicode Algorithms Implementation for C/C++
1 project | news.ycombinator.com | 25 Mar 2024

What are your favorite utility libraries?
1 project | /r/Zig | 21 Feb 2023

Where is the `str` struct/primitive defined ? I am learning Rust, so don't shoot please :).
3 projects | /r/rust | 29 Aug 2022

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Bytes Unicode byte-string utf-8 substring-search
Post date: 4 Jan 2023
