Chunking strings in Elixir: how difficult can it be?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • bstr

    A string type for Rust that is not required to be valid UTF-8.

  • As the author of bstr, and also of the regex implementation that bstr uses to implement word breaking, I can say that it runs in linear time.

    NSFL: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...

  • unicode_string

    String utilities based upon Unicode sets

  • If you look at how segmentation is implemented in unicode_string (https://github.com/elixir-unicode/unicode_string/blob/master...), it makes a lot of calls to Regex.split as it moves along the string, each time on the remaining part of the string it still has to process. The problem is that Regex.split, as implemented in Elixir, runs the regex over the whole string looking for all the matches, even if you include a limit on the number of parts, so the segmentation algorithm ends up with quadratic complexity. It also runs a lot slower when there are Unicode ranges in the regular expression. However, the Erlang re engine does not validate whether the binary is valid UTF-8 before running the regular expression matching. The errors they were getting occurred because Regex.split() causes the Erlang re engine to match the whole string.

    For example:

        iex(86)> v = "£AXAX" <> String.duplicate("A", 10_000) <> "\xFF"; {:ok, r} = :re.compile("[£-¨]"); :timer.tc(fn() -> :re.run(v, r) end)
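The access pattern described in the comment above can be sketched roughly as follows. This is a minimal illustration, not unicode_string's actual code: the module `ChunkSketch` and both of its functions are hypothetical names introduced here. It peels off one piece at a time by calling `Regex.split/3` with `parts: 2` on the remainder, and compares that with a single `Regex.split/2` call over the whole input. If each limited call still scans the whole remainder, as the comment states, the repeated variant does work proportional to the square of the input length.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper, not part of unicode_string: takes one piece at a
  # time by calling Regex.split/3 on whatever is left of the input.
  def split_repeatedly(string, regex), do: do_split(string, regex, [])

  defp do_split(rest, regex, acc) do
    case Regex.split(regex, rest, parts: 2) do
      # A match was found: keep the head and continue with the tail.
      [head, tail] -> do_split(tail, regex, [head | acc])
      # No match left: the remainder is the final piece.
      [last] -> Enum.reverse([last | acc])
    end
  end

  # One call over the whole input, for comparison.
  def split_once(string, regex), do: Regex.split(regex, string)
end

input = String.duplicate("ab,", 2_000)
regex = ~r/,/

{t_repeated, parts} = :timer.tc(fn -> ChunkSketch.split_repeatedly(input, regex) end)
# The pin (^parts) checks that both approaches produce the same pieces.
{t_once, ^parts} = :timer.tc(fn -> ChunkSketch.split_once(input, regex) end)

IO.puts("repeated: #{t_repeated} microseconds, single call: #{t_once} microseconds")
```

Timings depend on the machine and the Elixir/OTP version, so no expected numbers are given. Increasing the repeat count should make any gap between the two approaches easier to see.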



Related posts

  • bstr 1.0 request for comments

    2 projects | /r/rust | 5 Jul 2022
  • A byte string library for Rust

    5 projects | news.ycombinator.com | 8 Sep 2022
  • uni-algo: Unicode Algorithms Implementation for C/C++

    1 project | news.ycombinator.com | 25 Mar 2024
  • What are your favorite utility libraries?

    1 project | /r/Zig | 21 Feb 2023
  • Where is the `str` struct/primitive defined ? I am learning Rust, so don't shoot please :).

    3 projects | /r/rust | 29 Aug 2022