Let's Stop Ascribing Meaning to Code Points (2017)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • language-server-protocol

    Defines a common protocol for language servers.

  • Yup, code points are the way to go. The LSP is a great example of what happens otherwise [1]. The LSP authors made the controversial decision to synchronize documents using line/column positions, with columns measured in UTF-16 code units, rather than relying on code points. I fear they dug the hole deeper by expanding the set of supported encodings to UTF-8 and UTF-32 rather than standardizing on code points. I can only hope this is treated as a stopgap and a proper switch to code points happens in the next major revision.

    It's not immediately obvious, but line/column positions are not a good synchronization method either. Unicode has its own definition of what constitutes a line terminator [2], but different programming languages and environments may define it differently.

    [1] https://github.com/microsoft/language-server-protocol/issues...

    [2] https://en.wikipedia.org/wiki/Newline#Unicode
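    To make the mismatch concrete, here is a minimal sketch (in Python for brevity; the example string and variable names are mine, not from the linked issue) showing how one caret position yields different column values under each encoding, and how Unicode line terminators complicate line counting:

    ```python
    # Illustrative only: the same caret position, three different "columns".
    line = "a\U0001F600b"  # 'a', U+1F600 GRINNING FACE (non-BMP), 'b'
    caret = 2              # code point index of 'b'

    utf16_col = len(line[:caret].encode("utf-16-le")) // 2  # UTF-16 code units
    utf8_col = len(line[:caret].encode("utf-8"))            # UTF-8 bytes

    print(caret, utf16_col, utf8_col)  # 2 code points, 3 UTF-16 units, 5 UTF-8 bytes

    # Line counts disagree too: Python's splitlines() honors Unicode line
    # terminators such as U+2028 LINE SEPARATOR, while a plain "\n" split does not.
    text = "foo\u2028bar"
    print(len(text.splitlines()), len(text.split("\n")))  # 2 vs 1
    ```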

  • aho-corasick

    A fast implementation of Aho-Corasick in Rust.

  • This is just an FYI. I don't mean to say much to your overall point, although, as someone else who has spent a lot of time doing Unicode-y things, I do tend to agree with you. I had a very similar discussion a bit ago.[1]

    Putting that aside, at least with respect to grapheme segmentation, it might be a little simpler than you think. But maybe only a little. The unicode-segmentation crate also does word segmentation, which is quite a bit more complicated than grapheme segmentation. For example, you can write a regex to parse graphemes without too much fuss[2]. (Compare that with the word segmentation regex, much to my chagrin.[3]) Once you build the regex, actually using it is basically as simple as running the regex.[4]

    Sadly, not all regex engines will be able to parse that regex due to its use of somewhat obscure Unicode properties. But the Rust regex crate can. :-)

    And of course, this somewhat shifts code size to heap size. So there's that too. But bottom line is, if you have a nice regex engine available to you, you can whip up a grapheme segmenter pretty quickly. And some regex engines even have grapheme segmentation built in via \X.

    [1]: https://github.com/BurntSushi/aho-corasick/issues/72

    [2]: https://github.com/BurntSushi/bstr/blob/e38e7a7ca986f9499b30...

    [3]: https://github.com/BurntSushi/bstr/blob/e38e7a7ca986f9499b30...

    [4]: https://github.com/BurntSushi/bstr/blob/e38e7a7ca986f9499b30...
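    For readers without one of those regex engines handy, the code point vs. grapheme gap is easy to demonstrate in a few lines. This is a deliberately simplified sketch (combining marks only, nothing like the full UAX #29 rules that \X or the unicode-segmentation crate implement), in Python for brevity:

    ```python
    import unicodedata

    s = "cafe\u0301"  # "café" written as 'e' + U+0301 COMBINING ACUTE ACCENT
    print(len(s))     # 5 code points, but only 4 user-perceived characters

    # Toy clustering: glue combining marks onto the preceding base character.
    # Real grapheme segmentation also handles ZWJ emoji sequences, regional
    # indicators, Hangul jamo, etc. -- hence the Unicode properties the
    # grapheme regex above relies on.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    print(len(clusters))  # 4
    ```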

  • bstr

    A string type for Rust that is not required to be valid UTF-8.



Related posts

  • how to get the index of substring in source string, support unicode in rust.

    1 project | /r/rust | 5 Nov 2023
  • Aho Corasick Algorithm For Efficient String Matching (Python & Golang Code Examples)

    1 project | /r/programming | 6 Oct 2023
  • When counting lines in Ruby randomly failed our deployments

    4 projects | /r/ruby | 22 Sep 2023
  • Aho-corasick (and the regex crate) now uses SIMD on aarch64

    2 projects | news.ycombinator.com | 18 Sep 2023
  • Are crate versions numbers all low because Rust just works?

    4 projects | /r/rust | 15 Aug 2022