Let's Stop Ascribing Meaning to Code Points (2017)

language-server-protocol

121 10,725 8.7 HTML

Defines a common protocol for language servers.

Yup, code points are the way to go. The LSP is a great example of what happens otherwise [1]. The LSP authors made the controversial decision to sync documents using line/column positions with UTF-16 code unit offsets rather than relying on code points. I fear they dug the hole deeper by expanding the set of supported encodings to UTF-8 and UTF-32 rather than standardizing on code points. I can only hope this is treated as a stopgap and a proper switch to code points happens in the next major revision.
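
To make the offset problem concrete, here is a minimal sketch (not taken from the LSP specification) that measures the same column three ways: in UTF-8 code units, UTF-16 code units, and code points. The helper name `columns` and the sample line are assumptions made for illustration only.

```rust
/// Hypothetical helper: the column of `byte_idx` within `line`, expressed as
/// (UTF-8 code units, UTF-16 code units, code points).
/// Panics if `byte_idx` does not fall on a character boundary.
fn columns(line: &str, byte_idx: usize) -> (usize, usize, usize) {
    let prefix = &line[..byte_idx];
    (
        prefix.len(),                  // UTF-8 code units (bytes)
        prefix.encode_utf16().count(), // UTF-16 code units
        prefix.chars().count(),        // code points (Unicode scalar values)
    )
}

fn main() {
    // 'a', U+1F600 (outside the BMP), U+00E9 (precomposed e-acute), then " = 1".
    let line = "a\u{1F600}\u{e9} = 1";
    let idx = line.find('=').expect("sample line contains '='");
    let (utf8, utf16, code_points) = columns(line, idx);
    // prints "utf8=8 utf16=5 code_points=4"
    println!("utf8={utf8} utf16={utf16} code_points={code_points}");
}
```

A client and server that disagree on which of these three numbers a "character" offset means will point at different places in the same line as soon as it contains non-ASCII text.
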
It's not immediately obvious, but line/column positions are not a good synchronization method either. Unicode has its own definition of what constitutes a line terminator [2], but different programming languages and environments might have their own definitions.
[1] https://github.com/microsoft/language-server-protocol/issues...
[2] https://en.wikipedia.org/wiki/Newline#Unicode
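
As a rough illustration of the line-terminator mismatch, the sketch below compares Rust's `str::lines`, which splits only on LF and CRLF, with a hand-written counter that treats the terminators listed in the Wikipedia article [2] (LF, VT, FF, CR, CRLF, NEL, LS, PS) as line breaks. The function names are hypothetical, and the counter is an assumption about how a "Unicode-aware" tool might behave, not any particular editor's implementation.

```rust
/// Hypothetical helper: true for the single-code-point line terminators
/// LF, VT, FF, CR, NEL (U+0085), LS (U+2028), and PS (U+2029).
fn is_unicode_line_terminator(c: char) -> bool {
    matches!(
        c,
        '\n' | '\u{000B}' | '\u{000C}' | '\r' | '\u{0085}' | '\u{2028}' | '\u{2029}'
    )
}

/// Counts lines, treating CR LF as one terminator and any other
/// terminator above as a break. A trailing terminator does not
/// start an extra empty line, matching `str::lines`.
fn unicode_line_count(text: &str) -> usize {
    let mut lines = 0;
    let mut pending = false; // true while an unterminated line is open
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if is_unicode_line_terminator(c) {
            if c == '\r' && chars.peek() == Some(&'\n') {
                chars.next(); // consume the LF of a CR LF pair
            }
            lines += 1;
            pending = false;
        } else {
            pending = true;
        }
    }
    if pending {
        lines += 1;
    }
    lines
}

fn main() {
    let text = "a\nb\u{2028}c\u{0085}d\r\ne";
    // prints "str::lines: 3, unicode-aware: 5"
    println!(
        "str::lines: {}, unicode-aware: {}",
        text.lines().count(),
        unicode_line_count(text)
    );
}
```

Two tools that each count lines "correctly" by their own rules can therefore disagree on the line number of the same character.
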

aho-corasick

21 950 7.2 Rust

A fast implementation of Aho-Corasick in Rust.

This is just an FYI. I don't mean to say much to your overall point, although, as someone else who has spent a lot of time doing Unicode-y things, I do tend to agree with you. I had a very similar discussion a bit ago.[1]
Putting that aside, at least with respect to grapheme segmentation, it might be a little simpler than you think. But maybe only a little. The unicode-segmentation crate also does word segmentation, which is quite a bit more complicated than grapheme segmentation. For example, you can write a regex to parse graphemes without too much fuss[2]. (Compare that with the word segmentation regex, much to my chagrin.[3]) Once you build the regex, actually using it is basically as simple as running the regex.[4]
Sadly, not all regex engines will be able to parse that regex due to its use of somewhat obscure Unicode properties. But the Rust regex crate can. :-)
And of course, this somewhat shifts code size to heap size. So there's that too. But the bottom line is, if you have a nice regex engine available to you, you can whip up a grapheme segmenter pretty quickly. And some regex engines even have grapheme segmentation built in via \X.
[1]: https://github.com/BurntSushi/aho-corasick/issues/72
[2]: https://github.com/BurntSushi/bstr/blob/e38e7a7ca986f9499b30...
[3]: https://github.com/BurntSushi/bstr/blob/e38e7a7ca986f9499b30...
[4]: https://github.com/BurntSushi/bstr/blob/e38e7a7ca986f9499b30...
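
For a feel of what such a segmenter has to decide, here is a deliberately tiny, standard-library-only sketch. It is not the bstr regex and not a complete implementation of the Unicode grapheme cluster rules: it only keeps CR LF together and attaches a small, hand-picked set of code points (combining diacritical marks U+0300..U+036F, variation selectors, and the zero width joiner) to the preceding code point. The names `extends_previous` and `toy_graphemes` are hypothetical; real text (Hangul, regional indicators, emoji ZWJ sequences, and so on) needs the full rules.

```rust
/// Hypothetical helper: a small subset of code points that attach to the
/// preceding code point. The real Unicode property is far larger.
fn extends_previous(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'     // combining diacritical marks
        | '\u{FE00}'..='\u{FE0F}'   // variation selectors
        | '\u{200D}'                // zero width joiner
    )
}

/// Splits `s` into approximate grapheme clusters using only two rules:
/// never break between CR and LF, and never break before an extending
/// code point unless the previous code point is a control character.
fn toy_graphemes(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (i, c) in s.char_indices() {
        if let Some(p) = prev {
            let keep_together =
                (p == '\r' && c == '\n') || (extends_previous(c) && !p.is_control());
            if !keep_together {
                clusters.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(c);
    }
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}

fn main() {
    // "e" + combining acute accent, "a", CR LF, "b": six code points.
    let text = "e\u{301}a\r\nb";
    let clusters = toy_graphemes(text);
    // prints "6 code points, 4 clusters"
    println!("{} code points, {} clusters", text.chars().count(), clusters.len());
}
```

The regex approach described above expresses the full rule set declaratively instead of as hand-written branches, which is why it stays manageable as the rules grow.
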

bstr

10 744 6.7 Rust

A string type for Rust that is not required to be valid UTF-8.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

how to get the index of substring in source string, support unicode in rust.

1 project | /r/rust | 5 Nov 2023

Aho Corasick Algorithm For Efficient String Matching (Python & Golang Code Examples)

1 project | /r/programming | 6 Oct 2023

When counting lines in Ruby randomly failed our deployments

4 projects | /r/ruby | 22 Sep 2023

Aho-corasick (and the regex crate) now uses SIMD on aarch64

2 projects | news.ycombinator.com | 18 Sep 2023

Are crate versions numbers all low because Rust just works?

4 projects | /r/rust | 15 Aug 2022

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
aho-corasick Bytes substring-matching Unicode finite-state-machine
Post date: 26 Jun 2022
