The Absolute Minimum Every Software Developer Must Know About Unicode in 2023

xi-editor

42 19,807 2.6 Rust

A modern editor with a backend written in Rust.

> thing that gets deleted when you hit backspace
Is there a canonical source for this part, by the way? Xi copied the logic from Android[1] (as per the issue you linked downthread), and I vaguely remember that CLDR had something to say about this too, but I don’t know if there’s any sort of consensus here that’s actually written down anywhere.
[1] https://github.com/xi-editor/xi-editor/pull/837

grapheme-splitter-lite

1 6 10.0 Kotlin

A lightweight Java library that breaks strings into user-perceived characters, a.k.a. grapheme clusters, for common cases.

In Java/Kotlin, I've found this Grapheme Splitter library to be useful: https://github.com/hiking93/grapheme-splitter-lite

hn-search

1,619 524 2.9 TypeScript

Hacker News Search

The author has several other writeups:
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
The cursors will only be a problem during front page HN traffic. And the opt-out for people who care is reader mode / disable js / static mirror. Not sure if there's any better way to appease the fun-havers and the plain content preferrers at the same time. Maybe a "hide cursors" button on screen? I, for one, had a delightful moment poking other cursors.

janet-utf8

1 16 10.0 C

Janet routines for utf8 handling

Regarding UTF-8 encoding:
“And a couple of important consequences:
- You CAN’T determine the length of the string by counting bytes.
- You CAN’T randomly jump into the middle of the string and start reading.
- You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of the character.”
One of the things I had to get used to when learning the programming language Janet is that strings are just plain byte sequences, unaware of any encoding. So when I call `length` on a string of one character that is represented by 2 bytes in UTF-8 (e.g. `ä`), the function returns 2 instead of 1. Similar issues occur when trying to take a substring, as mentioned by the author.
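The same effect can be sketched in Python rather than Janet (an assumption made purely for illustration): a `bytes` value counts bytes, like the strings described above, while `str` counts code points. The helper `cut_is_valid` is a hypothetical name, not part of any library.

```python
# bytes counts bytes; str counts code points.
text = "\u00e4"                 # "ä" as the single code point U+00E4
raw = text.encode("utf-8")      # two bytes: 0xC3 0xA4

byte_length = len(raw)          # length as a byte-oriented string sees it
char_length = len(text)         # length in code points


def cut_is_valid(data: bytes, end: int) -> bool:
    """Return True if data[:end] is still well-formed UTF-8."""
    try:
        data[:end].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "Käse" is 5 bytes in UTF-8: K, two bytes for ä, s, e.
word = "K\u00e4se".encode("utf-8")

print(byte_length, char_length)
print(cut_is_valid(word, 1))    # cut right after "K"
print(cut_is_valid(word, 2))    # cut in the middle of "ä"
```

Cutting at byte offset 2 splits the two-byte `ä`, which is the substring problem the quoted article warns about.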
As much as I love the approach Janet took here (it feels clean and simple and works well with their built-in PEGs), it is a bit annoying to work with outside of the ASCII range. Fortunately, there are libraries that can deal with this issue (e.g. https://github.com/andrewchambers/janet-utf8), but I wish they would support conversion to/from UTF-8 out of the box, since I generally like Janet very much.
One interesting thing I learned from the article is that the first byte of a character can always be recognized by its prefix. I always wondered how you would recognize and separate a Unicode character in a Janet string, since it may be 1 to 4 bytes long, but I guess this is the answer.
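As a rough sketch of that idea, written in Python rather than Janet: the high bits of the first byte tell you how many bytes the character occupies, so a byte string can be walked one character at a time. The function names `sequence_length`, `is_continuation`, and `split_characters` are hypothetical, and the code assumes well-formed input. It does not check continuation bytes, overlong forms, or out-of-range lead bytes.

```python
def sequence_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this byte."""
    if lead & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:   # 110xxxxx: 2 bytes
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:   # 1110xxxx: 3 bytes
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:   # 11110xxx: 4 bytes
        return 4
    # 10xxxxxx is a continuation byte; anything else is not UTF-8 at all.
    raise ValueError(f"0x{lead:02x} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True for 10xxxxxx bytes, which never start a character."""
    return byte & 0b1100_0000 == 0b1000_0000


def split_characters(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 into one bytes object per code point."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# One character each of 1, 2, 3 and 4 bytes: a, ä, €, and an emoji.
sample = "a\u00e4\u20ac\U0001F600".encode("utf-8")
print([len(chunk) for chunk in split_characters(sample)])
```

Because continuation bytes all share the `10` prefix, code that lands in the middle of a character can also step backwards until `is_continuation` is false to find where that character starts.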

text

10 297 6.8 C++

A spicy text library for C++ that has the explicit goal of enabling the entire ecosystem to share in proper forward progress towards a bright Unicode future. (by soasis)

It sounds like a generic length function in Unicode in 2023 is no longer a good idea. These articles complaining about the variety of lengths in Unicode are annoying at this point. Pretty much all of them can be summed up as, "Well, it depends." And, that isn't wrong. But nerds love to argue until they are blue in the face about the One Correct Answer. Sheesh.
This is the most interesting comparison article I have seen in years about Unicode processing in C++: https://thephd.dev/the-c-c++-rust-string-text-encoding-api-l...
The author is also the lead on an open source C++ Unicode library called ztd.txt: https://github.com/soasis/text
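To make the "it depends" concrete, here is a small Python sketch (Python chosen only for illustration, with a hypothetical helper named `lengths`) that reports three different lengths for the same string. Grapheme clusters are left out because counting them needs the Unicode text segmentation rules (UAX #29), which this sketch does not implement.

```python
def lengths(text: str) -> dict[str, int]:
    """Report several common notions of 'length' for one string."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; "-le" avoids counting a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
accented = "e\u0301"
# U+1F600, outside the Basic Multilingual Plane: a surrogate pair in UTF-16.
emoji = "\U0001F600"

print(lengths(accented))
print(lengths(emoji))
```

Each number answers a different question (storage size, index into a UTF-16 API, count of scalar values), which is why a single generic length function is hard to defend.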

tonsky.me

1 10 8.8 Clojure

Updated a day after it was mentioned here :-)
https://github.com/tonsky/tonsky.me/commit/5fbbb373025be3758...

NOTE: The number of mentions on this list counts mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Ask HN: Catching Up on C++?
1 project | news.ycombinator.com | 20 Feb 2024
C++ learning
1 project | /r/cpp | 7 Dec 2023
Modern C++ Programming Course
1 project | /r/hypeurls | 29 Nov 2023
Kindle or PDF e-book (in English) about C++
1 project | /r/programare | 8 Jul 2023
uni-algo v1.0.0: Modern Unicode Library
6 projects | /r/cpp | 7 Jul 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Applications written in Rust CPP Virtualization Cpp17 Cpp20
Post date: 2 Oct 2023
