The Absolute Minimum Every Software Developer Must Know About Unicode in 2023

This page summarizes the projects mentioned and recommended in the original post on Hacker News (news.ycombinator.com).

  • xi-editor

    A modern editor with a backend written in Rust.

  • > thing that gets deleted when you hit backspace

    Is there a canonical source for this part, by the way? Xi copied the logic from Android[1] (as per the issue you linked downthread), and I vaguely remember that CLDR had something to say about this too, but I don’t know if there’s any sort of consensus here that’s actually written down anywhere.

    [1] https://github.com/xi-editor/xi-editor/pull/837

  • grapheme-splitter-lite

    A light-weight Java library that breaks strings into user-perceived characters a.k.a. Grapheme Clusters for common cases.

  • In Java/Kotlin, I've found this Grapheme Splitter library to be useful: https://github.com/hiking93/grapheme-splitter-lite

  • The author has several other writeups:

    https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

    The cursors will only be a problem during front page HN traffic. And the opt-out for people who care is reader mode / disable js / static mirror. Not sure if there's any better way to appease the fun-havers and the plain content preferrers at the same time. Maybe a "hide cursors" button on screen? I, for one, had a delightful moment poking other cursors.

  • janet-utf8

    Janet routines for utf8 handling

  • Regarding UTF-8 encoding:

    “And a couple of important consequences:

    - You CAN’T determine the length of the string by counting bytes.

    - You CAN’T randomly jump into the middle of the string and start reading.

    - You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of the character.”

    One of the things I had to get used to when learning the programming language Janet is that strings are just plain byte sequences, unaware of any encoding. So when I call `length` on a string of one character that is represented by 2 bytes in UTF-8 (e.g. `ä`), the function returns 2 instead of 1. Similar issues occur when trying to take a substring, as mentioned by the author.

    As much as I love the approach Janet took here (it feels clean and simple and works well with their built-in PEGs), it is a bit annoying to work with outside of the ASCII range. Fortunately, there are libraries that can deal with this issue (e.g. https://github.com/andrewchambers/janet-utf8), but I wish they would support conversion to/from UTF-8 out of the box, since I generally like Janet very much.

    One interesting thing I learned from the article is that the first byte of a character can always be recognized by its bit prefix. I always wondered how you would recognize and separate a Unicode character in a Janet string, since it may be 1-4 bytes long, but I guess this is the answer.
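
The prefix rule described in the comment above can be sketched in a few lines. This is a minimal illustration in Python rather than Janet, and it assumes the input is already valid UTF-8; the helper names (`is_continuation`, `sequence_length`, `count_code_points`, `align_to_boundary`) are hypothetical and do not come from janet-utf8 or any other library. Note that it counts code points, not user-perceived characters.

```python
def is_continuation(byte: int) -> bool:
    """Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def sequence_length(lead: int) -> int:
    """Return how many bytes a character occupies, judged from its first byte."""
    if lead < 0x80:             # 0xxxxxxx: ASCII, one byte
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a lead byte")


def count_code_points(data: bytes) -> int:
    """Count code points by counting every byte that is not a continuation byte."""
    return sum(1 for b in data if not is_continuation(b))


def align_to_boundary(data: bytes, offset: int) -> int:
    """Move an arbitrary byte offset back to the start of the enclosing code point."""
    while 0 < offset < len(data) and is_continuation(data[offset]):
        offset -= 1
    return offset


# "a", "ä", "€" and U+1F926 take 1, 2, 3 and 4 bytes respectively.
data = "a\u00e4\u20ac\U0001F926".encode("utf-8")
print(len(data), count_code_points(data))
print(align_to_boundary(data, 8))
```

Cutting a substring at `align_to_boundary(data, n)` instead of `n` avoids splitting a character in the middle, which is the failure mode the quoted list warns about.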

  • text

    A spicy text library for C++ that has the explicit goal of enabling the entire ecosystem to share in proper forward progress towards a bright Unicode future. (by soasis)

  • It sounds like a generic length function in Unicode in 2023 is no longer a good idea. These articles complaining about the variety of lengths in Unicode are annoying at this point. Pretty much all of them can be summed up as, "Well, it depends." And, that isn't wrong. But nerds love to argue until they are blue in the face about the One Correct Answer. Sheesh.

    This is the most interesting comparison article I have seen in years about Unicode processing in C++: https://thephd.dev/the-c-c++-rust-string-text-encoding-api-l...

    The author is also the lead on an open source C++ Unicode library called ztd.text: https://github.com/soasis/text
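
The "it depends" point in the comment above can be made concrete with a small sketch. It uses only Python's built-in `str` and `encode`; the `lengths` helper is a hypothetical name introduced for illustration. Grapheme clusters are deliberately left out, since counting them needs Unicode segmentation data that the Python standard library does not provide.

```python
def lengths(text: str) -> dict[str, int]:
    """Report the length of the same string measured in three different units."""
    return {
        # Python's len() on str counts Unicode code points.
        "code_points": len(text),
        # Number of bytes when the string is stored as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of 16-bit code units when stored as UTF-16 (no byte order mark).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


samples = {
    "precomposed a-umlaut": "\u00e4",
    "e plus combining acute": "e\u0301",
    "facepalm emoji": "\U0001F926",
}

for name, text in samples.items():
    print(name, lengths(text))
```

Each sample is displayed as a single character, yet the three counts disagree across samples, so any generic length function has to pick one unit and document which.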

  • tonsky.me

  • Updated a day after it was mentioned here :-)

    https://github.com/tonsky/tonsky.me/commit/5fbbb373025be3758...

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
