Exactly, and most emoji are outside the BMP, so it's not an edge case but the norm that two code units (16 bits each in UTF-16) are used to make one code point (Unicode character).
And it gets even worse when you consider that for many purposes you're not even interested in code points but in graphemes: a single visible emoji might actually be a combination of 5 code points, represented by 8 UTF-16 code units, taking up 16 bytes.
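To make those numbers concrete, here is a small sketch that counts the same string several ways. The family emoji (man, ZWJ, woman, ZWJ, girl) is my own choice of example and the variable names are illustrative; the UTF-8 count is included only for comparison.

```javascript
// U+1F4A9 (pile of poo) lies outside the BMP, so UTF-16 needs a surrogate pair.
const poo = "\u{1F4A9}";
console.log(poo.length);      // 2 (UTF-16 code units)
console.log([...poo].length); // 1 (string iteration walks code points)

// Family emoji: man + ZWJ + woman + ZWJ + girl, rendered as one glyph.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
const codeUnits = family.length;       // 8: three surrogate pairs plus two ZWJs
const codePoints = [...family].length; // 5
const utf16Bytes = codeUnits * 2;      // 16: each UTF-16 code unit is 2 bytes
// For comparison, UTF-8 needs 4 bytes per emoji and 3 per ZWJ.
const utf8Bytes = new TextEncoder().encode(family).length; // 18

console.log({ codeUnits, codePoints, utf16Bytes, utf8Bytes });
```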
If you want to split a string by graphemes, you can either use the main dedicated library for it [3], or the relatively new API Intl.Segmenter [4], which is in Chrome and Safari but, at the time of writing, still hasn't made it to Firefox [5].
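A minimal sketch of the Intl.Segmenter route, with feature detection since support varies by engine. The helper name splitGraphemes is my own, and the fallback here only splits by code point, which is not a real substitute for grapheme segmentation; a dedicated library such as [3] would be the more faithful fallback.

```javascript
// Split a string into user-perceived characters where the engine supports it.
function splitGraphemes(text) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    // segment() returns an iterable of { segment, index, input } records.
    return Array.from(segmenter.segment(text), (part) => part.segment);
  }
  // Degraded fallback (assumption): code points only, so ZWJ sequences
  // and combining marks will still be split apart.
  return Array.from(text);
}

// "e" + combining acute accent, followed by the man-ZWJ-woman-ZWJ-girl emoji.
const sample = "e\u0301\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
// 2 graphemes when Intl.Segmenter is available, 7 code points in the fallback.
console.log(splitGraphemes(sample).length);
```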
[1] https://blog.jonnew.com/posts/poo-dot-length-equals-two
[2] https://www.contentful.com/blog/2016/12/06/unicode-javascrip...
[3] https://github.com/orling/grapheme-splitter
[4] https://github.com/tc39/proposal-intl-segmenter
[5] https://bugzilla.mozilla.org/show_bug.cgi?id=1423593
It's latin1. The same is true of DOM strings in Chromium, like attributes, blocks of text, and inline scripts.
WebKit and the JDK implement the same string optimization, while .NET unfortunately doesn't: https://github.com/dotnet/runtime/issues/6612
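As a rough illustration of when that one-byte representation is even possible: a string can be stored as Latin-1 only if every UTF-16 code unit fits in a single byte. The check below is a sketch of that condition under my own assumptions, with a hypothetical helper name; the representation an engine actually picks is internal and not observable from script.

```javascript
// True if every UTF-16 code unit is <= 0xFF, i.e. the string could be
// stored with one byte per character (Latin-1).
function fitsLatin1(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

console.log(fitsLatin1('<div class="caf\u00E9">')); // true: U+00E9 fits in one byte
console.log(fitsLatin1("a \u2192 b"));              // false: U+2192 needs two bytes
console.log(fitsLatin1("\u{1F4A9}"));               // false: surrogates are above 0xFF
```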
I’m surprised to see no mention of tagged templates, a much more complex version of template literals. To users they may look like a function call without parentheses, but they do quite a bit more.
Short version: the tag function receives an array of the literal's string segments, followed by a variadic set of arguments holding the runtime values from the template positions. Each value belongs right after the string segment at the same index.
That strings array is more than it seems: it also has a raw property holding the same segments exactly as written in the source, with escape sequences left unprocessed. This is what String.raw (also not mentioned) uses to join those arguments essentially the same way an untagged template literal would, minus the escape processing.
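A small sketch of what a tag function actually receives. The names inspect and plain are made up for illustration; plain approximates what an untagged template does by interleaving strings and values.

```javascript
// Expose the three things a tag sees: processed strings, raw strings, values.
function inspect(strings, ...values) {
  return { cooked: [...strings], raw: [...strings.raw], values };
}

const who = "world";
const parts = inspect`hello\n${who}!`;
console.log(parts.cooked); // first segment contains a real newline character
console.log(parts.raw);    // first segment contains a backslash followed by "n"
console.log(parts.values); // [ 'world' ]

// Rebuild roughly what an untagged template literal would produce.
function plain(strings, ...values) {
  let out = strings[0];
  for (let i = 0; i < values.length; i++) {
    out += String(values[i]) + strings[i + 1];
  }
  return out;
}

console.log(plain`a\n${1}b` === `a\n${1}b`);     // true
console.log(String.raw`a\n${1}b` === "a\\n1b");  // true: the escape stays unprocessed
```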
It’s an odd design/interface, but it can be used to do some pretty cool stuff. One example is Zapatos[1], a type-safe SQL library for TypeScript.
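To give a flavor of why SQL libraries like this kind of interface, here is a toy tag that turns interpolated values into numbered placeholders rather than splicing them into the query text. This is a hypothetical sketch of the general idea, not Zapatos's API; the sql name and the { text, values } result shape are my own assumptions.

```javascript
// Hypothetical tag: builds a parameterized query instead of concatenating input.
function sql(strings, ...values) {
  let text = strings[0];
  for (let i = 0; i < values.length; i++) {
    // Each value becomes a positional placeholder: $1, $2, ...
    text += "$" + (i + 1) + strings[i + 1];
  }
  return { text, values };
}

const userInput = "Robert'); DROP TABLE students;--";
const query = sql`SELECT * FROM students WHERE name = ${userInput} AND age > ${18}`;

console.log(query.text);   // SELECT * FROM students WHERE name = $1 AND age > $2
console.log(query.values); // the untrusted string and 18, kept separate from the text
```

The point is that the tag sees the static strings and the dynamic values separately, so the untrusted input never ends up in the query text.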
My only complaints:
1. I can’t think of a real reason for it to be variadic, and this makes authoring tags a little more error-prone. You should be able to expect one array of strings with length N and one array of (type-checkable/inferable) values with length N-1.
2. Likewise, I can’t think of a real reason for the raw strings to be bolted onto the strings array as an extra property. They could just as easily have been an iterable third argument.
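For what it's worth, the shape I'd prefer is easy to get with a small adapter. This is a sketch with a made-up helper name (arrayTag), not anything from the language or a library: it wraps a handler taking (strings, values, raw) as three plain arrays and returns a normal tag function.

```javascript
// Hypothetical adapter: present tag arguments as three plain arrays.
function arrayTag(handler) {
  return (strings, ...values) => handler([...strings], values, [...strings.raw]);
}

const describe = arrayTag((strings, values, raw) => {
  // A template with N string segments always has N-1 values.
  if (values.length !== strings.length - 1) {
    throw new Error("unexpected tag arguments");
  }
  return `${strings.length} strings, ${values.length} values, ${raw.length} raw`;
});

const summary = describe`a${1}b${2}c`;
console.log(summary); // 3 strings, 2 values, 3 raw
```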
1: https://github.com/jawj/zapatos