Our great sponsors
-
MySQL
MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
The most perplexing aspect though is that support for 6 byte sequences was already implemented, and then shortened to only 3 bytes in 2002 with the commit message simply stating "UTF8 now works with up to 3 byte sequences only":
https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...
VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case. But with utf8mb4, it's now 4N bytes, which (with a high N) may exceed internal limits such as maximum length of an index key.
IIRC, there were additional problems in older versions of MySQL, situations where sort buffers were sized to a fixed length equal to the value's worst-case size or something like that. So sorting a large number of utf8mb4 values would use a lot more memory than utf8mb3 values (again, iirc, I might be wrong on this).
So the safer and more backwards-compatible approach was to introduce utf8mb4 as a new separate charset, and allow users to choose. MySQL 8 is now transitioning towards deprecating utf8mb3, and will finally make the utf8 alias point to utf8mb4 sometime in the near future.
That said, there are still a bunch of unpleasant uses of utf8mb3 internally in things like information_schema. I develop schema management tooling and recently lost a week to some of the more obscure ones in https://github.com/skeema/skeema/commit/bf38edb :)
> they would now need to store max 4N bytes, which (with a high N) can exceed internal limits such as maximum length of an index key.
Specifically, indexes are limited to 767 bytes (some say 768?), which means 191 utf8mb4 characters. So your otherwise-innocuous VARCHAR(255) primary key breaks. I’ve seen this multiple times, e.g. django-celery probably still has this: https://github.com/celery/django-celery/issues/259.