Database character sets and collations explained – why utf8 is not UTF-8

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • MySQL

    MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.

  • The most perplexing aspect though is that support for 6 byte sequences was already implemented, and then shortened to only 3 bytes in 2002 with the commit message simply stating "UTF8 now works with up to 3 byte sequences only":

    https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...

  • skeema

    Declarative pure-SQL schema management for MySQL and MariaDB

  • VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case. But with utf8mb4, it's now 4N bytes, which (with a high N) may exceed internal limits such as maximum length of an index key.

    IIRC, there were additional problems in older versions of MySQL, situations where sort buffers were sized to a fixed length equal to the value's worst-case size or something like that. So sorting a large number of utf8mb4 values would use a lot more memory than utf8mb3 values (again, iirc, I might be wrong on this).

    So the safer and more backwards-compatible approach was to introduce utf8mb4 as a new separate charset, and allow users to choose. MySQL 8 is now transitioning towards deprecating utf8mb3, and will finally make the utf8 alias point to utf8mb4 sometime in the near future.

    That said, there are still a bunch of unpleasant uses of utf8mb3 internally in things like information_schema. I develop schema management tooling and recently lost a week to some of the more obscure ones in https://github.com/skeema/skeema/commit/bf38edb :)

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • django-celery

    Old Celery integration project for Django

  • > they would now need to store max 4N bytes, which (with a high N) can exceed internal limits such as maximum length of an index key.

    Specifically, indexes are limited to 767 bytes (some say 768?), which means 191 utf8mb4 characters. So your otherwise-innocuous VARCHAR(255) primary key breaks. I’ve seen this multiple times, e.g. django-celery probably still has this: https://github.com/celery/django-celery/issues/259.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts