In MySQL, never use “utf8”. Use “utf8mb4”

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • MySQL

    MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.

  • Okay, perhaps a more detailed breakdown of the false claims will help, so that nobody is misled by revisionist/apologist history.

    > Unicode was assumed to be 64k of codepoints

    Not in 2002, when MySQL restricted their utf8 to three bytes [1]. Before 1997, Unicode had already specified clearly that 21 bits was the limit [2]. By 2002, there were 94,205 characters, including CJK characters beyond the 16-bit range, and clearly more to come [3].

    > so a 3-byte UTF-8 sequence was considered "long enough"

    Not by many. The MySQL developers chose very badly here. I and plenty of other developers managed to implement UTF-8 more correctly around that time. It wasn't hard, as the specs are very straightforward.

    > especially since there were surrogate pairs for the rare cases where you have to encode higher code points

    Surrogate pairs have never been supported in UTF-8. The RFCs are explicit about that [4] [5] (search for D800). Both this and the four-byte point above are illustrated in a short sketch after the reference list below.

    Maybe you're thinking of CESU-8, though that's not intended for interchange. [6]

    > Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.

    Not supporting Unicode properly has always been problematic; it's just that bug reports from affected users rarely reached the right people. Emojis have done the world a favour in making less competent developers actually notice their bugs in basic text handling.

    > Yes, it could have been avoided

    And was, by most developers.

    > they probably just wanted to optimize a bit.

    They apparently altered a config number [1], so it wasn't an optimization decision; the code at the time still had support for 6-byte utf8 [7]. I would guess that they found a bug in their support for longer utf-8 sequences/conversion and took the hacky way out.

    [1] https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...

    [2] https://unicode.org/faq/utf_bom.html

    [3] https://en.wikibooks.org/wiki/Unicode/Versions

    [4] https://datatracker.ietf.org/doc/html/rfc2044

    [5] https://datatracker.ietf.org/doc/html/rfc3629

    [6] https://www.unicode.org/reports/tr26/tr26-4.html

    [7] https://github.com/mysql/mysql-server/blob/43a506c0ced0e6ea1...
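
    A minimal sketch, in plain Python rather than anything MySQL-specific, of the two claims above: characters outside the Basic Multilingual Plane (some assigned well before 2002) need four UTF-8 bytes, and surrogate code points cannot be encoded in UTF-8 at all. The sample characters are illustrative, not taken from the original discussion.

        # Plain Python 3; no database required.
        samples = {
            "U+00E9 é (Latin-1 Supplement)": "\u00e9",
            "U+4F60 你 (BMP CJK)": "\u4f60",
            "U+2070E 𠜎 (CJK Ext. B, added in Unicode 3.1, 2001)": "\U0002070e",
            "U+1F600 😀 (emoji)": "\U0001f600",
        }

        for label, ch in samples.items():
            encoded = ch.encode("utf-8")
            # Anything above U+FFFF takes 4 bytes, which MySQL's 3-byte "utf8" rejects.
            print(f"{label}: {len(encoded)} bytes -> {encoded!r}")

        # Surrogate code points (U+D800..U+DFFF) are not valid in UTF-8;
        # a conforming encoder refuses them, as Python's does:
        try:
            "\ud800".encode("utf-8")
        except UnicodeEncodeError as err:
            print("surrogate rejected:", err)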

  • pgloader

    Migrate to PostgreSQL in a single command!

  • Maybe this is a cynical take, but we used https://pgloader.io/ a few years ago to migrate to Postgres, and have never been happier. (The basic one-line invocation is shown below.)
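
    For context, the advertised "single command" is essentially a pair of connection strings; the hosts, users, and database names below are placeholders:

        pgloader mysql://user:pass@old-host/sourcedb postgresql://user:pass@new-host/targetdb

    More involved migrations use a load-command file, but for a MySQL source the one-liner covers the common schema-plus-data copy.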

  • Serendipity

    A PHP blog software

  • https://github.com/s9y/Serendipity/blob/05f58f90d743fe9ade24... is just the detection function I wrote for a PHP blog engine to check whether it would even be possible to use utf8mb4 on the system. We ran headlong into this issue and didn't know how to handle it for literally years. Lots of blogs to migrate, and who knows whether it worked for all of them... A rough sketch of that kind of capability check is below.
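
    A rough Python equivalent of that kind of probe (the original function is PHP; the PyMySQL driver and the connection details here are assumptions for illustration only):

        import pymysql  # third-party MySQL driver, used here purely as an example

        def supports_utf8mb4(host, user, password, database):
            """Return True if the server knows the utf8mb4 character set (MySQL 5.5.3+)."""
            conn = pymysql.connect(host=host, user=user, password=password, database=database)
            try:
                with conn.cursor() as cur:
                    # On servers older than 5.5.3 this returns no rows at all.
                    cur.execute("SHOW CHARACTER SET LIKE 'utf8mb4'")
                    return cur.fetchone() is not None
            finally:
                conn.close()

        if supports_utf8mb4("localhost", "blog", "secret", "serendipity"):
            print("safe to convert tables to utf8mb4")
        else:
            print("server only offers the 3-byte utf8")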

  • chardet

    Python character encoding detector

  • The craziest issue I had was that I couldn't predict what character encoding the text in my database was in. Most users entered Windows-1252, some text blobs were UTF-16, others were European character sets, and some were UTF-8. Some were Japanese SHIFT_JIS. Don't ask me how any of this happened. In retrospect, I should have dumped all the tables from MySQL, used the excellent Python chardet [1] library to see what I was dealing with, done the conversions, and then re-imported the data (that detect-and-convert step is sketched below). But then someone could copy UTF-16 from a Windows document and paste it in, so you have to convert going into the database as well.

    You have Apache set to UTF-8, PHP set to UTF-8, MySQL set to UTF-8, and the MySQL driver you are using set to UTF-8, and it's still not clear how these settings interact. Are there silent conversions happening, or do you always have to detect the encoding of data coming from the server? HTML pages have a character encoding specifier, but the BOM at the start of the file takes precedence (I think). I got it to work by always detecting the encoding of any text coming from the database and converting with iconv, but this turned out to be really slow and unreliable. It was truly an order of magnitude bigger mess than any other programming problem I have faced in my career.

    Would not attempt again.

    [1] https://github.com/chardet/chardet
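
    A sketch of that dump/detect/convert cleanup, assuming the chardet package; the sample byte strings are made up, and far shorter than the blobs a real dump would give the detector to work with:

        import chardet

        def to_utf8(raw: bytes) -> str:
            """Guess the encoding of a legacy blob and decode it to a Python str."""
            guess = chardet.detect(raw)               # {'encoding': ..., 'confidence': ..., ...}
            encoding = guess["encoding"] or "utf-8"   # chardet returns None when it has no idea
            return raw.decode(encoding, errors="replace")

        legacy_blobs = [
            "caf\u00e9".encode("windows-1252"),                    # typical browser form input
            "\u3053\u3093\u306b\u3061\u306f".encode("shift_jis"),  # Japanese SHIFT_JIS
            "na\u00efve".encode("utf-16"),                         # pasted from a Windows document
        ]

        for raw in legacy_blobs:
            text = to_utf8(raw)
            print(text, "->", text.encode("utf-8"))   # re-import this UTF-8 into utf8mb4 columns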

  • MySQLTuner-perl

    MySQLTuner is a script written in Perl that will assist you with your MySQL configuration and make recommendations for increased performance and stability.

  • Are the DBAs on this thread committing code to the MySQL [1] and Postgres [2] tuner scripts, to give new DBAs all of your hard-earned advice and battle-hardened experience? Or at least to a repo with the same objectives?

    [1] - https://github.com/major/MySQLTuner-perl

    [2] - https://github.com/jfcoz/postgresqltuner

  • postgresqltuner

    Simple script to analyse your PostgreSQL database configuration, and give tuning advice


  • docker-postgres-upgrade

    a PoC for using "pg_upgrade" inside Docker -- learn from it, adapt it for your needs; don't expect it to work as-is!

  • > When you say "complex ride" what does that mean?

    On managed PostgreSQL services like RDS it's easy, yes, because Amazon does all of the nasty work for you and has already seen all the kinks you may run into.

    If you are on your own and using Docker, you essentially need a third-party image that has the old and new binaries (https://github.com/tianon/docker-postgres-upgrade).

    If you are on your own and are using distribution packages, it's more complicated (https://blog.samuel.domains/blog/tutorials/from-stretch-to-b...).

    If you are on your own and build Postgres from source, well... have fun; there's a reason I prefer distribution-maintained packages when possible. (The core pg_upgrade invocation is sketched below.)
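
    For what it's worth, this is why both sets of binaries have to be present: pg_upgrade is pointed at the old and the new installation at the same time. The paths below are placeholders for a Debian-style layout; it is run as the postgres user against an already-initdb'ed new cluster.

        pg_upgrade \
          --old-bindir  /usr/lib/postgresql/12/bin \
          --new-bindir  /usr/lib/postgresql/16/bin \
          --old-datadir /var/lib/postgresql/12/main \
          --new-datadir /var/lib/postgresql/16/main \
          --check    # dry run; drop --check to perform the actual upgrade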
