Okay, perhaps a more detailed breakdown of the false claims will help, so that nobody is misled by revisionist/apologist history.
> Unicode was assumed to be 64k of codepoints
Not in 2002, when MySQL restricted their utf8 to three bytes [1]. Before 1997, Unicode already specified clearly that 21 bits was the limit [2]. By 2002, there were 94,205 characters, including CJK characters beyond the 16-bit range, and clearly more to come [3].
> so a 3-byte UTF-8 sequence was considered "long enough"
Not by many. The MySQL developers chose very badly, here. I and plenty of other developers managed to implement UTF-8 more correctly around that time. It wasn't hard, as the specs are very straightforward.
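To make the "long enough" question concrete, here is a small Python sketch (my own illustration, not MySQL's code) of how many bytes UTF-8 needs per code-point range; a 3-byte cap covers exactly the Basic Multilingual Plane and nothing above it:

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 uses to encode code point `cp`."""
    if cp < 0x80:
        return 1      # ASCII
    if cp < 0x800:
        return 2      # e.g. most Latin-script accented letters
    if cp < 0x10000:
        return 3      # rest of the Basic Multilingual Plane -- MySQL's old "utf8" cap
    return 4          # supplementary planes: emoji, rarer CJK, etc.

# Cross-check against Python's own UTF-8 encoder:
for cp in (0x41, 0x00E9, 0x4E2D, 0x1F600):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))

print(utf8_len(0xFFFF), utf8_len(0x10000))  # 3 4
```

So any implementation capped at three bytes silently loses everything from U+10000 up, which was already assigned territory by 2002.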
> especially since there were surrogate pairs for the rare cases where you have to encode higher code points
Surrogate pairs have never been supported in UTF-8. The RFCs are explicit about that. [4] [5] (search for D800)
Maybe you're thinking of CESU-8, though that's not intended for interchange. [6]
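Python's strict UTF-8 codec demonstrates the point directly (standard library only): a supplementary-plane character is an ordinary 4-byte sequence, while a lone surrogate code point is simply not encodable as well-formed UTF-8.

```python
# U+10000, the first supplementary-plane character, is a normal 4-byte sequence:
assert "\U00010000".encode("utf-8") == b"\xf0\x90\x80\x80"

# A surrogate code point (U+D800..U+DFFF) is rejected by a conforming encoder:
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)

# CESU-8 would encode each half of a surrogate pair as its own 3-byte
# sequence, but that is a separate, non-interchange encoding -- not UTF-8.
```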
> Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.
Not supporting Unicode properly has always been problematic; it's just that bug reports from affected users rarely reached the right people. Emojis have done the world a favour in making less competent developers actually notice their bugs in basic text handling.
> Yes, it could have been avoided
And was, by most developers.
> they probably just wanted to optimize a bit.
They apparently altered a config number [1], so it wasn't an optimization decision; the code at the time still had support for 6-byte utf8 [7]. I would guess they found a bug in their support for longer UTF-8 sequences or conversion and took the hacky way out.
[1] https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...
[2] https://unicode.org/faq/utf_bom.html
[3] https://en.wikibooks.org/wiki/Unicode/Versions
[4] https://datatracker.ietf.org/doc/html/rfc2044
[5] https://datatracker.ietf.org/doc/html/rfc3629
[6] https://www.unicode.org/reports/tr26/tr26-4.html
[7] https://github.com/mysql/mysql-server/blob/43a506c0ced0e6ea1...
Maybe this is a cynical take - but we used https://pgloader.io/ a few years ago to migrate to Postgres, and have never been happier.
https://github.com/s9y/Serendipity/blob/05f58f90d743fe9ade24... is just the detection function I wrote for a PHP blog engine, to detect whether it would even be possible to use utf8mb4 on the system. We ran headlong into this issue and didn't know how to handle it for literally years. Lots of blogs to migrate, and who knows whether it worked for all of them...
The craziest issue I had was that I couldn't predict what character encoding the text in my database was in. Most users entered Windows-1252, some text blobs were UTF-16, others were European character sets, some were UTF-8, and some were Japanese SHIFT_JIS. Don't ask me how any of this happened. In retrospect, I should have dumped all the tables from MySQL, used the excellent Python chardet [1] library to see what I was dealing with, done the conversions, and then re-imported the data. But then someone could copy UTF-16 from a Windows document and paste it in, so you also have to convert on the way into the database.
You have to set Apache to UTF-8, PHP to UTF-8, MySQL to UTF-8, and the MySQL driver you are using to UTF-8. It's not clear how these settings interact: are there silent conversions happening, or do you always have to detect the encoding of data coming from the server? HTML pages have a character-encoding specifier, but the BOM at the start of the file takes precedence (I think). I got it to work by always detecting the encoding of any text coming from the database and converting with iconv, but this turned out to be really slow and unreliable. It was truly a bigger mess, by an order of magnitude, than any other programming problem I have faced in my career.
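A stdlib-only sketch of that "always detect, then convert" approach (the candidate list and ordering are my assumptions; chardet does this far more cleverly with statistics, this only shows the shape of the fix and why it is unreliable):

```python
import codecs

def to_utf8(raw: bytes) -> str:
    """Heuristically decode `raw` and return a Python (Unicode) string."""
    # UTF-16 is only safely detectable via its BOM; without one it will
    # "successfully" decode almost any even-length byte string as garbage.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    for enc in ("utf-8", "shift_jis"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # cp1252 (~Windows-1252) accepts nearly every byte value, so it is the
    # fallback rather than a real detection -- exactly why this is so messy.
    return raw.decode("cp1252", errors="replace")

print(to_utf8("café".encode("cp1252")))        # café
print(to_utf8("日本語".encode("shift_jis")))    # 日本語
```

Even this tidy-looking cascade misclassifies real data (e.g. some Windows-1252 byte pairs are also valid Shift_JIS), which is the order-of-magnitude mess described above.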
Would not attempt again.
[1] https://github.com/chardet/chardet
Are the DBAs on this thread committing code to the MySQL [1] and Postgres [2] tuner scripts, to give new DBAs all your hard-learned advice and battle-hardened experience? Or at least to a repo with the same objectives?
[1] - https://github.com/major/MySQLTuner-perl
[2] - https://github.com/jfcoz/postgresqltuner
> When you say "complex ride" what does that mean?
On managed PostgreSQL services like RDS it's easy, yes, because Amazon does all of the nasty work for you and has already seen all the kinks you may run into.
If you are on your own and using Docker, you essentially need a third-party image that has the old and new binaries (https://github.com/tianon/docker-postgres-upgrade).
If you are on your own and are using distribution packages, it's more complicated (https://blog.samuel.domains/blog/tutorials/from-stretch-to-b...).
If you are on your own and use postgres from source, well... have fun, there's a reason why I prefer using distribution-maintained packages when possible.