Use utf8mb4 instead of utf8 (a.k.a. utf8mb3) in MySQL

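Conclusion

Let me state the conclusion up front:

  • In MySQL, utf8 is an alias for utf8mb3 and stores each character in 1 to 3 bytes.
  • utf8mb4 stores each character in 1 to 4 bytes.
  • utf8mb4 is a superset of utf8: anything utf8 can store, utf8mb4 stores in exactly the same way.
  • Data stored as utf8 can be converted to utf8mb4 painlessly.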
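
The byte counts in the list above are easy to check outside MySQL. The following is a minimal sketch in Python (chosen only because it makes byte lengths easy to inspect); the sample characters and the helper name `utf8_len` are illustrative assumptions, not something from the MySQL documentation.

```python
# Byte length of single characters when encoded as UTF-8.
# The samples and the helper name are illustrative only.
samples = {
    "A": "ASCII letter",
    "é": "accented Latin letter",
    "中": "CJK ideograph (still inside the BMP)",
    "😀": "emoji (supplementary plane)",
}


def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_len(ch)} byte(s)")
```

The first three characters need at most 3 bytes, so they fit in utf8 (utf8mb3). The emoji needs 4 bytes, which is exactly the case that only utf8mb4 can store.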

Supporting evidence

A conclusion needs something to back it up, so let's check the official MySQL documentation:

10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)

The utf8mb4 character set has these characteristics:

  • Supports BMP and supplementary characters.
  • Requires a maximum of four bytes per multibyte character.

utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:

  • For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.
  • For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there will be none.

utf8mb4 is a superset of utf8mb3, so for an operation such as the following concatenation, the result has character set utf8mb4 and the collation of utf8mb4_col:

    SELECT CONCAT(utf8mb3_col, utf8mb4_col);

Similarly, the following comparison in the WHERE clause works according to the collation of utf8mb4_col:

    SELECT * FROM utf8mb3_tbl, utf8mb4_tbl
    WHERE utf8mb3_tbl.utf8mb3_col = utf8mb4_tbl.utf8mb4_col;

For information about data type storage as it relates to multibyte character sets, see String Type Storage Requirements.

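The key points: BMP characters are stored identically in both character sets, and utf8mb3 data can never contain supplementary characters, so converting it to utf8mb4 is safe.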
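
To make the BMP versus supplementary distinction concrete, here is a small Python sketch that classifies text by whether it would need utf8mb4. The function names `needs_utf8mb4` and `max_bytes_per_char` are hypothetical helpers written for this post, and the check assumes "supplementary" simply means a code point above U+FFFF.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the BMP (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def max_bytes_per_char(text: str) -> int:
    """Largest UTF-8 byte length of any single character in the text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


for sample in ["hello", "数据库", "party 🎉"]:
    print(repr(sample), needs_utf8mb4(sample), max_bytes_per_char(sample))
```

Text made only of BMP characters never exceeds 3 bytes per character, which matches the manual's statement that such data is stored the same way in both character sets.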

How mainstream PHP handles it

PDO

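When PHP connects to a database, you can set the character set of the connection. With PDO, for example, that means setting charset to utf8mb4.

Here is a Stack Overflow question and answer on the topic: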

Question:

when initializing PDO - should I do: charset=UTF8 or charset=UTF8MB4 ?

here's my initialization:

[code snippet missing in the source]

But should dsn be this:

[code snippet missing in the source]

if mysql database has a default charset UTF8MB4.

asked Jul 27 '15 at 17:56 by Dannyboy

Answer

You should use utf8mb4 for PDO and your database structures.

[code snippet missing in the source]

When possible, don’t forget to set the character encoding of your pages as well. PHP example:

[code snippet missing in the source]

Laravel

Let's see how the ever-elegant Laravel handles it:

file: config/database.php

    'mysql' => [
        'driver' => 'mysql',
        'host' => env('DB_HOST', '127.0.0.1'),
        'port' => env('DB_PORT', '3306'),
        'database' => env('DB_DATABASE', 'forge'),
        'username' => env('DB_USERNAME', 'forge'),
        'password' => env('DB_PASSWORD', ''),
        'unix_socket' => env('DB_SOCKET', ''),
        'charset' => 'utf8mb4',
        'collation' => 'utf8mb4_unicode_ci',
        'prefix' => '',
        'prefix_indexes' => true,
        'strict' => true,
        'engine' => null,
    ],

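So go ahead and start using utf8mb4!

As for why the default utf8 does not use 4 bytes per character, my guess is that when MySQL was first designed there were not nearly as many exotic characters around, so a maximum of 3 bytes was chosen for the sake of performance and efficiency.
Interested readers can dig up the details themselves.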
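
Whatever the original motivation was, the cost of the extra byte can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and the 767-byte figure is an assumption based on the index key prefix limit of the older InnoDB row formats (COMPACT and REDUNDANT); newer row formats allow longer keys.

```python
# Maximum bytes a single character can occupy in each character set.
MAX_BYTES = {"utf8mb3": 3, "utf8mb4": 4}

# Assumption: index key prefix limit of older InnoDB row formats.
ASSUMED_KEY_LIMIT = 767


def worst_case_bytes(char_count: int, charset: str) -> int:
    """Worst-case storage for a string of char_count characters."""
    return char_count * MAX_BYTES[charset]


def max_chars_within(byte_limit: int, charset: str) -> int:
    """How many characters are guaranteed to fit into byte_limit bytes."""
    return byte_limit // MAX_BYTES[charset]


for charset in MAX_BYTES:
    print(
        charset,
        worst_case_bytes(255, charset),
        max_chars_within(ASSUMED_KEY_LIMIT, charset),
    )
```

Under that assumption, a 255-character column fits within the limit with utf8mb3 (765 bytes) but not with utf8mb4 (1020 bytes), where only 191 characters are guaranteed to fit.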


Addendum

Note

The utf8mb3 character set is deprecated and will be removed in a future MySQL release.
Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3,
at that point utf8 will become a reference to utf8mb4.
To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references
instead of utf8.

What does that mean? In the future utf8mb3 will be gone, and at that point utf8 will simply mean utf8mb4. Looking forward to that day!
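
Until then, the note's advice is to spell out utf8mb4 instead of the ambiguous utf8. One way to find leftovers is to scan schema dumps for the bare token. This is a minimal Python sketch under stated assumptions: the DDL text is made up for illustration, `ambiguous_utf8_lines` is a hypothetical helper, and it only flags the bare word utf8 (not collation names such as utf8_general_ci).

```python
import re

# Matches the bare token "utf8" but not "utf8mb3" or "utf8mb4".
BARE_UTF8 = re.compile(r"\butf8\b", re.IGNORECASE)


def ambiguous_utf8_lines(ddl: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for lines that use bare utf8."""
    hits = []
    for number, line in enumerate(ddl.splitlines(), start=1):
        if BARE_UTF8.search(line):
            hits.append((number, line.strip()))
    return hits


# Made-up schema dump used only for this example.
ddl = """CREATE TABLE posts (
  title VARCHAR(191)
) DEFAULT CHARSET=utf8;
CREATE TABLE comments (
  body TEXT
) DEFAULT CHARSET=utf8mb4;"""

for number, line in ambiguous_utf8_lines(ddl):
    print(number, line)
```

Only the posts table is reported, since its definition still relies on the utf8 alias.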