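Conclusion
Let me state the conclusion up front:
- In MySQL, utf8 is an alias for utf8mb3, which stores each character in 1 to 3 bytes.
- utf8mb4 stores each character in 1 to 4 bytes.
- utf8mb4 is a superset of utf8: anything utf8 can store, utf8mb4 stores in exactly the same way.
- Data stored as utf8 can be converted to utf8mb4 painlessly.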
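To make these byte counts concrete, here is a small Python sketch that measures the standard UTF-8 encoded length of a few sample characters. It does not talk to MySQL; the sample characters and the helper name `utf8_len` are my own choices for illustration. The assumption is only that utf8mb4 follows standard UTF-8, so the 4-byte case is exactly what utf8mb3 cannot hold.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters chosen for illustration (not from the MySQL docs).
samples = [
    "a",           # U+0061, ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a common CJK character (BMP)
    "\U0001F600",  # U+1F600, an emoji (supplementary character)
]

lengths = {ch: utf8_len(ch) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```

The first three characters fit in 1 to 3 bytes, so they are encoded identically under either character set. Only the emoji needs the fourth byte.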
Supporting evidence
A conclusion needs evidence behind it, so let's check the official MySQL documentation:
10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
The utf8mb4 character set has these characteristics:
- Supports BMP and supplementary characters.
- Requires a maximum of four bytes per multibyte character.
utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:
- For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.
- For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there will be none.
utf8mb4 is a superset of utf8mb3, so for an operation such as the following concatenation, the result has character set utf8mb4 and the collation of utf8mb4_col:
[SQL example not preserved]
Similarly, the following comparison in the WHERE clause works according to the collation of utf8mb4_col:
[SQL example not preserved]
For information about data type storage as it relates to multibyte character sets, see String Type Storage Requirements.
Pay particular attention to the parts I marked in bold.
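As a rough illustration of what "BMP only" means in practice, the sketch below flags strings that contain supplementary characters (code points above U+FFFF), which are the ones the documentation says utf8mb3 cannot store. The function name `fits_in_utf8mb3` is hypothetical, and this is a client-side approximation based on code points, not a description of how MySQL itself validates input.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character in text is a BMP character.

    Hypothetical helper: BMP characters need at most 3 bytes in UTF-8,
    so a string made only of them would fit a 3-byte-per-character set.
    """
    return all(ord(ch) <= BMP_MAX for ch in text)


# Example inputs chosen for illustration.
plain = "hello \u4e2d\u6587"          # ASCII plus two CJK characters
with_emoji = "hello \U0001F600"       # contains a supplementary character

print(fits_in_utf8mb3(plain))
print(fits_in_utf8mb3(with_emoji))
```

Any string that passes this check is encoded the same way under both character sets, which is why moving such data to utf8mb4 requires no re-encoding.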
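How mainstream PHP handles it
PDO
When PHP connects to a database, you can set the connection character set. With PDO, for example, that means setting charset to utf8mb4.
Here is a Stack Overflow question and answer on the topic: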
Question:
When initializing PDO, should I do: charset=UTF8 or charset=UTF8MB4?
Here's my initialization:
[code sample not preserved]
But should dsn be this:
[code sample not preserved]
if mysql database has a default charset UTF8MB4.
asked Jul 27 '15 at 17:56
Dannyboy
Answer
You should use utf8mb4 for PDO and your database structures.
[code sample not preserved]
When possible, don't forget to set the character encoding of your pages as well. PHP example:
[code sample not preserved]
Laravel
Let's see how the elegant Laravel handles this:
file: config/database.php

    'mysql' => [
        'driver' => 'mysql',
        'host' => env('DB_HOST', '127.0.0.1'),
        'port' => env('DB_PORT', '3306'),
        'database' => env('DB_DATABASE', 'forge'),
        'username' => env('DB_USERNAME', 'forge'),
        'password' => env('DB_PASSWORD', ''),
        'unix_socket' => env('DB_SOCKET', ''),
        'charset' => 'utf8mb4',
        'collation' => 'utf8mb4_unicode_ci',
        'prefix' => '',
        'prefix_indexes' => true,
        'strict' => true,
        'engine' => null,
    ],

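So go ahead and use utf8mb4!
As for why the default utf8 does not use up to 4 bytes per character: presumably, when MySQL was first designed, there were not yet so many exotic characters to store, so for the sake of performance it capped storage at 3 bytes.
If you are curious, you can dig up more material on the background.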
Addendum
Note
The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at that point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
What does this mean? In a future release utf8mb3 will be gone, and from then on utf8 will refer to utf8mb4. Looking forward to that day!