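Conclusion
Let's start with the conclusion:
- In MySQL, utf8 is an alias for utf8mb3, which stores each character in 1 to 3 bytes.
- utf8mb4 stores each character in 1 to 4 bytes.
- utf8mb4 is a superset of utf8: anything utf8 can store, utf8mb4 stores in exactly the same way.
- Data stored as utf8 can therefore be converted to utf8mb4 painlessly.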
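To make the byte counts concrete, here is a minimal sketch that measures the UTF-8 encoded length of four sample characters. It assumes PHP 7.0 or later (for the `\u{...}` escape); the helper name `utf8ByteLength` and the sample characters are chosen purely for illustration.

```php
<?php
// Each sample is written with a \u{...} escape, so the bytes do not depend
// on how this file itself is saved. PHP 7.0+ encodes these escapes as UTF-8.
$samples = [
    'U+0041 LATIN CAPITAL LETTER A'          => "\u{41}",
    'U+00E9 LATIN SMALL LETTER E WITH ACUTE' => "\u{E9}",
    'U+4E2D CJK ideograph'                   => "\u{4E2D}",
    'U+1F600 GRINNING FACE (emoji)'          => "\u{1F600}",
];

// strlen() counts bytes, not characters, which is exactly what we want here.
function utf8ByteLength(string $char): int
{
    return strlen($char);
}

foreach ($samples as $label => $char) {
    printf("%-40s %d byte(s)\n", $label, utf8ByteLength($char));
}
```

The first three samples need 1, 2, and 3 bytes, so they fit in both utf8mb3 and utf8mb4. The emoji needs 4 bytes, which only utf8mb4 can store.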
Supporting evidence
A conclusion needs evidence behind it, so let's check the official MySQL documentation:
10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
The utf8mb4 character set has these characteristics:
- Supports BMP and supplementary characters.
- Requires a maximum of four bytes per multibyte character.
utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:
- For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.
- For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there will be none.
utf8mb4 is a superset of utf8mb3, so for an operation such as the following concatenation, the result has character set utf8mb4 and the collation of utf8mb4_col: […]
Similarly, the following comparison in the WHERE clause works according to the collation of utf8mb4_col: […]
For information about data type storage as it relates to multibyte character sets, see String Type Storage Requirements.
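Pay particular attention to two points in the passage above: storage is identical for BMP characters, and utf8mb4 is a superset of utf8mb3.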
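The BMP versus supplementary distinction can be checked in code. The sketch below reports whether a string contains any code point above U+FFFF, which is the kind of character utf8mb3 cannot store. The function name `hasSupplementaryCharacter` is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Returns true when $text contains at least one code point above U+FFFF,
// i.e. a supplementary character that utf8mb3 cannot store.
function hasSupplementaryCharacter(string $text): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    // On invalid UTF-8 preg_match() returns false, so this reports false.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(hasSupplementaryCharacter("caf\u{E9} \u{4E2D}\u{6587}")); // BMP characters only
var_dump(hasSupplementaryCharacter("hello \u{1F600}"));            // contains an emoji
```

A check like this can be useful before writing user input to a column that is still utf8mb3.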
Current mainstream practice in PHP
PDO
When PHP connects to a database, you can specify the character set. With PDO, for example, the charset should be set to utf8mb4.
Here is a Stack Overflow question and answer on the topic:
Question:
when initializing PDO - should I do: charset=UTF8 or charset=UTF8MB4 ?
here's my initialization: […]
But should dsn be this: […]
if mysql database has a default charset UTF8MB4.
asked Jul 27 ‘15 at 17:56
Dannyboy
Answer
You should use utf8mb4 for PDO and your database structures.
When possible, don't forget to set the character encoding of your pages as well. PHP example: […]
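As an illustration of the advice above (not the code from the original answer, which did not survive in this copy), here is a minimal sketch of a PDO connection that requests utf8mb4 through the DSN. The helper names `buildMysqlDsn` and `connectUtf8mb4`, and the host, database, and credentials in the usage comment, are all placeholders.

```php
<?php
// Builds a MySQL DSN that asks the driver to use utf8mb4 on the connection.
// The charset DSN parameter is honoured by pdo_mysql since PHP 5.3.6.
function buildMysqlDsn(string $host, string $database, int $port = 3306): string
{
    return sprintf('mysql:host=%s;port=%d;dbname=%s;charset=utf8mb4', $host, $port, $database);
}

// Opens the connection; throws PDOException if the server is unreachable
// or the pdo_mysql driver is not installed.
function connectUtf8mb4(string $host, string $database, string $user, string $password): PDO
{
    return new PDO(buildMysqlDsn($host, $database), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Hypothetical usage (placeholder credentials):
// $pdo = connectUtf8mb4('127.0.0.1', 'app_db', 'app_user', 'secret');
// header('Content-Type: text/html; charset=utf-8'); // keep the page encoding in step
```

Nothing connects at load time; the connection is only attempted when `connectUtf8mb4` is called.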
Laravel
Let's see how Laravel, known for its elegance, handles this:
file: config/database.php
'mysql' => [
    'driver' => 'mysql',
    'host' => env('DB_HOST', '127.0.0.1'),
    'port' => env('DB_PORT', '3306'),
    'database' => env('DB_DATABASE', 'forge'),
    'username' => env('DB_USERNAME', 'forge'),
    'password' => env('DB_PASSWORD', ''),
    'unix_socket' => env('DB_SOCKET', ''),
    'charset' => 'utf8mb4',
    'collation' => 'utf8mb4_unicode_ci',
    'prefix' => '',
    'prefix_indexes' => true,
    'strict' => true,
    'engine' => null,
],
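So go ahead and use utf8mb4!
As for why the default utf8 does not use up to 4 bytes per character: presumably there were not so many unusual characters around when MySQL was first designed, so it capped storage at 3 bytes per character for the sake of performance.
Interested readers can dig up more material on this.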
Addendum
Note
The utf8mb3 character set is deprecated and will be removed in a future MySQL release.
Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3,
at that point utf8 will become a reference to utf8mb4.
To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references
instead of utf8.
What does this mean? In the future utf8mb3 will be gone, and at that point utf8 will simply mean utf8mb4. Let's look forward to that day!
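Until then, tables created as utf8 (utf8mb3) have to be converted explicitly. The sketch below only builds the MySQL statement for such a conversion and does not run it. The helper name `buildConvertStatement` and the table name `articles` are hypothetical, and the default collation simply mirrors the Laravel config quoted earlier. Treat it as a starting point and try it on a copy of the data first.

```php
<?php
// Builds an ALTER TABLE statement that converts a table's character columns
// to utf8mb4. It returns the SQL as a string and does not touch any database.
function buildConvertStatement(string $table, string $collation = 'utf8mb4_unicode_ci'): string
{
    // Allow only simple identifiers so the names can be embedded safely.
    $simple = '/^[A-Za-z0-9_]+$/';
    if (!preg_match($simple, $table) || !preg_match($simple, $collation)) {
        throw new InvalidArgumentException('Unexpected characters in identifier.');
    }

    return sprintf(
        'ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8mb4 COLLATE %s',
        $table,
        $collation
    );
}

echo buildConvertStatement('articles'), PHP_EOL;
```

The generated statement names utf8mb4 explicitly rather than relying on the utf8 alias, in line with the recommendation in the note above.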