Does using ASCII/Latin Charset speed up the database?

牧云@^-^@ submitted on 2019-12-05 17:35:03

@RickJames is right, you should not worry about saving space by choosing ASCII or utf8 over utf8mb4.

utf8 and utf8mb4 are variable-length character encodings. The table in the Wikipedia article on UTF-8 illustrates how characters automatically take 1, 2, 3, or 4 bytes each, depending on the value encoded. A byte with its high bit clear is a complete one-byte (ASCII) character. A byte with its high bit set is part of a multi-byte sequence, and the leading bits of the first byte give the length of that sequence, up to 4 bytes.

The Wikipedia article explains it clearly:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

You don't have to do anything to choose single-byte versus multi-byte mode. This is just the way the encoding works. Each character automatically uses the number of bytes it needs, and no more.
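As a quick illustration of the variable-length behaviour described above, here is a minimal Python sketch that prints how many bytes a few sample characters occupy in UTF-8. It assumes Python's built-in "utf-8" codec as a stand-in for MySQL's utf8mb4; the helper name utf8_len and the sample characters are chosen here purely for illustration.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character per size class (illustrative choices, written as
# escapes so the code points are unambiguous).
samples = [
    "A",           # U+0041, US-ASCII
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a common CJK character
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s)")
```

The four samples come out at 1, 2, 3 and 4 bytes respectively, without the caller selecting any "mode".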

So there is no advantage to using utf8 over utf8mb4, and no advantage of using ASCII over either, unless you need to restrict the characters allowed in a string.

For what it's worth, the character set MySQL calls "utf8" is an alias for utf8mb3, which supports only the characters that UTF-8 encodes in at most three bytes. The MySQL server team blog (https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/) says that utf8mb4 is faster, at least given the performance improvements in MySQL 8.0, and that utf8mb3 should be considered deprecated. The MySQL 8.0.11 release notes say that utf8 will be redefined as an alias for utf8mb4 in some future version of MySQL.
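If you want a rough idea of whether some text would fit in utf8mb3, one way is to look for code points above U+FFFF, since those are the characters that need four bytes in UTF-8. The sketch below is only an approximation done in Python, not a MySQL feature, and the function name fits_utf8mb3 is made up for this example.

```python
def fits_utf8mb3(text: str) -> bool:
    """Return True if every character encodes in at most 3 UTF-8 bytes.

    Code points above U+FFFF lie outside the Basic Multilingual Plane
    and need 4 bytes, which is the case that requires utf8mb4.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_utf8mb3("na\u00efve caf\u00e9"))  # accented Latin text -> True
print(fits_utf8mb3("\u4e2d\u6587"))          # common CJK text     -> True
print(fits_utf8mb3("ok \U0001F600"))         # contains an emoji   -> False
```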

Short Answer: Not worth worrying about.

Long Answer:

Three issues:

  • Speed:

Comparing two strings under the corresponding _bin COLLATION (ascii_bin or utf8_bin) is as simple as comparing the bytes, so there is no significant difference. Other collations can differ, with ascii being faster. But the difference is insignificant compared to the effort of fetching rows, etc.

  • Space:

Ascii is a subset of utf8. utf8 stores only 1 byte for each ascii character, just as ascii does. So, no space difference. (Accented letters in Western Europe take 1 byte in latin1 but 2 bytes in utf8; hence the two are incompatible and differ in size.) Space affects caching, which leads to a slight difference in performance.

For English text, 0% savings. For European text, latin1 would save only a few percent. For most of the rest of the world, utf8 is the only viable solution. For Chinese and Emoji, utf8mb4 is a must.

  • Temp tables:

In certain situations, the space consumed by a string expands to the potential maximum for its character set. For example, country_code CHAR(2) CHARACTER SET ... will take 2 bytes for ascii but 6 bytes for utf8 (and 8 bytes for utf8mb4).
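The Space and Temp tables points above can be sketched with some simple arithmetic. The Python below compares the bytes a string actually needs with the worst-case bytes reserved for a CHAR(n) of the same length. It is a rough model, not MySQL's real storage code: the helper names are invented here, and Python's "latin-1" and "utf-8" codecs are assumed to be close enough to MySQL's latin1 and utf8/utf8mb4 for the sample strings used.

```python
from typing import Optional

# Maximum bytes per character for each character set discussed here.
MAX_BYTES_PER_CHAR = {"ascii": 1, "latin1": 1, "utf8": 3, "utf8mb4": 4}

# Python codecs used to approximate each MySQL character set (assumption).
PY_CODEC = {"ascii": "ascii", "latin1": "latin-1",
            "utf8": "utf-8", "utf8mb4": "utf-8"}


def actual_bytes(text: str, charset: str) -> Optional[int]:
    """Bytes needed to store text, or None if the charset cannot hold it."""
    try:
        return len(text.encode(PY_CODEC[charset]))
    except UnicodeEncodeError:
        return None


def worst_case_bytes(n_chars: int, charset: str) -> int:
    """Bytes reserved when a CHAR(n_chars) expands to its potential max."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]


for text in ["US", "caf\u00e9"]:
    for charset in MAX_BYTES_PER_CHAR:
        print(f"{text!r:8} {charset:8} "
              f"actual={actual_bytes(text, charset)} "
              f"worst_case={worst_case_bytes(len(text), charset)}")
```

For "US" the actual size is 2 bytes in every character set, while the worst case grows to 6 bytes for utf8 and 8 for utf8mb4. For "café" latin1 needs 4 bytes and utf8 needs 5, which is the "few percent" difference for European text.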

Bottom Line:

Use ascii for country codes, hex, postal codes, uuids, md5s etc. If you are going international, and/or need Emoji, then make your "strings" utf8mb4. But do it because it is 'right', not because you will get magically marvelously much more speed; you won't. And do it whenever you create a table; it's the pits to change it later.
