What is the difference between utf8mb4 and utf8 charsets in MySQL?

轮回少年 2020-11-22 13:37

What is the difference between utf8mb4 and utf8 charsets in MySQL?

I already know about ASCII, UTF-8, UTF-16

5 answers
  •  盖世英雄少女心
    2020-11-22 14:16

    In MySQL, utf8 refers to a flawed implementation of the UTF-8 standard in which not all character ranges are supported.

    Specifically, only characters in the Basic Multilingual Plane (BMP) work; all other characters are considered invalid. Code points in that plane, 0 to 65535 (some of which are reserved for special purposes), need at most 3 bytes in UTF-8, and MySQL's take on UTF-8 arbitrarily set 3 bytes as its limit.
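
    To make the 3-byte boundary concrete, here is a small Python sketch (not MySQL code) that measures how many bytes UTF-8 needs for a few sample characters. The sample characters and the helper name `utf8_length` are my own picks for illustration.

```python
# Illustration only: how many bytes UTF-8 needs per character.
# The helper name and the sample characters are arbitrary choices.

def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

samples = {
    "A": "ASCII",
    "é": "Latin-1 Supplement, inside the BMP",
    "中": "CJK, inside the BMP",
    "\uffff": "last code point of the BMP",
    "\U00010000": "first code point outside the BMP",
    "😀": "emoji, outside the BMP",
}

for ch, note in samples.items():
    # Code points up to U+FFFF need at most 3 bytes; anything above needs 4.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) ({note})")
```

    Every character above U+FFFF comes out at 4 bytes, which is exactly what the old utf8 charset cannot store.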

    Back when MySQL released this, that wasn't much of a problem. Since then, more and more newly defined character ranges have been added to Unicode with values outside the basic multilingual plane.

    To avoid breaking old code that relied on the existing behavior, MySQL retained the broken implementation and called the newer, fixed version utf8mb4. This has led to some confusion, with the name being misread as some kind of extension to UTF-8 rather than MySQL's true implementation of UTF-8.

    Future versions of MySQL may eventually phase out the older version, but for the foreseeable future utf8mb4 should be used instead to ensure correct UTF-8 encoding.
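
    If you are stuck with a legacy utf8 column for now, one way to see what is at risk is to check on the application side whether a string contains anything outside the BMP. This is only a rough sketch: the function names are hypothetical, and it assumes the 3-byte limit described above is the only restriction that matters.

```python
# Rough application-side check, assuming the legacy charset accepts exactly
# the code points that fit in 3 UTF-8 bytes (U+0000 to U+FFFF).
# Both function names are hypothetical, made up for this example.

def fits_in_legacy_utf8(text: str) -> bool:
    """True if every character is inside the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)

def chars_needing_utf8mb4(text: str) -> list:
    """List the characters that need 4 bytes in UTF-8, in order of appearance."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

print(fits_in_legacy_utf8("héllo 中文"))
print(fits_in_legacy_utf8("hi 😀"))
print(chars_needing_utf8mb4("a😀b𝄞"))
```

    A check like this is only a stopgap for spotting problem data; switching the column and the connection to utf8mb4 is the actual fix.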

    Some may take issue with my describing the older, non-compliant implementation as flawed or broken. But by only allowing multi-byte encodings of up to 3 bytes, it never correctly followed the UTF-8 standard as it existed at any point in time, and that is the reason for its flaws. At no point was UTF-8 defined as supporting up to 3 bytes: the only time it was not defined as being up to 4 bytes was when it was originally defined as being up to 6 bytes (!!), which subsequent specs decided was overkill.
