What are surrogate characters in UTF-8?

烈酒焚心 提交于 2019-12-24 11:16:04

问题


I have a strange validation program that validates wheather a utf-8 string is a valid host name(Zend Framework Hostname valdiator in PHP). It allows IDNs(internationalized domain names). It will compare each subdomain with sets of characters defined by their HEX bytes representation. Two such sets are D800-DB7F and DC00-DFFF. Php regexp comparing function called preg_match fails during these comparsions and it says that DC00-DFFF characters are not allowed in this function. From wikipedia I learned these bytes are called surrogate characters in UTF-8. What are thay and which characters they actually correspond to? I read in several places I still don't understand what they are.


回答1:


What are surrogate characters in UTF-8?

This is almost like a trick question.

Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).

Approximate answer #2: Invalid (if not paired).

Approximate answer #3: It's not UTF-8; It's Modified UTF-8.

Synopsis: The term doesn't apply to UTF-8.

Unicode codepoints have a range that needs 21 bits of data.

UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range, the second from a "low" range. Unicode reserves the codepoints that match the ranges of the high and low pairs as invalid. They are sometimes called surrogates but they are not characters. They don't mean anything by themselves.

UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.

#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with 4 8-bit code units, and vice versa.

#2 You can apply the UTF-8 encoding algorithm to the invalid codepoints, which is invalid. They can't be decoded to a valid codepoint. A compliant reader would throw an exception or throw out the bytes and insert a replacement character (�).

#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.

Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; Otherwise, you have data loss.



来源:https://stackoverflow.com/questions/51001150/what-are-surrogate-characters-in-utf-8

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!