Unicode Replacement Characters in the PHP htmlspecialchars function

試著忘記壹切 提交于 2019-12-11 18:17:02

问题


In the htmlspecialchars function, if you set the ENT_SUBSTITUTE flag, it is supposed to replace some invalid characters.

What characters are replaced? And what is the mapping between the invalid characters and the ones that are used to replace it?


回答1:


There is only one, universal replacement character: U+FFFD. If you are writing out UTF-8, then this codepoint is appropriately encoded. If not, you get the corresponding character reference � instead.

There is no reversible mapping. By definition, the original byte sequence was invalid, i.e. it does not have a value (valid = has a value).

Bytes (not really "characters") that are replaced are those that are not valid in the assumed source encoding. For example, if your source encoding was UTF-16 and you had a lone surrogate, that would be "invalid" (though technically any text processor is supposed to abort fatally in that situation). As a better example, if the source encoding is ASCII, then any value above 127 is an invalid character.



来源:https://stackoverflow.com/questions/8205729/unicode-replacement-characters-in-the-php-htmlspecialchars-function

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!