Can I get a single canonical UTF-8 string from a Unicode string?

末鹿安然 submitted on 2020-01-05 10:29:27

Question


I have a twelve-year-old Windows program. As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there's one spot that still needs to be changed over. There is a serious constraint on it though: the exact same ASCII byte sequence MUST be created by different encoders, some of which will be operating on non-Windows systems.

I'm trying to determine whether UTF-8 will do the trick or not. I've heard in passing that different UTF-8 sequences can come up with the same Unicode string, which would be a problem here.

So the question is: given a Unicode string, can I expect a single canonical UTF-8 sequence to be generated by any standards-conforming implementation of a converter? Or are there multiple possibilities?


Answer 1:


Any given Unicode string will have only one representation in UTF-8.

I think the confusion here is that Unicode itself offers multiple ways to get the same visual output in some languages. For example, "é" can be the single code point U+00E9, or "e" (U+0065) followed by a combining acute accent (U+0301). Not to mention that Unicode has several characters that have no visual representation at all.

But this has nothing to do with UTF-8; it's a property of Unicode itself. Encoding a given sequence of Unicode code points as UTF-8 is a purely mechanical process, and it's perfectly reversible.
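To make the distinction concrete, here is a minimal Python sketch. It shows two visually identical strings yielding different UTF-8 bytes, and then the same bytes once both are normalized first. The helper name canonical_utf8 and the choice of NFC are assumptions for illustration; the asker's encoders would all have to agree on the same normalization form.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Different code point sequences give different UTF-8 bytes,
# even though both strings render the same way.
assert precomposed.encode("utf-8") == b"\xc3\xa9"
assert decomposed.encode("utf-8") == b"e\xcc\x81"


def canonical_utf8(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# After normalization both spellings produce the same byte sequence.
assert canonical_utf8(precomposed) == canonical_utf8(decomposed)

# The encoding step itself is reversible.
assert decomposed.encode("utf-8").decode("utf-8") == decomposed
```

Whether normalization is needed at all depends on where the strings come from; if every encoder receives the exact same code points, plain UTF-8 encoding is already enough.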

The conversion rules are here: http://en.wikipedia.org/wiki/UTF-8
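As a rough illustration of how mechanical those rules are, the following Python sketch encodes code points by hand and compares the result with Python's built-in encoder. The function names encode_code_point and encode_utf8 are made up for this example; real code should simply call the platform's encoder.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates and out-of-range values are not valid scalar values.
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_utf8(text: str) -> bytes:
    """Encode a whole string by concatenating the per-code-point results."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)


sample = "A\u00e9\u20ac\U0001F600"
# The hand-written encoder and the built-in one agree byte for byte.
assert encode_utf8(sample) == sample.encode("utf-8")
```

Each code point falls into exactly one range, so there is no choice to make and therefore only one valid output.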




Answer 2:


As John already said, there is only one standards-conforming UTF-8 representation.

But the tricky point is "standards-conforming". Older encoders are often unable to convert UTF-16 properly because of surrogates. Java is one notable case: its "modified UTF-8" format (used, for example, by DataOutputStream.writeUTF) produces two 3-byte sequences instead of one 4-byte sequence. MySQL had problems until recently, and I am not sure about the current status.

Now, you will only have problems with code points that need surrogates, meaning those above U+FFFF. If your application survived without Unicode for a long time, it means you never needed to move such "esoteric" characters :-)
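The difference can be reproduced in a few lines of Python. This sketch imitates a surrogate-unaware encoder by encoding each UTF-16 code unit separately; the function name cesu8_style is invented for the example, and it does not reproduce every detail of any particular non-conforming encoder (Java's modified UTF-8, for instance, also treats U+0000 specially).

```python
import struct

ch = "\U0001F600"  # a code point above U+FFFF

# A conforming encoder emits a single 4-byte sequence.
conforming = ch.encode("utf-8")
assert conforming == b"\xf0\x9f\x98\x80"


def cesu8_style(text: str) -> bytes:
    """Imitate an encoder that treats each UTF-16 code unit on its own."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # "surrogatepass" lets Python encode lone surrogates, which the
    # strict UTF-8 codec would otherwise refuse.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


non_conforming = cesu8_style(ch)
# Two 3-byte sequences, one per surrogate, instead of one 4-byte sequence.
assert non_conforming == b"\xed\xa0\xbd\xed\xb8\x80"
assert non_conforming != conforming

# For text inside the Basic Multilingual Plane the two encoders agree.
assert cesu8_style("A\u00e9\u20ac") == "A\u00e9\u20ac".encode("utf-8")
```

A strict UTF-8 decoder rejects the 6-byte form, so mixing the two kinds of encoder breaks both byte-for-byte equality and round-tripping.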
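If it helps to know whether existing data is affected, a quick check along these lines would flag strings containing such characters. The helper name needs_surrogates is hypothetical, and the sketch assumes Python 3, where iterating a str yields whole code points.

```python
def needs_surrogates(text: str) -> bool:
    """Return True if any code point lies above U+FFFF (outside the BMP)."""
    return any(ord(ch) > 0xFFFF for ch in text)


assert not needs_surrogates("plain ASCII")
assert not needs_surrogates("caf\u00e9 \u20ac")    # still inside the BMP
assert needs_surrogates("smile \U0001F600")       # needs a surrogate pair in UTF-16
```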

But it is good to get things right from the get-go. Use standards-conforming encoders everywhere and you will be fine.



Source: https://stackoverflow.com/questions/4166094/can-i-get-a-single-canonical-utf-8-string-from-a-unicode-string
