Python C# - Unicode character is not the same on Python and C#

庸人自扰 2020-12-22 14:00

I ran into a problem while working with text files: the Unicode representation of a character differs between Python and C#.

1 answer
  • 2020-12-22 14:27

    You can't fix it. It is inherent in the Unicode implementation of the languages.

    C# and Java store strings internally as UTF-16, so once a file has been read and decoded, every string is a sequence of 16-bit code units. Code points outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) are represented by a surrogate pair, that is, two 16-bit code units for one Unicode code point. The fact that you can see a single code point as two code units is a leaky abstraction.
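
The surrogate arithmetic described above can be sketched in a few lines. This is a minimal illustration of the standard UTF-16 encoding rule, not code from either runtime; the function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F464)
print(hex(high), hex(low))  # 0xd83d 0xdc64
```

These are the two values that C#, Java and a narrow Python 2 build expose when U+1F464 is indexed.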

    Python 3.3+ hides this implementation detail. Internally it stores each string using 1, 2 or 4 bytes per code point, as needed, but it presents only Unicode code points to the user.
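
If you need Python 3 to report the same length that C# would, one option is to count UTF-16 code units explicitly. The sketch below assumes the string contains no lone surrogates (encoding would raise an error otherwise); `utf16_length` is a hypothetical helper name.

```python
def utf16_length(text):
    """Count UTF-16 code units, which is what C#'s string.Length reports.

    Hypothetical helper for illustration only.
    """
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" writes no byte order mark.
    return len(text.encode("utf-16-le")) // 2


s = "a\U0001F464b"
print(len(s))           # 3 (code points)
print(utf16_length(s))  # 4 (UTF-16 code units)
```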

    Python 2 (a narrow build such as the Windows one below; same leaky abstraction as C# and Java):

    Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len(u'\U0001F464')
    2
    >>> u'\U0001F464'[0]
    u'\ud83d'
    >>> u'\U0001F464'[1]
    u'\udc64'
    

    Python 3.3+:

    Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len(u'\U0001F464')
    1
    >>> u'\U0001F464'[0]
    '👤'
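
Although the difference itself cannot be removed, offsets can be translated between the two views. The following is a rough sketch, assuming you have an offset counted in UTF-16 code units (as C# would report it) and want the matching index into a Python 3 string; the function name is hypothetical.

```python
def utf16_index_to_code_point_index(text, utf16_index):
    """Map an offset in UTF-16 code units to an index into a Python 3 str.

    Hypothetical helper for illustration only. An offset that points at
    either half of a surrogate pair maps to the code point that pair encodes.
    """
    units = 0
    for i, ch in enumerate(text):
        width = 2 if ord(ch) > 0xFFFF else 1  # non-BMP characters take two units
        if utf16_index < units + width:
            return i
        units += width
    return len(text)  # offset at or past the end of the string


s = "a\U0001F464b"
print(utf16_index_to_code_point_index(s, 3))  # 2, the position of 'b'
```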