Python: how to convert utf-8 code string back to string?

遥遥无期 2020-12-15 01:52

I am using Python, and unfortunately my code needs to convert a string that represents the UTF-8 code of a string back into the original string, like:

UTF-8 code string t

3 Answers
  • 2020-12-15 02:20

    I think this is what you want. It isn't a UTF-8 byte string (well, technically it is, but only because ASCII is a subset of UTF-8); it is ASCII text containing Unicode escape sequences.

    >>> s='\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'
    >>> print s.decode('unicode-escape')
    欢迎提交微博搜索使用反馈，请直接
    

    FYI, this is UTF-8:

    >>> s.decode('unicode-escape').encode('utf8')
    '\xe6\xac\xa2\xe8\xbf\x8e\xe6\x8f\x90\xe4\xba\xa4\xe5\xbe\xae\xe5\x8d\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe4\xbd\xbf\xe7\x94\xa8\xe5\x8f\x8d\xe9\xa6\x88\xef\xbc\x8c\xe8\xaf\xb7\xe7\x9b\xb4\xe6\x8e\xa5'
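The session above is Python 2. On Python 3, `str` has no `decode` method, so the text has to pass through bytes first. A minimal sketch, assuming the input contains only ASCII characters and `\uXXXX` escapes (the variable names are illustrative, not from the answer):

```python
# Text that contains literal backslash-u escapes (a raw string keeps the backslashes).
escaped = r'\u6b22\u8fce'

# Encode to ASCII bytes, then let the unicode_escape codec interpret the escapes.
decoded = escaped.encode('ascii').decode('unicode_escape')
print(decoded)

# Encoding the decoded text as UTF-8 gives the actual UTF-8 bytes.
utf8_bytes = decoded.encode('utf-8')
print(utf8_bytes)
```

If the input may already contain non-ASCII characters, `encode('ascii')` raises `UnicodeEncodeError`, so this sketch does not cover mixed input.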

  • 2020-12-15 02:21

    If I understand the question correctly, we have a plain byte string that contains Unicode escape sequences, something like this:

    a = '\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'
    
    In [122]: a
    Out[122]: '\\u6b22\\u8fce\\u63d0\\u4ea4\\u5fae\\u535a\\u641c\\u7d22\\u4f7f\\u7528\\u53cd\\u9988\\uff0c\\u8bf7\\u76f4\\u63a5'
    

    So we need to manually parse the unicode values from the string, using the Unicode code points:

    \u6b22 => unichr(0x6b22) # 欢
    

    Or, for the whole string:

    # Each escape is 6 characters: '\u' plus 4 hex digits.
    print "".join([unichr(int(a[i+2:i+6], 16)) for i in range(0, len(a), 6)])
    欢迎提交微博搜索使用反馈，请直接
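The slicing above assumes the string consists of nothing but six-character `\uXXXX` groups. A sketch of a more tolerant variant, written for Python 3 and using a regular expression so that ordinary text between escapes is left alone (`unescape_bmp` is a hypothetical helper name, and code points outside the Basic Multilingual Plane are not handled):

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_bmp(text):
    """Replace each \\uXXXX escape in text with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(unescape_bmp(r'hi \u6b22\u8fce!'))
```

Text that contains no escapes passes through unchanged, which the fixed-width slicing cannot guarantee.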
    
  • 2020-12-15 02:37

    Mark Pilgrim explains this in his book, Dive Into Python. Take a look:

    http://www.diveintopython.net/xml_processing/unicode.html

    >>> s = u"\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5"
    >>> print s.encode("utf-8")
    欢迎提交微博搜索使用反馈，请直接
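This works because the `u"..."` literal is turned into real characters when the source is parsed, which is not the case for escaped text that arrives at run time. A small Python 3 sketch of the difference (variable names are illustrative):

```python
# In a normal literal the escape is resolved by the parser: one character.
literal = "\u6b22"

# In a raw literal (or in escaped text read at run time) the backslash is kept: six characters.
raw_text = r"\u6b22"

print(len(literal), len(raw_text))
print(literal.encode("utf-8"))
```

Only the second form needs an explicit unescaping step before it can be encoded as UTF-8.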
    