How do I convert a unicode to a string at the Python level?

刺人心 2020-12-09 17:56

The following unicode and string can exist on their own if defined explicitly:

>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'


        
7 Answers
  • 2020-12-09 18:20

    Simplified explanation: in Python 2, the str type holds bytes, so each element is limited to the range 0-255. If you want to store unicode (whose characters come from a much wider range) in a str, you first have to encode the unicode into a byte format suitable for str, for example UTF-8.

    To do this, call the encode method on your unicode object and pass the desired encoding as the argument, for example this_is_str = value_uni.encode('utf-8').
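
    A minimal sketch of the same encode step, written for Python 3 so it can be run today (that is an assumption on my part, since the question uses Python 2). In Python 3, `str` plays the role of Python 2's `unicode`, and `bytes` plays the role of Python 2's `str`. The variable names are illustrative only.

```python
# Python 3 sketch: `str` is text (Unicode), `bytes` is the byte string.
text = "Andr\xe9"                 # 5 characters; the last one is U+00E9
encoded = text.encode("utf-8")    # text -> bytes

# The non-ASCII character takes two bytes in UTF-8, so the lengths differ.
print(len(text), len(encoded))    # 5 6
print(encoded)                    # b'Andr\xc3\xa9'
```

    Decoding with the same codec, `encoded.decode("utf-8")`, gives the original text back.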

    For a longer, more in-depth (and language-agnostic) article on Unicode handling, see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

    Another excellent article (this time Python-specific): Unicode HOWTO

  • 2020-12-09 18:25

    value_uni.encode('utf8') or whatever encoding you need.

    See http://docs.python.org/library/stdtypes.html#str.encode

  • 2020-12-09 18:29

    It seems like

    str(value_uni)
    

    should work... at least, it did when I tried it.

    EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1). So for a platform-independent version of this, try

    value_uni.encode('latin1')
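
    A rough Python 3 sketch of why the codec matters here (Python 3 is my assumption; the question is Python 2). Latin-1 maps each code point 0-255 to the byte with the same value, while ASCII only covers 0-127. A stock Python 2 normally has ASCII as its default encoding, so a bare str(value_uni) would be expected to fail there for this input.

```python
value_uni = "Andr\xc3\xa9"   # the mis-decoded text from the question

# Latin-1 maps each code point 0-255 to the byte with the same value.
as_bytes = value_uni.encode("latin-1")
same_values = list(as_bytes) == [ord(c) for c in value_uni]

# ASCII cannot represent \xc3 or \xa9, so an ASCII codec fails on this text.
try:
    value_uni.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(as_bytes, same_values, ascii_ok)   # b'Andr\xc3\xa9' True False
```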
    
  • 2020-12-09 18:32

    If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding. The correct encoding is UTF-8. To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered. The first 256 code points of Unicode are a 1:1 mapping with ISO-8859-1 (alias latin1) encoding. So:

    >>> u'Andr\xc3\xa9'.encode('latin1')
    'Andr\xc3\xa9'
    

    Now it is a byte string that can be decoded correctly with utf8:

    >>> 'Andr\xc3\xa9'.decode('utf8')
    u'Andr\xe9'
    >>> print 'Andr\xc3\xa9'.decode('utf8')
    André
    

    In one step:

    >>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
    André
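
    The same round trip can be sketched in Python 3 (an assumption, since the transcript above is Python 2). The helper name `fix_mojibake` is hypothetical, and the fallback of returning the input unchanged is just one possible design choice.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mis-decoded as Latin-1 (hypothetical helper)."""
    try:
        # Latin-1 turns code points 0-255 back into the original bytes,
        # which are then decoded with the codec that should have been used.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 encodable, or not valid UTF-8: leave the input untouched.
        return text


print(fix_mojibake("Andr\xc3\xa9"))   # André
```

    Text that is already correct, such as "Andr\xe9", does not form valid UTF-8 after the Latin-1 step, so this sketch returns it unchanged.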
    
  • 2020-12-09 18:36

    The OP is not converting to ASCII or UTF-8, which is why the suggested encode methods won't work. Try this:

    v = u'Andr\xc3\xa9'
    s = ''.join(map(lambda x: chr(ord(x)),v))
    

    The chr(ord(x)) business gets the numeric value of each unicode character (which had better fit in one byte for your application) and turns it into a one-character byte string. The ''.join call is an idiom that concatenates that list of one-character strings back into an ordinary string. No doubt there is a more elegant way.
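
    For comparison, a Python 3 sketch of the same idea (Python 3 is an assumption here; chr() there returns text, so the byte string is built with bytes() instead). It also checks that the result matches what the Latin-1 codec produces.

```python
v = "Andr\xc3\xa9"

# Python 3 counterpart of ''.join(chr(ord(x)) ...): build bytes from code points.
# bytes() raises ValueError if any code point is above 255.
s = bytes(ord(x) for x in v)

# Encoding with Latin-1 yields the same bytes in one call.
matches_latin1 = (s == v.encode("latin-1"))

print(s, matches_latin1)   # b'Andr\xc3\xa9' True
```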

  • 2020-12-09 18:39

    You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9', which is the unicode string for 'André'.

    But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:

    >>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
    'Andr\xc3\xa9'
    

    Then decode it correctly:

    >>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
    u'Andr\xe9'    
    

    Now it is in the correct format.

    However, instead of doing this, you should if possible work out why the data was incorrectly decoded in the first place, and fix the problem there.
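
    To illustrate how this kind of data can arise at the source, here is a small Python 3 sketch (an assumption; the thread is Python 2). The bytes in `raw` stand in for whatever actually arrives from a file or socket, and the only difference between the two results is the codec passed to decode.

```python
raw = b"Andr\xc3\xa9"            # UTF-8 bytes, e.g. as read from a file

wrong = raw.decode("latin-1")    # wrong codec: each byte becomes its own character
right = raw.decode("utf-8")      # right codec: the two bytes form one character

# The wrong result is one character longer than the intended text.
print(len(wrong), len(right))    # 6 5
```

    Fixing the codec at the point where the bytes are first decoded removes the need for any later repair step.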
