The following unicode and string can exist on their own if defined explicitly:
>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'
A simplified explanation: in Python 2 the str type holds bytes, so each element is limited to the range 0-255. A unicode object can contain characters from a much wider range, so to store it in a str you first have to encode it into a byte format such as UTF-8.
To do this, call the encode method on your unicode object and pass the desired encoding as the argument, for example this_is_str = value_uni.encode('utf-8').
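As a rough illustration of the encoding step, here is a minimal sketch. The variable names are my own, and it is written so that it should run unchanged on both Python 2 and Python 3 (on Python 3 the encoded value is a bytes object rather than a str):

```python
# Sketch: encode a Unicode string to UTF-8 bytes and decode it back.
text = u'Andr\xe9'                 # five code points; the last one is U+00E9
encoded = text.encode('utf-8')     # U+00E9 becomes the two bytes 0xC3 0xA9
decoded = encoded.decode('utf-8')  # round-trips to the original Unicode string

print(repr(encoded))
print('%d code points, %d bytes' % (len(text), len(encoded)))
```

The point is only that one character above 127 takes more than one byte in UTF-8, which is why the encoded form is longer than the Unicode string.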
You can read a longer, more in-depth (and language-agnostic) article on Unicode handling here: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Another excellent article (this time Python-specific): Unicode HOWTO
value_uni.encode('utf8')
or whatever encoding you need.
See http://docs.python.org/library/stdtypes.html#str.encode
It seems like
str(value_uni)
should work... at least, it did when I tried it.
EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1). So for a platform-independent version of this, try
value_uni.encode('latin1')
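To see why relying on the default encoding is fragile, here is a small sketch. It assumes the default is ASCII, which as far as I know is the usual Python 2 setting, and imitates that by naming the codec explicitly. The names ascii_ok and latin1_bytes are just for illustration:

```python
value_uni = u'Andr\xc3\xa9'

# With an ASCII default, str(value_uni) would attempt this and fail,
# because U+00C3 and U+00A9 are outside the 0-127 range.
try:
    value_uni.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Naming the codec explicitly does not depend on any system setting.
latin1_bytes = value_uni.encode('latin1')

print(ascii_ok)
print(repr(latin1_bytes))
```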
If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding. The correct encoding is UTF-8. To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered. The first 256 code points of Unicode are a 1:1 mapping with the ISO-8859-1 (alias latin1) encoding. So:
>>> u'Andr\xc3\xa9'.encode('latin1')
'Andr\xc3\xa9'
Now it is a byte string that can be decoded correctly with utf8:
>>> 'Andr\xc3\xa9'.decode('utf8')
u'Andr\xe9'
>>> print 'Andr\xc3\xa9'.decode('utf8')
André
In one step:
>>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
André
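If you need to do this in more than one place, the round trip could be wrapped in a small function. This is only a sketch: fix_mojibake is a hypothetical name of mine, not a library function, and returning the input unchanged on failure is a design choice that may not suit every application.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF, or the resulting
        # bytes are not valid UTF-8. Assume it was not damaged this way.
        return text

repaired = fix_mojibake(u'Andr\xc3\xa9')   # wrongly decoded input
untouched = fix_mojibake(u'Andr\xe9')      # already correct input

print(repaired == u'Andr\xe9')
print(untouched == u'Andr\xe9')
```

Note that a string which is already correct can, in rare cases, happen to look like valid UTF-8 after the Latin-1 step, so this kind of repair is best applied only to data you know is damaged.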
The OP is converting to neither ASCII nor UTF-8. That's why the suggested encode methods won't work. Try this:
v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)),v))
The chr(ord(x)) business gets the numeric value of each unicode character (which had better fit in one byte for your application) and turns it into a one-character str, and the ''.join call is an idiom that concatenates that list of one-character strings back into an ordinary string. No doubt there is a more elegant way.
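For what it's worth, here is a sketch of the same byte-for-code-point idea that should also run on Python 3, where chr returns text rather than a byte. It goes through bytearray instead, and the variable names are made up for the comparison:

```python
v = u'Andr\xc3\xa9'

# Build the byte string from each character's numeric value.
# bytearray raises ValueError if any value is above 255.
via_ord = bytes(bytearray(ord(c) for c in v))

# The Latin-1 codec performs the same one-to-one mapping.
via_latin1 = v.encode('latin1')

print(via_ord == via_latin1)
```

Both routes give the same bytes, so the codec call is probably the more elegant way.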
You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9', which is equivalent to 'André'.
But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'
Then decode it correctly:
>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'
Now it is in the correct format.
However instead of doing this, if possible you should try to work out why the data has been incorrectly encoded in the first place, and fix that problem there.
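One plausible cause, sketched here under the assumption that the data arrived as UTF-8 bytes and was decoded as Latin-1 somewhere upstream (the variable names are mine):

```python
raw = b'Andr\xc3\xa9'          # UTF-8 bytes, e.g. as read from a file or socket

wrong = raw.decode('latin1')   # each byte becomes its own code point
right = raw.decode('utf8')     # 0xC3 0xA9 is read as the single character U+00E9

print(len(wrong), len(right))
print(wrong == u'Andr\xc3\xa9')
print(right == u'Andr\xe9')
```

If your input path looks like the wrong line, decoding with the right codec at that point removes the need for any repair later.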