I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:
str = "abc\u20ac\U00010302\
If you create the string as a unicode object, iterating over it should yield one character at a time automatically. E.g.:
Python 2.6:
s = u"abc\u20ac\U00010302\U0010fffd" # note u in front!
for c in s:
    print "U+%04x" % ord(c)
I received:
U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd
Python 3.2:
s = "abc\u20ac\U00010302\U0010fffd"
for c in s:
    print("U+%04x" % ord(c))
It worked for me:
U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd
Additionally, I found this link, which explains that this behavior is correct. If the string came from a file or a similar source, it will likely need to be decoded first.
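To illustrate the decoding step, here is a minimal sketch. It assumes the bytes are UTF-8 encoded (your file may use a different encoding), and the names raw, text and labels are just placeholders for this example:

```python
# Raw bytes as they might come from a file opened in binary mode.
# Assumption: the data is UTF-8; substitute the real encoding if it differs.
raw = b"abc\xe2\x82\xac\xf0\x90\x8c\x82"  # "abc", U+20AC, U+10302

# Decode the bytes into a unicode string before iterating over it.
text = raw.decode("utf-8")

# On a wide build this yields one label per character.
labels = ["U+%04x" % ord(c) for c in text]
```

Iterating over raw directly would walk the individual bytes, not the characters, which is why the decode comes first.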
Update:
I've found an insightful explanation here. The size of the internal Unicode representation is a compile-time option. If you work with "wide" characters outside the 16-bit plane (code points above U+FFFF), you'll need to build Python yourself to remove the limitation, or use one of the workarounds on this page. Apparently many Linux distros already ship such a build, which would explain the results I saw above.
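For what it's worth, one possible workaround on a narrow build is to join surrogate pairs by hand while iterating. This is only a sketch, not necessarily what the linked page suggests, and iter_code_points and is_narrow are names I made up for it. On a wide build no surrogate pairs appear in the string, so the values pass through unchanged:

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF on a wide build.
is_narrow = sys.maxunicode == 0xFFFF

def iter_code_points(s):
    """Yield each code point of s as an integer, joining surrogate pairs."""
    i = 0
    n = len(s)
    while i < n:
        cp = ord(s[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1  # skip the low surrogate we just consumed
        yield cp
        i += 1
```

Usage would look like this: for cp in iter_code_points(s), format each value with "U+%04x" % cp. Unpaired surrogates are yielded as they are, without raising an error.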