How to iterate over Unicode characters in Python 3?

花落未央 2020-12-10 03:48

I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:

str = "abc\u20ac\U00010302\

3 answers
  •  眼角桃花
    2020-12-10 04:27

    If you create the string as a Unicode object, iterating over it should yield one character at a time automatically. For example:

    Python 2.6:

    s = u"abc\u20ac\U00010302\U0010fffd"   # note u in front!
    for c in s:
        print "U+%04x" % ord(c)
    

    I received:

    U+0061
    U+0062
    U+0063
    U+20ac
    U+10302
    U+10fffd
    

    Python 3.2:

    s = "abc\u20ac\U00010302\U0010fffd"
    for c in s:
        print("U+%04x" % ord(c))
    

    It worked for me:

    U+0061
    U+0062
    U+0063
    U+20ac
    U+10302
    U+10fffd
    

    Additionally, I found this link, which explains that this behavior is correct. If the string came from a file or another external source, it will likely need to be decoded first.
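
To illustrate the decoding step, here is a minimal sketch in Python 3. The `raw` bytes are a made-up stand-in for data read from a file or socket, and UTF-8 is an assumption; use whatever encoding the real source has.

```python
# Hypothetical input: UTF-8 bytes standing in for data read from a file.
raw = "abc\u20ac\U00010302".encode("utf-8")

# Decode bytes to str before iterating; iterating over bytes yields integers,
# not characters.
text = raw.decode("utf-8")

code_points = ["U+%04x" % ord(c) for c in text]
print(code_points)
```

On Python 3.3 or later this should list five code points, with U+10302 reported as a single character.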

    Update:

    I've found an insightful explanation here. The size of the internal Unicode representation is a compile-time option. If you are working with "wide" characters outside the 16-bit plane (code points above U+FFFF), you'll need to build Python yourself to remove the limitation, or use one of the workarounds on this page. Apparently many Linux distros already do this for you, which would explain the results I got above.
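
One possible workaround for a "narrow" build, sketched under some assumptions: check `sys.maxunicode` to see which kind of build is running, and combine UTF-16 surrogate pairs by hand while iterating. The helper name `iter_code_points` is hypothetical and is not necessarily what the linked page describes. Since Python 3.3 (PEP 393), `str` always iterates by code point, so this only matters on older narrow builds.

```python
import sys

def iter_code_points(s):
    """Yield integer code points, combining UTF-16 surrogate pairs if present."""
    i = 0
    n = len(s)
    while i < n:
        cp = ord(s[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1  # skip the low surrogate we just consumed
        yield cp
        i += 1

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF otherwise.
print("narrow build" if sys.maxunicode == 0xFFFF else "full code point range")

s = "abc\u20ac\U00010302\U0010fffd"
for cp in iter_code_points(s):
    print("U+%04x" % cp)
```

On a build with the full code point range the helper simply passes each character through, so the same loop should work on both kinds of build.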
