Is there an easy way to make unicode work in python?

难免孤独 2021-01-04 09:22

I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing but 1/2 the time when I add it, I get errors, and 1/2 the time wh

5 Answers
  •  忘掉有多难
    2021-01-04 09:48

    There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
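
    A minimal sketch of that "decode on the way in, encode on the way out" discipline, assuming the incoming bytes are UTF-8. The helper names (to_text, to_bytes, shout) are hypothetical and only for illustration; the code is written to run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers illustrating "decode early, encode late".
# Assumption: external data arrives as UTF-8 encoded bytes.

text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_text(value, encoding="utf-8"):
    """Decode encoded bytes to text; pass text through unchanged."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Encode text to bytes; pass already-encoded bytes through unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


def shout(title):
    """Core logic works on text only: text in, text out."""
    return title.upper() + u"!"


incoming = b"caf\xc3\xa9"    # encoded data received from outside
title = to_text(incoming)    # decode immediately at the boundary
result = shout(title)        # everything in the middle is text
outgoing = to_bytes(result)  # encode only when the data leaves the program
```

    Because each helper checks the type before converting, calling it twice on the same value is harmless, which avoids the double-encoding problem.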

    Python 2 makes this harder in several ways: it makes str, rather than unicode, the "default" for things like string literals; it silently coerces str to unicode when you add the two; and it lets you call .encode() on an already-encoded string in an attempt to double-encode it (Python 2 first decodes the str implicitly as ASCII, which fails on any non-ASCII byte). As a result, a lot of Python coders and Python libraries have no idea which encoding they are meant to work with, yet they still implicitly assume one particular encoding, because the str type leaves managing the encoding to the programmer. You have to think about the encoding every time you use such libraries, since they don't support the unicode type themselves.
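
    To make the silent coercion concrete, here is a small sketch that spells out the ASCII decode Python 2 performs implicitly when a str meets a unicode object (for example in u'' + raw). The function name implicit_coerce is hypothetical, and the explicit .decode("ascii") stands in for the implicit step so the example also runs on Python 3:

```python
# -*- coding: utf-8 -*-
# Sketch: the decode below is what Python 2 does behind the scenes when it
# coerces a str to unicode, assuming the default encoding is ASCII.

raw = u"caf\xe9".encode("utf-8")  # encoded UTF-8 data: b'caf\xc3\xa9'


def implicit_coerce(raw_bytes):
    """Mimic Python 2's implicit str -> unicode coercion (ASCII decode)."""
    return raw_bytes.decode("ascii")


try:
    implicit_coerce(raw)
    outcome = "ok"
except UnicodeDecodeError:
    # Non-ASCII bytes blow up, often far away from where the data came in.
    outcome = "UnicodeDecodeError"

# ASCII-only bytes pass silently, so the bug can stay hidden during testing.
ascii_outcome = implicit_coerce(b"cafe")
```

    This is why such code tends to work in testing and fail in production: the coercion only raises once non-ASCII data shows up.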


    In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the 2nd tells you you're dealing with UNencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:

    # Encode only if we actually hold a unicode object; leave an encoded str as-is.
    encoded_title = title
    if isinstance(encoded_title, unicode):
        encoded_title = title.encode('utf-8')
    

    If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:

    python -Werror -municodenazi myprog.py
    

    This will give you a traceback right at the point where unicode leaks into your non-unicode strings, instead of leaving you to troubleshoot the exception way down the road from the actual problem. See my answer on this related question for details.
