Is it possible for Python to read non-ascii text from file?

Question

I have a UTF-8 encoded .txt file and have trouble reading it into Python. I have a large number of such files, so converting them would be cumbersome.

So if I read the file in via

for line in file_obj:
    ...

I get the following error:

  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 291: ordinal not in range(128)

I guess x.decode("utf-8") wouldn't work since the error occurs before the line is even read in.

Answer 1:

There are two choices.

1. Specify the encoding when opening the file, instead of using the default.
2. Open the file in binary mode, and explicitly decode from bytes to str.

The first is obviously the simpler one. You don't show how you're opening the file, but assuming your code looks like this:

with open(path) as file_obj:
    for line in file_obj:
        ...

Do this:

with open(path, encoding='utf-8') as file_obj:
    for line in file_obj:
        ...

That's it.
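For completeness, the second choice (binary mode plus explicit decoding) might look roughly like the sketch below. The helper name read_lines_binary and the temporary demo file are made up for illustration. Decoding line by line is safe for UTF-8 because the newline byte never occurs inside a multi-byte sequence; that assumption would not hold for an encoding such as UTF-16.

```python
import os
import tempfile


def read_lines_binary(path, encoding="utf-8"):
    """Yield decoded lines from a file opened in binary mode.

    Hypothetical helper, not part of the original answer.
    """
    with open(path, "rb") as file_obj:
        for raw_line in file_obj:            # raw_line is a bytes object
            yield raw_line.decode(encoding)  # explicit bytes -> str step


# Demo: write a small UTF-8 file containing non-ASCII characters, then read it back.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("caf\u00e9 \u00a9\nplain ascii\n".encode("utf-8"))

lines = list(read_lines_binary(demo_path))
os.remove(demo_path)
```

Because the file is never opened in text mode, the platform default encoding plays no part here.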

As the docs explain, if you don't specify an encoding in text mode:

The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.

In some cases (e.g., any OS X, or Linux with an appropriate configuration), locale.getpreferredencoding() will always be 'UTF-8'. But it will never be "automatically whatever's right for any file I might open". So if you know a file is UTF-8, you should specify it explicitly.
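To see which default applies on a given machine, something like the following sketch can help. It also reproduces the failure from the question on a two-byte sample; the sample bytes are chosen for illustration and are not taken from the asker's file.

```python
import locale

# Encoding the docs quoted above say text mode falls back to when
# no encoding argument is passed to open().
default_encoding = locale.getpreferredencoding(False)
print("Default text-mode encoding:", default_encoding)

# 0xC2 0xA9 is the UTF-8 encoding of the copyright sign; 0xC2 is the
# same lead byte that appears in the traceback in the question.
sample = b"\xc2\xa9"
decoded = sample.decode("utf-8")

try:
    sample.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    # Same error class the question reports: byte outside range(128).
    ascii_failed = True
```

If the printed default is an ASCII variant (for example 'US-ASCII' or 'ANSI_X3.4-1968'), opening a UTF-8 file without an explicit encoding will fail as soon as a non-ASCII byte is read.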

Answer 2:

For a solution that works on both Python 2 and 3, use codecs:

import codecs
file_obj = codecs.open('your_file.txt', 'r', 'utf-8')  # placeholder file name

for line in file_obj:
    ...

Otherwise, if you only need Python 3, use abarnert's solution (Answer 1 above).

Source: https://stackoverflow.com/questions/15512741/is-it-possible-for-python-to-read-non-ascii-text-from-file

Tags: python, ascii, decode