Is it possible for Python to read non-ascii text from file?

Submitted by 点点圈 on 2020-01-04 02:11:06

Question


I have a .txt file that is UTF-8 encoded, and I'm having trouble reading it into Python. I have a large number of such files, so converting them all would be cumbersome.

So if I read the file in via

for line in file_obj:
    ...

I get the following error:

  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 291: ordinal not in range(128)

I guess x.decode("utf-8") wouldn't work since the error occurs before the line is even read in.


Answer 1:


There are two choices.

  1. Specify the encoding when opening the file, instead of using the default.
  2. Open the file in binary mode, and explicitly decode from bytes to str.

The first is obviously the simpler one. You don't show how you're opening the file, but assuming your code looks like this:

with open(path) as file_obj:
    for line in file_obj:
        ...

Do this:

with open(path, encoding='utf-8') as file_obj:
    for line in file_obj:
        ...

That's it.
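
The second choice listed above, binary mode with explicit decoding, is not shown in this answer. Here is a minimal sketch, assuming the file really is UTF-8. The helper name read_lines_binary and the temporary sample file are made up for illustration:

```python
import os
import tempfile


def read_lines_binary(path, encoding="utf-8"):
    """Read a file in binary mode and decode each line from bytes to str."""
    lines = []
    with open(path, "rb") as file_obj:
        for raw_line in file_obj:
            # raw_line is a bytes object; decoding happens here, explicitly.
            lines.append(raw_line.decode(encoding))
    return lines


# Build a small UTF-8 sample file (hypothetical content, for the demo only).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("caf\u00e9\n\u00b5m\n".encode("utf-8"))

decoded = read_lines_binary(sample_path)
os.remove(sample_path)

# ascii() escapes non-ASCII characters, so printing works on any console.
print(ascii(decoded))
```

Binary mode does no newline translation, so a file with Windows line endings keeps its "\r\n" at the end of each decoded line.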

As the docs explain, if you don't specify an encoding in text mode:

The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.

In some cases (e.g., any OS X system, or Linux with an appropriate configuration), locale.getpreferredencoding() will always be 'UTF-8'. But it can obviously never be "automatically whatever's right for any file I might open". So if you know a file is UTF-8, you should specify it explicitly.
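
To see which default applies on a given machine, it may help to print what locale.getpreferredencoding() returns and compare that with an explicit encoding. This is a rough sketch. The sample file and its content are invented for the demo, and the printed default will differ between systems:

```python
import locale
import os
import tempfile

# The encoding open() falls back to in text mode when none is given.
default_encoding = locale.getpreferredencoding()
print("Default text encoding:", default_encoding)

# Write one non-ASCII character as UTF-8: the bytes 0xc2 0xb5 0x0a.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("\u00b5\n".encode("utf-8"))

# With an explicit encoding, the result does not depend on the locale.
with open(sample_path, encoding="utf-8") as file_obj:
    text = file_obj.read()
os.remove(sample_path)

print(ascii(text))
```

The byte 0xc2 from the traceback in the question is the lead byte of a two-byte UTF-8 sequence, which is why the ascii codec rejects it while the utf-8 codec decodes it without trouble.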




Answer 2:


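For a solution that works on both Python 2 and 3, use codecs:

import codecs
# 'your_file.txt' is a placeholder; substitute the path to your own file
file_obj = codecs.open('your_file.txt', "r", "utf-8")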

for line in file_obj:
    ...

Otherwise, if you only need Python 3, use abarnert's solution (Answer 1 above).
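
As an alternative to codecs that this answer does not mention, io.open accepts an encoding argument on Python 2.6 and later, and on Python 3 it is the same function as the built-in open. Here is a minimal sketch along those lines, with a made-up temporary file standing in for the real one:

```python
import io
import os
import tempfile

# Create an empty temporary file to work with (demo only).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# io.open takes an encoding on both Python 2.6+ and Python 3.
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(u"na\u00efve\n")

with io.open(sample_path, "r", encoding="utf-8") as file_obj:
    lines = [line for line in file_obj]
os.remove(sample_path)

# Each line is already decoded text (unicode on Python 2, str on Python 3).
print(len(lines))
```

Using a with block also closes the file automatically, which the codecs example above leaves to the caller.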



Source: https://stackoverflow.com/questions/15512741/is-it-possible-for-python-to-read-non-ascii-text-from-file
