Python 2.7: Read file with Chinese characters

Question

I am trying to analyze data in CSV files that have Chinese characters in their names (e.g. "粗1 25g"). I am using Tkinter to choose the files like so:

selectedFiles = askopenfilenames(filetypes=[("xlsx","*"),("xls","*")]) # Use the Tkinter dialog window to choose files
selectedFiles = master.tk.splitlist(selectedFiles) # Create list from files chosen

I have attempted to convert the filename to unicode in this way:

selectedFiles = [x.decode("utf-8") for x in selectedFiles]

Only to yield the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 0: ordinal not in range(128)

I have also tried converting the filenames as the files are created with the following:

titles = [x.encode('utf-8') for x in titles]

Only to receive the error:

IOError: [Errno 22] invalid mode ('wb') or filename: 'C:\...\\data_division_files\\\xe7\xb2\x971 25g.csv'

I have also tried combinations of the above methods to no avail. What can I do to allow these files to be read in Python?

(This question, while related, has not solved my problem: Obtain File size with os.path.getsize() in Python 2.7.5)

Answer 1:

When you call decode on a unicode object, Python 2 first encodes it with sys.getdefaultencoding() so that it has bytes to decode. That is why you get an error about ASCII even though you didn't ask for ASCII anywhere.
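To make that hidden step visible, here is a minimal sketch that performs the ASCII encode explicitly. The helper name implicit_step is made up for illustration, and "ascii" is hard-coded as a stand-in for the Python 2 default encoding. The exception class raised here is an encode error; the point is only that the ASCII codec gets involved without being asked for.

```python
# -*- coding: utf-8 -*-
# The example filename from the question, written with an escape so the
# snippet does not depend on the source file's encoding.
name = u"\u7c971 25g"


def implicit_step(text):
    """Hypothetical helper: run the encode that Python 2 performs behind
    the scenes before it can decode a unicode object."""
    try:
        text.encode("ascii")  # stand-in for sys.getdefaultencoding()
        return "ok"
    except UnicodeError as exc:
        return type(exc).__name__


result = implicit_step(name)
print(result)
```

A plain ASCII name passes through the same helper without trouble, which is why this kind of bug only shows up once a non-ASCII filename appears.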

So, where are you getting a unicode object from? From askopenfilename. From a quick test, it looks like it always returns unicode values on Windows (presumably by getting the UTF-16 and decoding it), while on POSIX it returns some unicode and some str (I'd guess by leaving alone anything that fits into 7-bit ASCII and decoding anything else with your filesystem encoding). If you had printed the repr or type of selectedFiles, the problem would have been obvious.
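A quick way to do that inspection, as a sketch: the describe helper and the sample list are hypothetical stand-ins for whatever the dialog actually returned.

```python
def describe(paths):
    """Return (type name, repr) pairs so text and byte strings are easy
    to tell apart at a glance."""
    return [(type(p).__name__, repr(p)) for p in paths]


# Hypothetical stand-in for selectedFiles: one text path, one byte path.
sample = [u"\u7c971 25g.csv", b"plain.csv"]

for type_name, shown in describe(sample):
    print("%s %s" % (type_name, shown))
```

If the first column says unicode (or str on Python 3), the value is already text and calling decode on it is the wrong move.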

Meanwhile, the encode('utf-8') shouldn't cause any UnicodeErrors, but your filesystem encoding is probably not UTF-8 on Windows, so it will likely cause a lot of IOErrors with errno 2 (trying to open files that don't exist, or to create files in directories that don't exist), 22 (trying to open files with illegal file or directory names on Windows), etc. And it looks like that's exactly what you're seeing. There's really no reason to do it; just pass the pathnames as-is to open and they'll be fine.
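To see why UTF-8 bytes end up naming the wrong file, compare them with the bytes a different code page would use. This is only a sketch: GBK is an assumption about what a Chinese-locale Windows machine might use, not something the question states.

```python
# -*- coding: utf-8 -*-
name = u"\u7c971 25g.csv"  # the filename from the question, as text

as_utf8 = name.encode("utf-8")
as_gbk = name.encode("gbk")  # assumption: the system's ANSI code page

print(repr(as_utf8))
print(repr(as_gbk))

# Reading the UTF-8 bytes as if they were GBK does not give the name back.
misread = as_utf8.decode("gbk", "replace")
print(misread == name)
```

The two byte strings differ, so a system that interprets byte paths in one encoding will look for a file that was never created under that name.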

So, basically, if you removed all of your encode and decode calls, your code would probably just work.
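As a rough illustration of passing the path through untouched, this sketch writes and reads a file whose name contains the character from the question. The temporary folder, the prefix and the file contents are made up for the example; io.open is used so the same code reads the same way on Python 2.7 and 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile

# A unicode prefix keeps the whole path as text on Python 2 as well.
folder = tempfile.mkdtemp(prefix=u"unicode_name_demo_")
path = os.path.join(folder, u"\u7c971 25g.csv")

try:
    # No encode() or decode() on the path: it is handed to open as-is.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(u"a,b\n1,2\n")
    with io.open(path, "r", encoding="utf-8") as handle:
        content = handle.read()
finally:
    shutil.rmtree(folder)

print(repr(content))
```

Note that the encoding argument here applies to the file's contents only; the filename itself is never converted by hand.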

However, there's an even easier solution: Just use askopenfile or asksaveasfile instead of askopenfilename or asksaveasfilename. Let Tk figure out how to use its pathnames and just hand you the file objects, instead of messing with the pathnames yourself.
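Something along these lines, as a sketch rather than a drop-in replacement: the function name read_chosen_file and the filetypes list are assumptions, and the imports sit inside the function so that defining it does not need a display. askopenfile returns an open file object, or None if the dialog is cancelled.

```python
def read_chosen_file():
    """Hypothetical helper: let Tk open the chosen file and return its
    raw contents, or None if the user cancels the dialog."""
    try:
        import Tkinter as tk  # Python 2
        import tkFileDialog as filedialog
    except ImportError:
        import tkinter as tk  # Python 3
        from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window behind the dialog
    try:
        handle = filedialog.askopenfile(
            mode="rb",
            filetypes=[("CSV files", "*.csv"), ("All files", "*")],
        )
        if handle is None:
            return None
        with handle:
            return handle.read()
    finally:
        root.destroy()
```

Calling read_chosen_file() from a script with a display available pops up the dialog; the pathname never passes through your own code, so there is nothing to encode or decode.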

Source: https://stackoverflow.com/questions/19444296/python-2-7-read-file-with-chinese-characters

Tags

python

unicode

encoding

character

filenames