Python 3: CSV files and Unicode Error

Question

I have a CSV (TSV) file with this header and first data row:

"Message Name"  "Field" "Base Label"    "Base Label Update Date"    "Translated Label"  "Translated Label Update Date"  "Language"
"Message"   "subject_template"  "New Task: Assess Distribution Outcomes for ""${docNameNoLink}"", ""${docNumber}""" "8/10/16 4:17:43 PM"    "Nouvelle tâche : évaluez le résultat de la distribution de « ${docNameNoLink} »."  "2/17/14 5:09:10 AM"    "fr"

When I try to read the file with this code

import csv
import sys  # needed for sys.exit() below

# fileName is assumed to hold the path to the TSV file
with open(fileName, 'r', encoding='utf-8', errors='replace') as fdata:
    csv.register_dialect('tsv', delimiter='\t', quoting=csv.QUOTE_NONE)
    reader = csv.reader(fdata, dialect='tsv')
    try:
        for row in reader:
            print(row)
    except csv.Error as e:
        # report the file and the line at which parsing failed
        sys.exit('file {}, line {}: {}'.format(fileName, reader.line_num, e))

I get this error message: file NameFile, line 1: line contains NULL byte

However, if I run the same code without errors='replace' (or errors='ignore'):

with open(fileName, 'r',  encoding='utf-8') as fdata:
    csv.register_dialect('tsv', delimiter='\t', quoting=csv.QUOTE_NONE)
    reader=csv.reader(fdata, dialect='tsv')
    try:
        for row in reader:
            print (row)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(fileName, reader.line_num, e))

I get the following error message:

  File "csvFiles.py", line 76, in <module>
    for row in reader:
  File "c:\Python35\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

What is the likely cause of this error, and how can I correct it so that the script works?

Answer 1:

Your data is not encoded in 'utf-8' but in 'utf-16-le' or something similar. 'utf-16-le' is just a guess, but when I encode your data with 'utf-16-le', exactly the same errors are produced. Check the encoding of your data file. On Linux you can do that with an editor such as emacs or with the 'file' utility.
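If neither emacs nor 'file' is at hand, the check can also be done from Python by looking at the first bytes of the file for a byte-order mark. The following is only a rough sketch: the function name guess_encoding_from_bom and the fallback to 'utf-8' are my own assumptions, and a file without a BOM can still be in any encoding.

```python
import codecs

# Known byte-order marks, longest first, so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none.

    Hypothetical helper: it only inspects the BOM and does not
    validate the rest of the file.
    """
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The returned name can be passed straight to open(fileName, 'r', encoding=...). The 'utf-16', 'utf-32' and 'utf-8-sig' codecs consume the BOM themselves, so it does not end up in the first field.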

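The error message itself tells us that the first byte of your file is 0xff. That is consistent with the byte-order mark (BOM) of a little-endian UTF-16 file, which is the byte pair FF FE. It would also explain the "NULL byte" error: in UTF-16 every ASCII character is stored together with a 0x00 byte, so with errors='replace' the decoder gets past the BOM and hands NUL characters to the csv module.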
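Both symptoms can be reproduced without the original file. This is a small sketch on a made-up two-column header (the sample string is an assumption, not your actual data), encoded as little-endian UTF-16 with a BOM and then decoded as UTF-8 the way the question's code does:

```python
import codecs

# Made-up sample: a short quoted, tab-separated header line.
sample = '"Message Name"\t"Field"\n'

# Little-endian UTF-16 with an explicit BOM (bytes FF FE) in front.
raw = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

# Strict UTF-8 decoding stops at the very first byte of the BOM.
strict_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    strict_error = e
print(strict_error)

# With errors='replace' the two BOM bytes become U+FFFD, and every 0x00
# byte that UTF-16 pairs with an ASCII character decodes to a NUL character.
replaced = raw.decode("utf-8", errors="replace")
print(repr(replaced[:6]))
print("\x00" in replaced)
```

The first print shows the same UnicodeDecodeError as in the traceback, and the last one confirms that the text passed on to the csv reader contains NUL characters.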

Answer 2:

Changing just this one line of the code might make it work:

with open(fileName, 'r',  encoding='utf-16') as fdata:
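For context, here is a self-contained sketch of the whole read with that change applied. It makes a few assumptions beyond the one-line fix: the helper name read_tsv and the demo file are invented for illustration, newline='' is passed as the csv documentation recommends, and the default quoting is used instead of QUOTE_NONE so that the surrounding double quotes are stripped from the fields.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding="utf-16"):
    """Return all rows of a tab-separated file as lists of strings.

    Hypothetical helper; the encoding default assumes a UTF-16 file
    that starts with a BOM.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(path, "r", encoding=encoding, newline="") as fdata:
        reader = csv.reader(fdata, delimiter="\t")
        return list(reader)


# Build a small UTF-16 demo file (made-up content) and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(demo_path, "w", encoding="utf-16", newline="") as f:
    f.write('"Message Name"\t"Language"\r\n"Message"\t"fr"\r\n')

rows = read_tsv(demo_path)
print(rows)
```

If the file turns out to be in some other encoding, only the encoding argument needs to change.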

Answer 3:

Python treats a single backslash in an ordinary string literal as the start of an escape sequence. Try again, but replace each single backslash with two. Good luck.

Source: https://stackoverflow.com/questions/41725308/python-3-csv-files-and-unicode-error

Tags

python-3.x

csv

unicode