Python 3: CSV files and Unicode Error

被刻印的时光 ゝ submitted on 2020-06-18 15:41:48

Question


I have a CSV (TSV) file with this header and first data row:

"Message Name"  "Field" "Base Label"    "Base Label Update Date"    "Translated Label"  "Translated Label Update Date"  "Language"
"Message"   "subject_template"  "New Task: Assess Distribution Outcomes for ""${docNameNoLink}"", ""${docNumber}""" "8/10/16 4:17:43 PM"    "Nouvelle tâche : évaluez le résultat de la distribution de « ${docNameNoLink} »."  "2/17/14 5:09:10 AM"    "fr"

When I try to read the file with this code:

import csv
import sys  # needed for sys.exit() below

with open(fileName, 'r', encoding='utf-8', errors='replace') as fdata:
    csv.register_dialect('tsv', delimiter='\t', quoting=csv.QUOTE_NONE)
    reader = csv.reader(fdata, dialect='tsv')
    try:
        for row in reader:
            print(row)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(fileName, reader.line_num, e))

I get the error message: file NameFile, line 1: line contains NULL byte

However, if I run the same code without the errors='replace' (or errors='ignore') argument:

with open(fileName, 'r',  encoding='utf-8') as fdata:
    csv.register_dialect('tsv', delimiter='\t', quoting=csv.QUOTE_NONE)
    reader=csv.reader(fdata, dialect='tsv')
    try:
        for row in reader:
            print (row)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(fileName, reader.line_num, e))

I get the following error:

File "csvFiles.py", line 76, in <module>
  for row in reader:
File "c:\Python35\lib\codecs.py", line 321, in decode
  (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

What could be causing this error, and how can I correct it so the script works?


Answer 1:


Your data is not encoded in 'utf-8' but in 'utf-16-le' or something similar; 'utf-16-le' is just a guess. When I encode your data with 'utf-16-le', exactly the same errors are produced. Check the encoding of your data file. On Linux you can use an editor such as Emacs for that, or the 'file' utility.
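To see why UTF-16 data would produce both symptoms from the question, here is a minimal sketch. The sample row is a hypothetical stand-in for the real file, and it assumes the file starts with a little-endian byte-order mark, which is only the guess made above.

```python
import codecs

# Hypothetical sample row standing in for the real file contents.
sample = '"Message Name"\t"Field"\t"Language"\r\n'

# Assumed layout: UTF-16 little-endian bytes preceded by the BOM (FF FE).
raw = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")


def decode_strict(data):
    """Return the message of the error raised by a strict UTF-8 decode, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Strict decoding fails on the very first byte (0xff).
strict_error = decode_strict(raw)

# With errors='replace' the BOM bytes become U+FFFD, and the zero bytes that
# accompany each ASCII character survive as NUL characters in the text.
replaced = raw.decode("utf-8", errors="replace")

print(strict_error)
print("\x00" in replaced)
```

The NUL characters left in the decoded text are what the csv module in the question's Python 3.5 rejects with "line contains NULL byte".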

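The error message itself tells us that the first byte of your file is 0xff. This is, potentially, part of the byte-order mark (BOM): a UTF-16 little-endian BOM is the two bytes 0xFF 0xFE. UTF-16 also stores each ASCII character together with a zero byte, which would explain the "line contains NULL byte" error when decoding errors are replaced instead of raised.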
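If the 'file' utility is not available, the first few bytes can be checked from Python. This is only a rough sketch: guess_encoding_from_bom is a hypothetical helper, not part of the standard library, and it can only recognise files that actually begin with a BOM.

```python
import codecs

# Checked in this order because the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head, default="utf-8"):
    """Guess a codec name from the leading bytes of a file.

    Hypothetical helper: returns `default` when no BOM is found, which
    does not prove that the file really uses that encoding.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


# The first bytes of a file like the one in the question would look like this.
print(guess_encoding_from_bom(b'\xff\xfe"\x00M\x00'))
```

In practice the leading bytes would come from something like open(fileName, 'rb').read(4). The returned names ('utf-16', 'utf-32', 'utf-8-sig') are codecs that consume the BOM themselves when the file is decoded.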




Answer 2:


Changing just this one line of the code might make it work:

with open(fileName, 'r',  encoding='utf-16') as fdata:
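A fuller sketch of the same idea follows, assuming the file really is UTF-16 with a BOM. read_tsv and the temporary demo file are hypothetical names used only for illustration. Unlike the question's code, it passes newline='' as the csv documentation recommends and keeps the default quoting, so the surrounding double quotes are parsed rather than left in the fields.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding="utf-16"):
    """Read a tab-separated file and return its rows as lists of strings."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, "r", encoding=encoding, newline="") as fdata:
        return list(csv.reader(fdata, delimiter="\t"))


# Demo data: a small stand-in for the file described in the question.
rows_in = [
    ["Message Name", "Field", "Language"],
    ["Message", "subject_template", "fr"],
]

fd, demo_path = tempfile.mkstemp(suffix=".tsv")
os.close(fd)
try:
    # Writing with 'utf-16' puts a BOM at the start of the file.
    with open(demo_path, "w", encoding="utf-16", newline="") as out:
        writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_ALL)
        writer.writerows(rows_in)
    rows_out = read_tsv(demo_path)
finally:
    os.remove(demo_path)

print(rows_out)
```

The 'utf-16' codec reads the BOM to pick the byte order and strips it, so the first header field comes back without any stray leading character.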



Answer 3:


For some reason, Python does not like a single backslash. Try again, replacing each of your single backslashes with two. Good luck.



Source: https://stackoverflow.com/questions/41725308/python-3-csv-files-and-unicode-error
