UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte error in Python while reading a CSV file

Submitted by 十年热恋 on 2021-02-08 03:37:07

Question


import pandas as pd
StopWords = pd.read_csv('stopwords.csv', encoding='UTF-8', quotechar='|', names=['StopWords'])

I am trying to read a CSV file that contains Persian language text, and this is the error I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


Answer 1:


Without seeing the binary content of the file, it is difficult to guess the actual encoding. UTF-8, however, with or without a BOM (byte order mark), cannot start with 0xFF; that byte never occurs anywhere in valid UTF-8.

If the file starts with 0xFF, it is probably little-endian UTF-16 or UTF-32, which are the only Unicode serialisations whose byte order mark starts with 0xFF (FF FE for UTF-16, FF FE 00 00 for UTF-32).

https://en.wikipedia.org/wiki/Byte_order_mark gives some explanation.
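
As a rough illustration of the BOM check described above, the sketch below reads the first four bytes of a file and maps a recognised BOM to a Python codec name. The helper name guess_bom_encoding is hypothetical (it is not part of pandas or the standard library), and the approach assumes the file actually carries a BOM; a file without one simply yields None.

```python
import codecs

# BOMs ordered so that longer marks are tested first: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_bom_encoding(path):
    """Hypothetical helper: return a codec name from the file's BOM, or None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM found; the encoding must be determined another way
```

If this returns "utf-16" for stopwords.csv, passing encoding='utf-16' to pd.read_csv in place of 'UTF-8' would be the first thing to try. The "utf-16" and "utf-32" codecs read the BOM themselves to pick the byte order, so the endianness does not need to be spelled out. Note that a UTF-16 file whose first character after the BOM is U+0000 would be misreported as UTF-32, which should be rare in practice.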

It is also possible that the file uses a Persian-specific character set. When generating your source CSV files, avoid national character sets if a Unicode option is available.



Source: https://stackoverflow.com/questions/58199571/unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-invalid
