'utf-16-le' codec can't decode bytes while reading Excel in Python

僤鯓⒐⒋嵵緔 submitted on 2021-02-05 09:28:10

Question


I am trying to read a number of xls files in different languages (Arabic, Greek, Italian, Hebrew, etc.), and I get the error shown below when I call the open_workbook function. Any idea how I can set the format so that it works for any language?

Code:

import xlrd
book = xlrd.open_workbook(workbook_url)

Error:

return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 372-373: unexpected end of data


Answer 1:


It's unlikely that language is the issue. More likely, xlrd is having trouble detecting the encoding of the .xls file.

As xlrd notes in the documentation on handling of unicode:

This package presents all text strings as Python unicode objects. From Excel 97 onwards, text in Excel spreadsheets has been stored as UTF-16LE (a 16-bit Unicode Transformation Format). Older files (Excel 95 and earlier) don’t keep strings in Unicode; a CODEPAGE record provides a codepage number (for example, 1252) which is used by xlrd to derive the encoding (for same example: “cp1252”) which is used to translate to Unicode.

My first step would be to determine the actual encoding. How old is the file, and how was it created (in actual Excel, or via a third-party tool)?

You could look for the CODEPAGE record by opening the file in a text/hex editor and then try to force that encoding.
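If a hex editor isn't handy, the same search can be roughly approximated in Python. The sketch below is only a heuristic and rests on assumptions: it treats the CODEPAGE record as the byte pattern for record id 0x0042 with a 2-byte payload, scans the raw file bytes for it, and ignores the container structure of the .xls file entirely, so false positives are possible. The function name find_codepage_candidates is made up for this example.

```python
import struct

# Assumed layout of a CODEPAGE record: id 0x0042, payload length 2,
# both little-endian, followed by the 2-byte codepage number.
CODEPAGE_HEADER = b"\x42\x00\x02\x00"


def find_codepage_candidates(data):
    """Return (offset, codepage) pairs for every header match in the raw bytes.

    Heuristic only: the pattern can also occur by chance in unrelated data.
    """
    candidates = []
    pos = data.find(CODEPAGE_HEADER)
    while pos != -1:
        payload = data[pos + 4:pos + 6]
        if len(payload) == 2:  # skip a match cut off at the end of the data
            (codepage,) = struct.unpack("<H", payload)
            candidates.append((pos, codepage))
        pos = data.find(CODEPAGE_HEADER, pos + 1)
    return candidates
```

To use it, read the workbook with open(path, "rb") and pass the bytes in. A candidate such as 1252 would correspond to "cp1252", while 1200 would indicate UTF-16LE, in which case forcing a different encoding is unlikely to be the fix.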

Based on the error, it sounds to me like the file isn't UTF-16LE (the default assumption of xlrd), so you're going to have to determine the encoding somehow or else start trying encodings one at a time, e.g.:

book = xlrd.open_workbook(..., encoding_override="cp1252")
book = xlrd.open_workbook(..., encoding_override="utf-8")
book = xlrd.open_workbook(..., encoding_override="latin-1")
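The trial-and-error approach above could be wrapped in a small loop. This is a minimal sketch, not a definitive fix: open_with_fallback and its opener parameter are hypothetical names introduced here, the candidate list simply repeats the three encodings from the lines above, and it assumes xlrd is installed when no opener is passed in.

```python
def open_with_fallback(path, encodings=("cp1252", "utf-8", "latin-1"), opener=None):
    """Try each encoding_override in turn; return (book, encoding) on first success.

    `opener` is a hypothetical hook so the loop can be exercised without xlrd;
    by default it falls back to xlrd.open_workbook.
    """
    if opener is None:
        import xlrd  # assumed to be installed
        opener = xlrd.open_workbook

    failures = {}
    for encoding in encodings:
        try:
            return opener(path, encoding_override=encoding), encoding
        except (UnicodeDecodeError, LookupError) as exc:
            # Remember why this candidate failed and move on to the next one.
            failures[encoding] = exc
    raise ValueError(f"none of the encodings worked for {path!r}: {failures}")
```

A call such as book, used = open_with_fallback(workbook_url) would then report which override succeeded. Note that an encoding that decodes without raising is not necessarily the right one, so it is still worth checking a few cell values by eye.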


Source: https://stackoverflow.com/questions/63478773/utf-16-le-codec-cant-decode-bytes-while-reading-excel-in-python
