Python - Decode UTF-16 file with BOM

Posted by 二次信任 on 2019-12-03 09:34:43

Question


I have a UTF-16 LE file with a BOM. I'd like to convert this file to UTF-8 without a BOM so I can parse it using Python.

The code I usually use didn't do the trick: it returned unknown characters instead of the actual file contents.

f = open('dbo.chrRaces.Table.sql').read()
f = str(f).decode('utf-16le', errors='ignore').encode('utf8')
print f

What would be the proper way to decode this file so I can parse through it with f.readlines()?


Answer 1:


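First, read the file in binary mode. In text mode the bytes may be altered before you see them (for example by newline translation or an implicit decode), which makes UTF-16 data confusing to work with.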

Then, check for and remove the BOM, since it is part of the file, but not part of the actual text.

import codecs

# Read in binary mode so the BOM and the UTF-16 bytes arrive untouched.
with open('dbo.chrRaces.Table.sql', 'rb') as fh:
    encoded_text = fh.read()

bom = codecs.BOM_UTF16_LE  # see dir(codecs) for the other BOM constants
# Make sure the encoding is what you expect; otherwise you'll get wrong data.
assert encoded_text.startswith(bom)
encoded_text = encoded_text[len(bom):]          # strip away the BOM
decoded_text = encoded_text.decode('utf-16le')  # decode to unicode
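
Since the question asks about readlines(), here is a possible alternative sketch, not part of the original answer. It relies on the generic 'utf-16' codec, which reads the BOM to pick the byte order and then drops it, so no manual stripping is needed. The helper name read_utf16_lines and the sample SQL text are made up for illustration.

```python
import codecs
import io
import os
import tempfile


def read_utf16_lines(path):
    """Return the file's lines as unicode strings (hypothetical helper)."""
    # The 'utf-16' codec uses the BOM to detect byte order and omits it
    # from the decoded text.
    with io.open(path, 'r', encoding='utf-16') as fh:
        return fh.readlines()


# Demo on a throwaway file; the content is invented for this example.
sample = u"CREATE TABLE chrRaces (id INT)\nGO\n"
fd, demo_path = tempfile.mkstemp(suffix='.sql')
os.close(fd)
with io.open(demo_path, 'wb') as fh:
    fh.write(codecs.BOM_UTF16_LE + sample.encode('utf-16-le'))

lines = read_utf16_lines(demo_path)
os.remove(demo_path)
```

Unlike the explicit assert, this variant would also accept a big-endian BOM, so it fits best when either byte order is acceptable.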

Don't encode (to utf-8 or otherwise) until you're done with all parsing/processing. You should do all that using unicode strings.
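
A minimal sketch of that ordering, assuming the input is held in memory as bytes: decode once at the start, process the unicode text, and encode to UTF-8 only at the very end. The function name utf16le_bytes_to_utf8 and the whitespace-stripping step are placeholders for whatever parsing is actually needed.

```python
import codecs


def utf16le_bytes_to_utf8(raw):
    """Decode BOM-prefixed UTF-16 LE bytes, process, return UTF-8 bytes."""
    bom = codecs.BOM_UTF16_LE
    if not raw.startswith(bom):
        raise ValueError("input does not start with a UTF-16 LE BOM")
    text = raw[len(bom):].decode('utf-16-le')

    # All processing happens on unicode text. As a stand-in for real
    # parsing, strip trailing whitespace from every line.
    processed = u"\n".join(line.rstrip() for line in text.splitlines())

    # Encode only once the processing is finished.
    return processed.encode('utf-8')


raw = codecs.BOM_UTF16_LE + u"caf\u00e9  \nna\u00efve\n".encode('utf-16-le')
utf8_bytes = utf16le_bytes_to_utf8(raw)
```

The resulting bytes carry no BOM, because the UTF-16 BOM was removed before decoding and the plain 'utf-8' codec does not write one.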

Also, errors='ignore' on decode may be a bad idea. Consider what's worse: having your program tell you something is wrong and stop, or returning wrong data?
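
To illustrate the trade-off, the sketch below feeds deliberately truncated UTF-16 LE data to both error modes. The input is constructed for this example and is not taken from the file in the question.

```python
# Two characters in UTF-16 LE take four bytes; dropping the last byte
# leaves the second character incomplete.
truncated = u"ab".encode('utf-16-le')[:-1]

# Default (strict) decoding reports the problem.
try:
    truncated.decode('utf-16-le')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' silently drops the damaged part instead.
ignored = truncated.decode('utf-16-le', errors='ignore')
```

With strict decoding the damage is reported immediately, while errors='ignore' hands back shorter text with no hint that anything was lost.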



Source: https://stackoverflow.com/questions/22459020/python-decode-utf-16-file-with-bom
