Reading lines of a text file and getting a charmap decode error

Posted by 北慕城南 on 2019-12-06 10:39:16

I checked the file, and the root of the problem seems to be that it contains words in at least two encodings: probably cp1252 and cp850. The byte 0x81 is ü in cp850 but undefined in cp1252. You can handle that situation by catching the exception, but some other German characters map to valid but wrong characters in cp1252. If you are happy with such an imperfect solution, here's how you could do it:

with open('sorted.de.word.unigrams', 'rb') as f:  # open in binary mode
    for line in f:
        for cp in ('cp1252', 'cp850'):
            try:
                s = line.decode(cp)
            except UnicodeDecodeError:
                pass
            else:
                store_to_db(s)
                break
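
To see why this fallback is imperfect, here is a minimal sketch that decodes two hand-picked byte strings with both code pages. The byte strings are illustrative values, not taken from the original file, and the helper name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(raw, encodings=("cp1252", "cp850")):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the bytes")

# 0x81 is undefined in cp1252, so the loop falls through to cp850.
text_a, enc_a = decode_with_fallback(b"f\x81r")

# 0x84 is 'ä' in cp850 but a valid low quotation mark in cp1252,
# so cp1252 "succeeds" and silently yields the wrong character.
text_b, enc_b = decode_with_fallback(b"M\x84dchen")

print(repr(text_a), enc_a)
print(repr(text_b), enc_b)
```

The second case is the one to watch for: no exception is raised, so the wrongly decoded word would be stored without any warning.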

Try

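import codecs

data = []
# Without an explicit encoding argument, codecs.open falls back to the
# built-in open() and the platform's default encoding.
with codecs.open('sorted.de.word.unigrams', 'r') as f:
    for line in f:
        data.append(line)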

If you want to ignore the error, you can do:

try:
    ...  # your code that enters data into the database
except UnicodeDecodeError:
    pass
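
Depending on where the `try` is placed, catching the exception may skip whole lines or stop reading early. One possible alternative, sketched below, is the `errors` parameter of the built-in `open()`, which substitutes undecodable bytes instead of raising. The function name `read_lines_lenient`, the default encoding, and the throwaway demo file are assumptions for illustration only.

```python
import os
import tempfile

def read_lines_lenient(path, encoding="cp1252"):
    """Read all lines, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in f]

# Demonstration on a temporary file containing the problematic byte 0x81.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"f\x81r\nHaus\n")
    demo_path = tmp.name

lines = read_lines_lenient(demo_path)
os.remove(demo_path)
print(lines)
```

With `errors="replace"` every line is kept, but the affected characters are lost, so this only suits cases where an approximate result is acceptable.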

This usually happens when there is an encoding mismatch.

The byte 0x81 is not mapped to any character in the codec being used (the "charmap" in the error message). Try specifying the encoding explicitly:

file = open(filename, encoding="utf8")