UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Submitted by 假装没事ソ on 2021-01-29 10:06:09

Question


I'm trying to open a series of HTML files in order to get the text from the body of those files using BeautifulSoup. I have about 435 files that I wanted to run through but I keep getting this error.

I've tried converting the HTML files to text and opening the text files but I get the same error...

import os

path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

I want to get the source code of each HTML file so I can parse it using Beautiful Soup, but I get this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Answer 1:


There are various approaches to dealing with text data of unknown encoding. However, in this case, as you intend to pass the data to Beautiful Soup, the solution is simple: don't bother trying to decode the file yourself; let Beautiful Soup do it. Beautiful Soup automatically detects the encoding and decodes bytes to Unicode.

In your current code, you read the file in text mode, so Python decodes it with the platform's default encoding (UTF-8 in your case, as the traceback shows) unless you pass an encoding argument to the open function. This raises an error if the file's contents are not valid UTF-8.

for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()
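
To see the failure in isolation, here is a minimal sketch that writes a few bytes containing 0x92 to a temporary file and reads them back in text mode with two different encodings. The file name and the sample contents are made up for illustration; they are not taken from the question's data.

```python
import os
import tempfile

# Hypothetical sample: an HTML fragment saved as cp1252,
# where byte 0x92 is a curly apostrophe.
raw = b"<p>Bitcoin\x92s price</p>"

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.html")  # illustrative file name
    with open(sample, "wb") as f:
        f.write(raw)

    # Decoding as UTF-8 fails, because 0x92 cannot start a UTF-8 sequence.
    try:
        with open(sample, "r", encoding="utf-8") as f:
            f.read()
        utf8_failed = False
        failure_reason = None
    except UnicodeDecodeError as exc:
        utf8_failed = True
        failure_reason = exc.reason

    # Naming the encoding the file was actually saved in succeeds.
    with open(sample, "r", encoding="cp1252") as f:
        text = f.read()

print(utf8_failed, failure_reason, repr(text))
```

Passing an explicit encoding only helps when every file really uses that encoding, which is rarely known in advance for a folder of scraped pages.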

Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()
    # Parse inside the loop so that every file is processed, not just the last one
    soup = BeautifulSoup(bytes_, "html.parser")

FWIW, the file currently causing your problem is probably encoded with cp1252 or a similar Windows 8-bit encoding, in which byte 0x92 is the right single quotation mark (’):

>>> '’'.encode('cp1252')
b'\x92'
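
If you do need the decoded text yourself, for example to re-save the files as UTF-8, one common approach is to try a short list of candidate encodings in order. The sketch below is an illustration only: the helper name `decode_with_fallback` and the candidate list are assumptions, not something from the question, and a clean decode does not prove that the guess was right.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly.

    Hypothetical helper; the default candidates are a guess suited to files
    that are mostly UTF-8 with a few Windows-1252 stragglers.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never
    # fails, although the resulting characters may be wrong.
    return data.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"Bitcoin\x92s price")
print(used, repr(text))
```

UTF-8 goes first because it is strict: bytes from an 8-bit encoding rarely form valid UTF-8 by accident, whereas cp1252 accepts almost any byte sequence.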


Source: https://stackoverflow.com/questions/55857074/unicodedecodeerror-utf-8-codec-cant-decode-byte-0x92-in-position-2893-invali
