“illegal multibyte sequence” error from BeautifulSoup when using Python 3

淺唱寂寞╮ submitted on 2020-01-14 05:48:06

Question


I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it.

It all worked fine until I recently switched to Python 3.

I tested the same .html file on another machine running Python 2; it works there and returns the page contents.

soup = BeautifulSoup(open('page.html'), "lxml")

The machine with Python 3 doesn't work, and it reports:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence

I searched around and tried the variations below, but none of them worked (using 'r' or 'rb' doesn't make much difference):

soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')

How can I use Python 3 to parse this html page?

Thank you.


Answer 1:


On "it worked fine until the switch to Python 3":

In Python 3, str is Unicode text, so when you open a file in text mode Python decodes its bytes, using the platform's default encoding unless you pass encoding= (here that default is 'gbk', judging by the error message). Python 2, on the other hand, uses byte strings and just returns the content of the file as-is. Try opening page.html in binary mode (open('page.html', 'rb')) so that the raw bytes are passed to BeautifulSoup, and see if that works for you.
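
A minimal sketch of the difference between the two modes, using only the standard library. The sample bytes and the names SAMPLE, read_as_text, read_as_bytes and demo are made up for illustration, and 'gbk' is passed explicitly to mimic the default reported in the error message. The bytes returned by binary mode are what you would hand to BeautifulSoup in place of the text-mode file object.

```python
import os
import tempfile

# Hypothetical page content: byte 0x92 followed by a space is not a
# valid GBK sequence, so decoding it as GBK should fail.
SAMPLE = b"<p>the cars\x92 prices</p>"


def read_as_text(path, encoding):
    """Read the file in text mode, decoding its bytes with `encoding`."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def read_as_bytes(path):
    """Read the file in binary mode; no decoding takes place."""
    with open(path, "rb") as f:
        return f.read()


def demo():
    """Write SAMPLE to a temporary file and read it back both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page.html")
        with open(path, "wb") as f:
            f.write(SAMPLE)
        try:
            read_as_text(path, "gbk")
            text_mode_failed = False
        except UnicodeDecodeError:
            text_mode_failed = True
        raw = read_as_bytes(path)
    return text_mode_failed, raw


text_mode_failed, raw = demo()
print(text_mode_failed)
print(raw)
```

Binary mode only postpones the decoding decision: whatever consumes the bytes still has to work out the encoding.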




Answer 2:


I made two changes and am not sure which one (or both) took effect.

The computer was formatted and reinstalled, so some settings are different.

1. In the language settings:

Administrative language settings > Change system locale

Tick the box:

Beta: Use Unicode UTF-8 for worldwide language support

2. In the code. For example, this is the original line:

print (soup.find_all('span', attrs={'class': 'listing-row__price'})[0].text.strip().encode("utf-8"))

When the .encode("utf-8") part was removed, it worked.
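
A small sketch of what .encode("utf-8") changes in Python 3. The price string is a made-up stand-in for the scraped text, and the original failure may have had a different cause, so treat this as illustration only.

```python
# Hypothetical stand-in for the text returned by .text.strip()
price = "\u20ac12,995"  # euro sign followed by digits

# .encode() turns the str into a bytes object
encoded = price.encode("utf-8")

# print() shows a bytes object via its repr: a b'...' wrapper with
# non-ASCII bytes escaped. A str is shown as the text itself.
shown_with_encode = repr(encoded)
shown_without_encode = price

print(shown_with_encode)
```

Without .encode(), print receives a str and writes the characters using the console's encoding.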

  • Update on 16 Oct. 2019: the above change works, but when the box "Beta: Use Unicode UTF-8 for worldwide language support" is ticked, fonts and text in foreign-language software don't display properly.

When the box is unticked, fonts and text in foreign-language software display well, but the problem in the question remains.

Solution with the box unticked (both the foreign-language software and the Python code work):

soup = BeautifulSoup(open(pages, 'r', encoding = 'utf-8', errors='ignore'), "lxml")
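
To see what errors='ignore' does, here is a minimal sketch on a few hand-made bytes (RAW and decode_utf8 are hypothetical names; the real page content is unknown). Undecodable bytes are silently dropped, so some characters may go missing from the parsed text, while errors='replace' leaves a visible U+FFFD marker instead. The last decode assumes, without evidence from the question, that the page might be Windows-1252, where byte 0x92 is a right single quotation mark.

```python
# Hypothetical bytes: 0x92 is not valid on its own in UTF-8
RAW = b"dealer\x92s price"


def decode_utf8(raw, errors):
    """Decode bytes as UTF-8 with the given error handler."""
    return raw.decode("utf-8", errors=errors)


ignored = decode_utf8(RAW, "ignore")    # the 0x92 byte is dropped
replaced = decode_utf8(RAW, "replace")  # the 0x92 byte becomes U+FFFD

# Assumption: the page is really Windows-1252 rather than UTF-8
guessed = RAW.decode("cp1252")

print(ascii(ignored))
print(ascii(replaced))
print(ascii(guessed))
```

If the dropped characters matter, it may be worth finding the page's real encoding and passing that to open() instead of ignoring errors.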


Source: https://stackoverflow.com/questions/58300101/illegal-multibyte-sequence-error-from-beautifulsoup-when-python-3
