Chinese character encoding error with BeautifulSoup in Python?

Submitted by 心不动则不痛 on 2020-01-04 05:23:07

Question


I'd like to use BeautifulSoup to get the data in a table from a website, but it doesn't read the Chinese characters correctly. This is my code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.515fa.com/che_1978.html").read()
soup=BeautifulSoup(html,from_encoding="UTF-8")
print soup.prettify()

And the Chinese characters are displayed like this:

<td align="center" bgcolor="#FFFFFF" u1:str="" width="173">
               ćé¸</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="149">
               ä¸ćľˇĺ¤§äź</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="126">
               大äź</td>

I really don't know what "ä¸ćľˇĺ¤§äź" is. I tried changing the encoding from "utf-8" to "gb18030", but that didn't work either. How can I get the correct Chinese characters? Thanks!


Answer 1:


Try:

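import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen("http://www.515fa.com/che_1978.html")
# Decode the raw bytes to Unicode yourself, dropping any invalid UTF-8 sequences
content = response.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)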

I'm not sure exactly what BeautifulSoup(from_encoding=) did here, but decoding the content myself before parsing did the trick.
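
For what it's worth, here is a small sketch of what may be going on, written for Python 3 and using only the standard library. The sample string "上海大众" and the stray 0xFF byte are my own assumptions for illustration; they are not taken from the page. The garbled text in the question is consistent with UTF-8 bytes being read as ISO-8859-2, and a single invalid byte is enough to make a strict UTF-8 decode fail, which is one possible (unconfirmed) reason a parser might fall back to another encoding.

```python
# Assumed sample text, chosen for illustration only.
text = "上海大众"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as ISO-8859-2 produces mojibake.
garbled = utf8_bytes.decode("iso-8859-2")
# Drop the invisible C1 control characters (U+0080..U+009F) that a browser
# or terminal would not display.
visible = "".join(ch for ch in garbled if not (0x80 <= ord(ch) <= 0x9F))
print(visible)

# Hypothetical damaged input: valid UTF-8 followed by one invalid byte.
damaged = utf8_bytes + b"\xff"

try:
    damaged.decode("utf-8")          # strict decoding rejects the bad byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="ignore" the bad byte is silently dropped and the rest survives.
recovered = damaged.decode("utf-8", "ignore")
print(strict_ok, recovered)
```

The string in `recovered` is the kind of value the snippet above passes to BeautifulSoup. Keep in mind that 'ignore' discards undecodable bytes without warning, so it only makes sense when the page really is UTF-8 apart from a few bad bytes.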



Source: https://stackoverflow.com/questions/32176558/chinese-character-encoding-error-with-beautifulsoup-in-python
