Question:
So I parsed an HTML page with .findAll (BeautifulSoup) into a variable named result. If I type result in the Python shell and press Enter, I see normal text as expected, but when I wanted to postprocess this result as a string object, I noticed that str(result) returns garbage, like this sample:
\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div>
The HTML page source is UTF-8 encoded.
How can I handle this?
The code is basically this, in case it matters:
from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)
Python is 2.7
Answer 1:
Python 2.6.7, BeautifulSoup version 3.2.0
This worked for me:
u'\n'.join(map(unicode, result))
I'm pretty sure result is a BeautifulSoup.ResultSet object, which appears to be an extension of the standard Python list.
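For instance, a minimal sketch of using that (result is the ResultSet from the question; the encode call is my addition, assuming you eventually want UTF-8 bytes for output):

text = u'\n'.join(map(unicode, result))
# encode only at the output boundary, e.g. when printing or writing to a file
utf8_bytes = text.encode('utf-8')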
Answer 2:
from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen(url).read())
# findAll returns multiple parsed results
result = soup.findAll(something)
# then iterate over the results
for line in result:
    # get the str value of each tag; replace 'charset' with 'utf-8' or whatever charset you need
    print line.__str__('charset')
BTW: the BeautifulSoup version here is beautifulsoup-3.2.1.
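For instance, since the question's page is UTF-8, the call in the loop above would be (a sketch, reusing the line variable):

print line.__str__('utf-8')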
Answer 3:
That's not garbage, that's UTF-8-encoded text. Use Unicode instead.
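A minimal sketch of what that means, assuming the url and something placeholders from the question and that the page really is UTF-8:

from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)
# unicode() gives text; str() would give the \xd1\x87... UTF-8 bytes shown in the question
text = u'\n'.join(unicode(tag) for tag in result)
# encode back to UTF-8 only when you actually need bytes, e.g. for printing or writing to a file
print text.encode('utf-8')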
Answer 4:
Use this:
import unicodedata

# p is the byte string you got from str(); decode it with the page's charset before normalizing
unicodedata.normalize('NFKC', p.decode('utf-8')).encode('ascii', 'ignore')
Unicode has multiple normalization forms. That output should not be garbage.
Use the originalEncoding attribute to verify the encoding scheme.
Regarding Python's Unicode implementation, refer to this document (it also covers normalization).
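A hedged sketch tying these together (soup and result are the objects from the question; originalEncoding is BeautifulSoup 3's record of the detected encoding):

import unicodedata

# check what encoding BeautifulSoup detected for the page
print soup.originalEncoding  # expected to be 'utf-8' here
for tag in result:
    # normalize the unicode text, then drop anything ASCII cannot represent
    normalized = unicodedata.normalize('NFKC', unicode(tag))
    print normalized.encode('ascii', 'ignore')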