How to convert BeautifulSoup.ResultSet to string

Anonymous (unverified), posted on 2019-12-03 03:02:02

Question:

I parsed an HTML page with .findAll (BeautifulSoup) and stored the output in a variable named result. If I type result in the Python shell and press Enter, I see normal text as expected. But when I tried to post-process the result as a string object, I noticed that str(result) returns garbage, like this sample:

\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div> 

The HTML page source is UTF-8 encoded.

How can I handle this?


Code is basically this, in case it matters:

import urllib
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)

Python is 2.7

Answer 1:

Python 2.6.7, BeautifulSoup.__version__ 3.2.0

This worked for me:

unicode.join(u'\n',map(unicode,result)) 

I'm pretty sure result is a BeautifulSoup.ResultSet object, which seems to be an extension of the standard Python list.



Answer 2:

import urllib
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen(url).read())
# findAll returns multiple parsed results
result = soup.findAll(something)
# then iterate over the results
for line in result:
    # get the str value of each element; replace 'charset' with utf-8 or whichever charset you need
    print line.__str__('charset')

BTW: the BeautifulSoup version is beautifulsoup-3.2.1.



Answer 3:

That's not garbage, that's UTF-8-encoded text. Use Unicode instead.
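
To see why the output in the question is not garbage, the escaped bytes from the sample can be decoded by hand. The sketch below uses only the byte sequence quoted in the question; the variable names are illustrative and not from the original post. It should run on both Python 2.7 and Python 3.

```python
# Bytes copied from the escaped part of the sample output in the question.
raw = b'\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0'

# Decode with the page's encoding (UTF-8) to get a Unicode string.
text = raw.decode('utf-8')

print(len(raw))    # length in bytes
print(len(text))   # length in Unicode characters (two bytes each here)
print(repr(raw))   # the \x.. escapes, as the interactive shell echoes them
```

The \x.. escapes come from the shell echoing the repr() of a byte string. Printing the decoded text on a terminal that supports Cyrillic should show readable letters instead.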



Answer 4:

Use this:

import unicodedata
unicodedata.normalize('NFKC', p.decode('utf-8')).encode('ascii', 'ignore')

Unicode has multiple normalization forms. That output should not be garbage.
Use the originalEncoding attribute to verify the encoding scheme.
For details on Python's Unicode implementation (including normalization), refer to this document.
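
A small sketch of what the normalize-then-encode step in this answer does, assuming the input is UTF-8 encoded bytes. The helper name to_ascii and the sample inputs are made up for illustration. Note that encode('ascii', 'ignore') silently drops every character that has no ASCII form, so Cyrillic text like the sample in the question would come out empty.

```python
import unicodedata

def to_ascii(raw_bytes, encoding='utf-8'):
    """Decode bytes, apply NFKC normalization, and drop non-ASCII characters."""
    text = raw_bytes.decode(encoding)
    normalized = unicodedata.normalize('NFKC', text)
    return normalized.encode('ascii', 'ignore')

# NFKC folds the ligature U+FB01 into the two ASCII letters "fi".
ligature = to_ascii(u'\ufb01le'.encode('utf-8'))

# Cyrillic letters have no ASCII equivalent, so nothing survives.
cyrillic = to_ascii(b'\xd1\x87\xd0\xb8')

print(ligature)
print(cyrillic)
```

This approach is only suitable when losing non-ASCII characters is acceptable; to keep the original text, decoding to Unicode without the ASCII step is the safer route.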


