Use soup.get_text() with UTF-8

旧时模样 posted on 2021-02-18 11:40:35

Question


I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation says you can do this with soup.get_text(). When I tried it on reddit.com, I got this error:


UnicodeEncodeError in soup.py:16
  'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence

I get errors like that on most of the sites I checked.
I got similar errors from soup.prettify() too, but I fixed those by changing the call to soup.prettify('UTF-8'). Is there a way to fix get_text() as well? Thanks in advance!

Update June 24
I've found a bit of code that seems to work for other people, but I still need to use UTF-8 instead of the default. Code:


import re

texts = soup.findAll(text=True)

def visible(element):
    # Skip text nodes that live inside non-rendered tags.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments and nodes that start with a newline.
    elif re.match('<!--.*-->', str(element)):
        return False
    elif re.match('\n', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

print visible_texts

Error is different, though. Progress?


UnicodeEncodeError in soup.py:29
  'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)

Answer 1:


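soup.get_text() returns a Unicode string. The error occurs when Python 2 implicitly encodes that string for output (through print or str()) with a codec such as cp932 or ascii that cannot represent every character on the page.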
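
To make the failure concrete, here is a minimal sketch that needs no network access and no BeautifulSoup. The sample string and the helper name can_encode are assumptions made up for illustration; the two non-ASCII characters are the ones named in the error messages above.

```python
# Sample text containing the characters from the tracebacks:
# U+00A0 (no-break space) and U+00BB (right-pointing double angle quote).
text = u"price:\xa0100 \xbb"


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("ascii", "cp932", "utf-8"):
    print(encoding, can_encode(text, encoding))

# Asking the codec to substitute instead of raising is another option,
# at the cost of losing the original characters.
lossy = text.encode("ascii", "replace")
```

Only the UTF-8 case succeeds here, which is why the fixes below all amount to making UTF-8 the encoding used for output.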

You can solve this in a number of ways, including setting the output encoding at the shell level:

export PYTHONIOENCODING=UTF-8
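
One way to check that the variable takes effect is to start a child interpreter with it set and ask what encoding its stdout uses. This is only a sketch: the helper name child_stdout_encoding is hypothetical, and it assumes the child interpreter is the same one running the script (sys.executable).

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return its stdout codec name."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalise spellings such as "UTF-8" or "utf8" to the canonical codec name.
    return codecs.lookup(out.decode("ascii").strip()).name


print(child_stdout_encoding("UTF-8"))
```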

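In Python 2 you can also reload sys and change the default encoding by including this in your script. This is widely discouraged as a hack, and it does not carry over to Python 3, where reload() is no longer a builtin and sys.setdefaultencoding() has been removed.

import sys

if __name__ == "__main__":
  reload(sys)
  sys.setdefaultencoding("utf-8")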

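Or you can encode the string as UTF-8 in code. For your reddit problem, something like the following would work:

import urllib
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/python"
html = urllib.urlopen(url).read()  # Python 2; in Python 3 this is urllib.request.urlopen
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, "html.parser")

# get_text() returns a Unicode string
text = soup.get_text()

# Encode explicitly so print never falls back to the console codec
print(text.encode('utf-8'))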
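
On Python 3, printing the result of encode() shows the bytes repr (b'...') rather than the text, so the same idea is usually expressed by writing the encoded bytes to a binary stream. The helper below is a hedged sketch, not part of the original answer: write_utf8 is a made-up name, and an in-memory buffer stands in for the real output stream.

```python
import io


def write_utf8(text, stream):
    """Encode text as UTF-8 and write it, plus a newline, to a binary stream."""
    stream.write(text.encode("utf-8"))
    stream.write(b"\n")


# io.BytesIO stands in for a binary output stream in this sketch.
buf = io.BytesIO()
write_utf8(u"reddit \xbb python\xa0", buf)
```

In a real Python 3 script the second argument could be sys.stdout.buffer, for example write_utf8(soup.get_text(), sys.stdout.buffer).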



Answer 2:


In Python 2, you can't call str(text) if the page may contain non-ASCII characters: str() implicitly encodes with the ascii codec and raises UnicodeEncodeError. Use unicode() instead of str().
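
Applied to the filter from the question, that advice might look like the sketch below. It is a simplified variant, not the asker's exact code: it drops the comment check, treats whitespace-only nodes as invisible, and the names text_type and HIDDEN_PARENTS are assumptions introduced here. The try/except only lets the same source run on both Python 2 and Python 3.

```python
import re

try:
    text_type = unicode  # Python 2: the Unicode string type
except NameError:
    text_type = str      # Python 3: str is already Unicode

# Tags whose text is never rendered on the page.
HIDDEN_PARENTS = {'style', 'script', '[document]', 'head', 'title'}


def visible(element):
    """Return True if a text node should count as visible page text."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    # text_type() does not encode, so characters such as u'\xbb' are safe here.
    if re.match(r'\s*$', text_type(element)):
        return False
    return True
```

With BeautifulSoup, the elements would be the text nodes returned by soup.findAll(text=True), as in the question.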



Source: https://stackoverflow.com/questions/11179878/use-soup-get-text-with-utf-8
