Why am I getting the “UnicodeEncodeError: 'charmap' codec can't encode character '\u25b2' in position 84811: character maps to <undefined>” error?

Posted by 别来无恙 on 2020-12-13 04:06:12

Question


I'm getting UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined> while running this code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://stackoverflow.com').text
soup = BeautifulSoup(r, 'lxml')
print(soup.prettify())

and the output is:

Traceback (most recent call last):
  File "c:\Users\Asus\Documents\Hello World\Web Scraping\st.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Users\Asus\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>

I'm using Python 3.8.1 and UTF-8 in VS Code. How can I solve this?


Answer 1:


There are hints in the full error message. Here are the parts that seem most important:

Traceback ...
  File "...\cp1252.py", ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' ...

The error is caused by the print call. Somewhere in your text there is a ZERO WIDTH SPACE character (Unicode U+200B). When you print to a Windows console, the string is encoded into the console code page (cp1252 here), and ZERO WIDTH SPACE has no representation in that code page. By the way, the default Windows console is not very Unicode friendly.
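The failure can be reproduced without any scraping. The following is a minimal sketch, assuming the console encoding is cp1252 as in the traceback; the helper name `can_encode` and the sample string are made up for illustration.

```python
import sys

def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = "zero\u200bwidth"  # contains ZERO WIDTH SPACE (U+200B)

print(can_encode(sample, "cp1252"))  # cp1252 has no mapping for U+200B
print(can_encode(sample, "utf-8"))   # UTF-8 can represent this character

# The encoding print() will actually use for this console or pipe:
print(sys.stdout.encoding)
```

If the last line reports cp1252 (or another legacy code page), any character outside that code page will make print raise the same error.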

There is little you can do in the Windows console itself. I would advise you to try one of these workarounds:

  • Do not print to the console; write to a UTF-8 file instead. You can then read it with a UTF-8 capable text editor such as Notepad++.

  • Manually encode anything before printing it, with errors='ignore' or errors='replace', then decode it back so that print receives text rather than bytes. That way, offending characters are dropped or replaced and no error is raised:

      print(soup.prettify().encode('cp1252', errors='ignore').decode('cp1252'))
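Both workarounds can be sketched as follows. This is only one possible way to write them: the helper names `save_utf8` and `console_safe` and the file name `page.html` are assumptions made for illustration, and the sample string stands in for `soup.prettify()`.

```python
import os
import sys
import tempfile

def save_utf8(text, path):
    """Workaround 1: write the text to a UTF-8 file instead of the console."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def console_safe(text, encoding=None, errors="replace"):
    """Workaround 2: round-trip text through the console encoding.

    Characters the encoding cannot represent are replaced with '?'
    (errors='replace') or dropped (errors='ignore').
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors=errors).decode(encoding)

html = "a\u200bb \u25b2"  # stand-in for soup.prettify()

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "page.html")
save_utf8(html, out_path)

print(console_safe(html, "cp1252"))            # unencodable characters become '?'
print(console_safe(html, "cp1252", "ignore"))  # unencodable characters are dropped
```

The file keeps every character intact, while the console-safe version loses information, so the file is the better choice when the output needs to be inspected accurately.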
    



Answer 2:


You can explore a little on your own, but in Python 2.7 what I usually do to clean my text is:

text = text.encode('utf-8').decode('ascii', 'ignore')

The same line also works in Python 3 (calling str(text) alone does not remove anything). A shorter equivalent is:

text = text.encode('ascii', 'ignore').decode('ascii')

For your case, try this:

r = requests.get('https://stackoverflow.com').text.encode('utf8').decode('ascii', 'ignore')
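Keep in mind that this kind of cleaning removes every non-ASCII character, not only the one that triggered the error. A small sketch of the effect, where the function name `strip_non_ascii` and the sample string are invented for illustration:

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")

# 'é', ZERO WIDTH SPACE and '▲' are all non-ASCII
sample = "caf\u00e9\u200b \u25b2 ok"
cleaned = strip_non_ascii(sample)

print(repr(cleaned))
```

Accented letters and symbols disappear along with the invisible characters, so this is only suitable when the non-ASCII content does not matter.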

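Otherwise, the usual approach is to pass the raw response bytes to BeautifulSoup:

r = requests.get('https://stackoverflow.com')
soup = BeautifulSoup(r.content, 'lxml')
print(soup)

(I don't think this should give any error, although print can still fail if the console code page cannot represent a character.)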



Source: https://stackoverflow.com/questions/62656579/why-im-getting-unicodeencodeerror-charmap-codec-cant-encode-character-u2
