Help Replacing Non-ASCII character in Python

Posted by 纵然是瞬间 on 2019-12-08 12:18:21

Question


I have a bunch of HTML files that I downloaded using the httplib2 package in Python. Non-breaking spaces ('&nbsp;') are showing up as 'Â '.

<font color="#ff0000">02/12/2004Â </font> is showing while <font color="#ff0000">02/12/2004&nbsp;</font> is the desired format.

How do I replace the 'Â ' with '&nbsp;' in Python? Thanks a lot!


Answer 1:


You've got an encoding problem. Instead of trying to remove these characters, find out the encoding of the page; then, when you read the file, use the codecs module instead of open(), passing the proper character encoding.
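A minimal sketch of that approach, assuming the pages are UTF-8 (a 'Â ' in the output is consistent with the UTF-8 bytes of a non-breaking space being read as Latin-1, but check the page's declared charset). The sample bytes and the temporary file are hypothetical stand-ins for a downloaded page:

```python
import codecs
import os
import tempfile

# Hypothetical sample: UTF-8 bytes of a date followed by a non-breaking space.
raw = b'<font color="#ff0000">02/12/2004\xc2\xa0</font>'

# Write the bytes to a temporary file that stands in for a downloaded page.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Read it back, telling codecs which encoding the bytes are in (assumed UTF-8).
with codecs.open(path, "r", encoding="utf-8") as f:
    text = f.read()
os.remove(path)

# The non-breaking space is now a single character (U+00A0), with no stray 'Â',
# so it can be turned into the HTML entity directly.
html = text.replace(u"\xa0", "&nbsp;")
print(html)
```

In Python 3 the built-in open() accepts the same encoding argument, so open(path, encoding="utf-8") should work equally well there.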




Answer 2:


import string
filtered_content = ''.join(ch for ch in content if ch in string.printable)  # drops every non-ASCII character

This solved my problem. Thank you!




Answer 3:


s = s.replace('Â ', '&nbsp;')

However, while I haven't used httplib2, I'm pretty sure something is wrong if the source of the HTML files is being changed when you download them. It may be that there's a decoding problem going on. What version of Python are you using? If it's Python 3, the contents will be byte sequences, not strings, so you'll have to specify the right encoding to decode the bytes with.
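To illustrate the decoding point, here is a small sketch with a hypothetical response body. It assumes the server sent UTF-8 and that the garbled text came from decoding those bytes as Latin-1; the real encoding should be taken from the response headers or the page's meta tag:

```python
# Hypothetical response body: bytes, as a Python 3 HTTP client would return them.
content = b"02/12/2004\xc2\xa0"

# Wrong guess: Latin-1 maps each byte to its own character, producing 'Â' plus U+00A0.
wrong = content.decode("latin-1")

# Assumed correct encoding for this sample: the two bytes form one non-breaking space.
right = content.decode("utf-8")

print(repr(wrong))
print(repr(right))

# Text that was already decoded with the wrong encoding can often be repaired
# by reversing that step and decoding again.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == right)
```

Decoding with the right encoding up front is usually preferable to repairing afterwards, since the round trip only works when the wrong decoding lost no bytes.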

http://code.google.com/p/httplib2/wiki/ExamplesPython3

EDIT: If you aren't limited to using just httplib2, perhaps you could try looking into using the urllib, urllib2, or httplib modules that are part of the Python 2.6 standard library?



Source: https://stackoverflow.com/questions/2921815/help-replacing-non-ascii-character-in-python
