HTMLParser.HTMLParser().unescape() doesn't work

Submitted by 放肆的年华 on 2021-01-27 06:31:55

Question


I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°', etc.

I've read several posts regarding this question

Converting html source content into readable format with Python 2.x

Decode HTML entities in Python string?

Convert XML/HTML Entities into Unicode String in Python

and according to them, I chose to use the undocumented function unescape(), but it doesn't work for me...

My code sample is like:

import HTMLParser

htmlParser = HTMLParser.HTMLParser()
decoded = htmlParser.unescape('&copy; 2013')
print decoded

When I run this Python script, the output is still:

&copy; 2013

instead of

© 2013

I'm using Python 2.x on Windows 7 with the Cygwin console. I searched and didn't find any similar problems. Could anyone help me with this?


Answer 1:


Apparently HTMLParser.unescape was a bit more primitive before Python 2.6.

Python 2.5:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'

Python 2.6/2.7:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'

See the 2.5 implementation vs the 2.6 implementation / 2.7 implementation
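
For readers on mixed Python versions, here is a minimal sketch of a version-tolerant lookup. It is not part of the original answer: the name html_unescape is chosen here for illustration, and the sketch assumes either Python 3.4+ (where the standard library provides html.unescape) or Python 2.6/2.7 (where HTMLParser.HTMLParser().unescape behaves as shown above). On Python 2.5 the fallback would still return the input unchanged.

```python
# Pick whichever unescape function the running interpreter provides.
try:
    # Python 3.4+: documented standard-library function.
    from html import unescape as html_unescape
except ImportError:
    # Python 2.6/2.7: undocumented method on the parser class.
    from HTMLParser import HTMLParser
    html_unescape = HTMLParser().unescape

# The result is a unicode string in both cases.
decoded = html_unescape('&copy; 2013')
```

With either branch, decoded should hold the copyright sign followed by " 2013". Whether it prints correctly still depends on the console encoding.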




Answer 2:


This site lists some solutions; here's one of them:

from xml.sax.saxutils import escape, unescape

html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
    "©": "&copy;"
    # etc...
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_unescape(text):
    return unescape(text, html_unescape_table)

Not the prettiest thing though, since you would have to list each escaped symbol manually.
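
One way to avoid listing every symbol by hand is to build the table from the standard library's entity definitions. The following is only a sketch under some assumptions: it is written for Python 3, with fallback imports for Python 2; the names entity_table and html_unescape are illustrative; and it covers named entities only, not numeric references such as &#169;. The "amp" entry is left out on purpose because xml.sax.saxutils.unescape already replaces &amp; as its last step, which avoids decoding text like "&amp;copy;" twice.

```python
from xml.sax.saxutils import unescape

try:
    from html.entities import name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

try:
    unichr  # Python 2 builtin
except NameError:
    unichr = chr  # Python 3: chr already returns a unicode character

# Map "&name;" to its character for every known named entity,
# skipping "amp" so that saxutils can handle it last.
entity_table = dict(
    ("&%s;" % name, unichr(codepoint))
    for name, codepoint in name2codepoint.items()
    if name != "amp"
)

def html_unescape(text):
    """Replace named HTML entities in text with their characters."""
    return unescape(text, entity_table)
```

This walks the whole table on every call, so it is slower than a dedicated parser, but it keeps the same shape as the hand-written version.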

EDIT:

How about this?

import htmllib

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()


Source: https://stackoverflow.com/questions/17751439/htmlparser-htmlparser-unescape-doesnt-work
