The solutions in other answers do not work when I try them; those methods output the same string unchanged.
I am trying to do web scraping using Python 2.7. I have the webpage downloaded, and it contains some characters in the form &#120, where 120 seems to represent the ASCII code. I tried using the HTMLParser() and decode() methods, but nothing seems to work.
Please note that those characters, in exactly that format, are all I have from the webpage.
Example:
&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32
Please guide me in decoding these strings using Python. I have read the other answers, but their solutions don't seem to work for me.
Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.
However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself, e.g.:
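s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Split on '&#', drop empty pieces, and convert each number to its character
print ''.join([chr(int(u)) for u in s.split('&#') if u])
Output:
Blasterjaxx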
The if u skips over the initial empty string that we get because s begins with the splitting string '&#'. Alternatively, we could skip it by slicing:
''.join([chr(int(u)) for u in s.split('&#')[1:]])
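The split-based approach assumes the string contains nothing but references. If the references are mixed with ordinary text, a regular-expression substitution is one alternative. The following is only a minimal sketch, not part of the original answer: the helper name decode_refs is hypothetical, and the optional trailing ; in the pattern is an assumption so that both terminated and unterminated decimal references are handled.

```python
import re

try:
    unichr  # Python 2: chr() only covers 0-255, unichr() covers Unicode
except NameError:
    unichr = chr  # Python 3: chr() already accepts any code point


def decode_refs(text):
    """Replace each decimal reference (&#NNN or &#NNN;) with its character."""
    # decode_refs is a hypothetical helper name used for illustration only.
    return re.sub(r'&#([0-9]+);?', lambda m: unichr(int(m.group(1))), text)


s = 'Artist: &#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32(live)'
print(decode_refs(s))
```

Text outside the references passes through untouched, so there is no leading empty string to skip.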
The correct format for a character reference is &#nnnn; so the ; is missing in your example. You can add the ; and then use HTMLParser.unescape():
from HTMLParser import HTMLParser
import re
x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Append the missing ';' after each numeric reference
x = re.sub(r'(&#[0-9]*)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)
This gives the following output:
&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;
Blasterjaxx
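One caveat with that re.sub call: if some references on the page already end in ;, it appends a second one. A hedged variant, assuming decimal references only, consumes an optional existing ; and writes exactly one back. The name add_semicolons is hypothetical and used here only for illustration.

```python
import re


def add_semicolons(text):
    """Terminate every decimal reference with exactly one ';'."""
    # ';?' swallows a semicolon that is already present, so it is not doubled.
    return re.sub(r'&#([0-9]+);?', r'&#\1;', text)


# Mixed input: the middle reference is already terminated.
print(add_semicolons('&#66&#108;&#97'))
```

The result can then be passed to HTMLParser().unescape() as before.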
In Python 3, use the html module:
>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '
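As a quick sanity check (a sketch, assuming Python 3.4 or later, where html.unescape is available), the function accepts numeric references with or without the terminating ; and leaves the surrounding text alone. The variable names below are made up for the example.

```python
import html

name = 'Blasterjaxx '
# Build the same string in both spellings: without and with the trailing ';'
bare = ''.join('&#%d' % ord(c) for c in name)
terminated = ''.join('&#%d;' % ord(c) for c in name)

print(bare)
print(html.unescape(bare) == name)
print(html.unescape(terminated) == name)
print(html.unescape('Artist: ' + bare + '(live)'))
```

So the semicolon fix-up used with Python 2's HTMLParser is not needed for this input.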
Source: https://stackoverflow.com/questions/38479865/decoding-ampersand-hash-strings-12412097etc