Question
From reading various posts, it seems that JavaScript's unescape() is equivalent to Python's urllib.unquote(); however, when I test both I get different results:
In browser console:
unescape('%u003c%u0062%u0072%u003e');
output: <br>
In Python interpreter:
import urllib
urllib.unquote('%u003c%u0062%u0072%u003e')
output: %u003c%u0062%u0072%u003e
I would expect Python to also return <br>. Any ideas as to what I'm missing here?
Thanks!
Answer 1:
%uxxxx is a non-standard URL encoding scheme that is not supported by urllib.parse.unquote() (Py 3) / urllib.unquote() (Py 2).
It was only ever part of ECMAScript ECMA-262 3rd edition; the format was rejected by the W3C and was never a part of an RFC.
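To see the difference in practice, here is a quick check, assuming Python 3 (where the function lives in urllib.parse). The standard %xx escapes are decoded, while the %uxxxx sequences come back untouched. The variable names are only illustrative.

```python
from urllib.parse import unquote

# Standard percent-encoding (%xx) is decoded as expected.
standard = unquote('%3Cbr%3E')

# The non-standard %uxxxx form is not recognised and is returned unchanged.
nonstandard = unquote('%u003c%u0062%u0072%u003e')

print(standard)
print(nonstandard)
```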
You could use a regular expression to convert such codepoints:
import re

try:
    unichr  # only in Python 2
except NameError:
    unichr = chr  # Python 3

re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), quoted)
This decodes both the %uxxxx and the %uxx forms that ECMAScript 3rd ed can decode.
Demo:
>>> import re
>>> quoted = '%u003c%u0062%u0072%u003e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), quoted)
'<br>'
>>> altquoted = '%u3c%u0062%u0072%u3e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), altquoted)
'<br>'
However, you should avoid using this encoding altogether if possible.
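If the encoding cannot be avoided, the regular-expression approach could be wrapped in a small helper. The following is only a sketch, assuming Python 3.4 or later; js_unescape is a hypothetical name, not a standard-library function. It makes two choices beyond the snippet above, both assumptions on my part: it also decodes plain %xx escapes as single code points (Latin-1 style, rather than as UTF-8 bytes the way urllib.parse.unquote() does), and it joins UTF-16 surrogate pairs such as %uD83D%uDE00 into one character, since chr() alone would leave two lone surrogates. It handles only the four-digit %uxxxx form, not %uxx.

```python
import re

# Matches the four-digit %uXXXX form or the plain two-digit %XX form.
_ESCAPE_RE = re.compile(r'%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})')


def js_unescape(text):
    """Hypothetical helper approximating JavaScript's unescape()."""
    def _replace(match):
        # Exactly one of the two groups is set for any match.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    decoded = _ESCAPE_RE.sub(_replace, text)
    # Round-trip through UTF-16 so that surrogate pairs produced by
    # consecutive %uXXXX escapes are merged into a single code point.
    # 'surrogatepass' lets any lone surrogate through instead of raising.
    return decoded.encode('utf-16', 'surrogatepass').decode('utf-16', 'surrogatepass')


print(js_unescape('%u003c%u0062%u0072%u003e'))
```

Sequences that do not match either pattern, such as a stray % sign, are left as they are.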
Source: https://stackoverflow.com/questions/23158822/javascript-unescape-vs-python-urllib-unquote