Question:
I have a list of URLs that contain escaped characters. The escapes were introduced by urllib2.urlopen
when it retrieved the HTML page:
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh
Is there a way to transform them back to their unescaped form in Python?
P.S.: The URLs are encoded in UTF-8.
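Answer 1:
From the official docs:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
Then decode the result. In Python 2, if url is a byte string, unquote returns a byte string, so something like urllib.unquote(url).decode('utf8') gives you the Unicode text.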
Update: For Python 3, write the following:
urllib.parse.unquote(url)
Python 3 docs.
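As a minimal sketch of the Python 3 route, the snippet below applies urllib.parse.unquote to the three URLs from the question. The variable names urls and decoded are only illustrative. In Python 3, unquote decodes the percent-escaped bytes as UTF-8 by default, so no separate decode step should be needed.

```python
from urllib.parse import unquote

# The escaped URLs from the question.
urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# unquote() treats the %XX escapes as UTF-8 bytes unless told otherwise.
decoded = [unquote(u) for u in urls]

for u in decoded:
    print(u)
```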
Answer 2:
If you are using Python 3, you could use:
urllib.parse.unquote(url)
Answer 3:
Or use urllib.unquote_plus, which also turns plus signs into spaces:
>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
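For Python 3 the same comparison can be sketched with the functions in urllib.parse. The names query, plain and with_spaces are only illustrative.

```python
from urllib.parse import unquote, unquote_plus

query = "erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29"

# unquote() only resolves %XX escapes and leaves '+' untouched.
plain = unquote(query)

# unquote_plus() additionally converts '+' to a space, as used in form data.
with_spaces = unquote_plus(query)

print(plain)
print(with_spaces)
```

unquote_plus is usually the better fit for query strings produced by HTML forms, while unquote is safer for paths where a literal plus sign may be meaningful.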
Answer 5:
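import re

def unquote(url):
    # Replace each %XX escape with the character whose code is XX (hex).
    # On Python 2 with a byte-string url the result is still UTF-8 bytes, so decode it afterwards.
    return re.sub(r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), url)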
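On Python 3, chr returns a Unicode code point rather than a byte, so a per-escape substitution like the one above would not reassemble multi-byte UTF-8 sequences such as %E9%A6%96. A possible variant, offered only as a sketch, matches whole runs of escapes and decodes each run as UTF-8. The name unquote_utf8 is hypothetical, and urllib.parse.unquote remains the simpler choice.

```python
import re

# One or more consecutive %XX escapes, so multi-byte sequences stay together.
_ESCAPE_RUN = re.compile(r"(?:%[0-9a-fA-F]{2})+")

def unquote_utf8(url):
    """Hypothetical helper: decode runs of %XX escapes as UTF-8 text."""
    def decode_run(match):
        # Strip the '%' signs and turn the hex digits into raw bytes.
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        # Invalid sequences become U+FFFD instead of raising an error.
        return raw.decode("utf-8", errors="replace")
    return _ESCAPE_RUN.sub(decode_run, url)

print(unquote_utf8("http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit"))
```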