Decode escaped characters in URL

匿名 (未验证) 提交于 2019-12-03 02:08:02

问题:

I have a list containing URLs with escaped characters in them. Those characters have been set by urllib2.urlopen when it recovers the html page:

http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh  

Is there a way to transform them back to their unescaped form in python?

P.S.: The URLs are encoded in utf-8

回答1:

Official docs.

urllib.unquote(string)

Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.

And then just decode.


Update: For Python 3, write the following:

urllib.parse.unquote(url) 

Python 3 docs.



回答2:

And if you are using Python3 you could use:

urllib.parse.unquote(url) 


回答3:

or urllib.unquote_plus

>>> import urllib >>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29') 'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)' >>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29') 'erythrocyte membrane protein 1, PfEMP1 (VAR)' 


回答4:

You can use urllib.unquote



回答5:

import re  def unquote(url):   return re.compile('%([0-9a-fA-F]{2})',re.M).sub(lambda m: chr(int(m.group(1),16)), url) 


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!