Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?

清歌不尽 2020-12-05 13:18

Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens:

In [5]: print urll         


        
4 answers
  •  被撕碎了的回忆
    2020-12-05 13:47

    Python's urllib.quote and urllib.unquote do not handle Unicode correctly

    urllib does not handle Unicode at all. URLs don't contain non-ASCII characters, by definition. When you're dealing with urllib you should use only byte strings. If you want those to represent Unicode characters you will have to encode and decode them manually.

    IRIs can contain non-ASCII characters, encoding them as UTF-8 sequences, but Python doesn't, at this point, have an irilib.

    Encoding the value to UTF8 also does not work:

    In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
    Cataño
    

    Ah, well now you're typing Unicode into a console, and doing print-Unicode to the console. This is generally unreliable, especially in Windows and in your case with the IPython console.

    Type it out the long way with backslash sequences and you can more easily see that the urllib bit does actually work:

    >>> u'Cata\u00F1o'.encode('utf-8')
    'Cata\xc3\xb1o'
    >>> urllib.quote(_)
    'Cata%C3%B1o'
    
    >>> urllib.unquote(_)
    'Cata\xc3\xb1o'
    >>> _.decode('utf-8')
    u'Cata\xf1o'
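
The round trip above can be wrapped in two small helpers so the manual encode and decode steps are not repeated at every call site. This is only a sketch: the names quote_unicode and unquote_unicode are made up for illustration and are not part of urllib, and UTF-8 is assumed as the default encoding. The import fallback is there so the same code also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: quote_unicode and unquote_unicode are hypothetical helper names.
try:
    from urllib import quote, unquote  # Python 2
except ImportError:
    # Python 3: unquote_to_bytes returns bytes, like Python 2's unquote on a str.
    from urllib.parse import quote, unquote_to_bytes as unquote


def quote_unicode(text, encoding='utf-8'):
    """Encode a Unicode string to bytes, then percent-encode those bytes."""
    return quote(text.encode(encoding))


def unquote_unicode(quoted, encoding='utf-8'):
    """Percent-decode to bytes, then decode those bytes to a Unicode string."""
    if not isinstance(quoted, bytes):
        # Percent-encoded text is pure ASCII, so this conversion is lossless.
        quoted = quoted.encode('ascii')
    return unquote(quoted).decode(encoding)
```

With these, unquote_unicode(quote_unicode(u'Cata\u00f1o')) gives back u'Cata\u00f1o'. Comparing values or inspecting them with repr avoids depending on how the console renders non-ASCII output.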
    
