How to unquote a URL-encoded Unicode string in Python?

时光说笑 2020-11-29 00:23

I have a Unicode string like "Tanım" which is somehow encoded as "Tan%u0131m". How can I convert this encoded string back to the original Unicode? Apparently urllib.unquote does not handle this.

5 answers
  •  臣服心动
    2020-11-29 00:42

    You have a URL using a non-standard encoding scheme, rejected by standards bodies but still being produced by some encoders. The Python urllib.parse.unquote() function can't handle these.
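
To see the limitation concretely, here is a small check, assuming Python 3. It compares a standard UTF-8 percent-encoding of the same word with the non-standard %uhhhh form; the variable names are only for illustration.

```python
from urllib.parse import unquote

# Standard %hh escapes are decoded as UTF-8 bytes.
standard = unquote('Tan%C4%B1m')

# The non-standard %uhhhh escape is not recognised and passes through unchanged.
nonstandard = unquote('Tan%u0131m')

print(standard)      # Tanım
print(nonstandard)   # Tan%u0131m
```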

    Creating your own decoder is not that hard, luckily. %uhhhh entries are meant to be UTF-16 code units here, so we need to take surrogate pairs into account: a code point outside the Basic Multilingual Plane arrives as two consecutive %uhhhh entries. I've also seen ordinary %hh escapes mixed in, for added confusion.
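
As a rough illustration of why surrogate pairs matter, the sketch below combines a high and a low surrogate into the single code point they represent, using the standard UTF-16 arithmetic. The helper name combine_surrogates is hypothetical and is not part of the decoder in this answer.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# %ud83c%udfd6 names the code units 0xD83C and 0xDFD6.
code_point = combine_surrogates(0xD83C, 0xDFD6)
print(hex(code_point))   # 0x1f3d6
```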

    With that in mind, here is a decoder which works in both Python 2 and Python 3, provided you pass in a str object in Python 3 (Python 2 cares less):

    try:
        # Python 3
        from urllib.parse import unquote
        unichr = chr
    except ImportError:
        # Python 2
        from urllib import unquote
    
    def unquote_unicode(string, _cache={}):
        string = unquote(string)  # handle two-digit %hh components first
        parts = string.split(u'%u')
        if len(parts) == 1:
            return string
        r = [parts[0]]
        append = r.append
        for part in parts[1:]:
            try:
                digits = part[:4].lower()
                if len(digits) < 4:
                    raise ValueError
                ch = _cache.get(digits)
                if ch is None:
                    ch = _cache[digits] = unichr(int(digits, 16))
                if (
                    len(r) > 1 and  # a leading low surrogate has no predecessor
                    not r[-1] and
                    u'\uDC00' <= ch <= u'\uDFFF' and
                    u'\uD800' <= r[-2] <= u'\uDBFF'
                ):
                    # UTF-16 surrogate pair, replace with single non-BMP codepoint
                    r[-2] = (r[-2] + ch).encode(
                        'utf-16', 'surrogatepass').decode('utf-16')
                else:
                    append(ch)
                append(part[4:])
            except ValueError:
                append(u'%u')
                append(part)
        return u''.join(r)
    

    The function is heavily inspired by the current standard-library implementation.

    Demo:

    >>> print(unquote_unicode('Tan%u0131m'))
    Tanım
    >>> print(unquote_unicode('%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'))
    איך ממירים את הטקסט הזה
    >>> print(unquote_unicode('%ud83c%udfd6'))  # surrogate pair
    🏖
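
If only Python 3 matters, a shorter variant is possible. The following is a sketch of a different technique from the split-based decoder above, not a drop-in replacement: it substitutes every %uhhhh escape with a regular expression and then round-trips the result through UTF-16 so that adjacent surrogate halves merge. The name unquote_unicode_re is hypothetical, and the sketch assumes Python 3.4 or later, where the surrogatepass error handler works with the UTF-16 codecs.

```python
import re
from urllib.parse import unquote

# Matches one %uhhhh escape and captures its four hex digits.
_PERCENT_U = re.compile(r'%u([0-9a-fA-F]{4})')

def unquote_unicode_re(text):
    # Decode standard %hh escapes first; %uhhhh escapes pass through untouched.
    text = unquote(text)
    # Replace each %uhhhh escape with the UTF-16 code unit it names.
    text = _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Round-trip through UTF-16 so that a high surrogate followed by a low
    # surrogate becomes a single non-BMP code point.
    return text.encode('utf-16-le', 'surrogatepass').decode(
        'utf-16-le', 'surrogatepass')

print(unquote_unicode_re('Tan%u0131m'))   # Tanım
```

Malformed escapes such as %u12 do not match the pattern and are left in place, which mirrors the fallback behaviour of the longer function.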
