urllib.urlencode doesn't like unicode values: how about this workaround?

予麋鹿 2020-12-12 18:59

If I have an object like:

d = {'a': 1, 'en': 'hello'}

...then I can pass it to urllib.urlencode, no problem:



        
8 answers
  • 2020-12-12 19:57

    You should indeed be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in unicode, encode at output time.
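    A minimal sketch of that principle, using made-up sample bytes. The variable names are illustrative only, and nothing here is specific to urllib:

```python
# Hypothetical input: UTF-8 bytes as they might arrive from a file or socket.
incoming = b'caf\xc3\xa9'

# 1. Decode at input time.
text = incoming.decode('utf-8')

# 2. Work exclusively in unicode (here: a simple case conversion).
shouted = text.upper()

# 3. Encode at output time.
output_bytes = shouted.encode('utf-8')
```

    Only the first and last steps ever touch bytes; everything in between sees text.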

    Update in response to comment:

    You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode is not capable of properly preparing that byte string if there are unicode characters with ordinal >= 128 in your dict is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let's examine just what urlencode() does:

    >>> import urllib
    >>> tests = ['\x80', '\xe2\x82\xac', 1, '1', u'1', u'\x80', u'\u20ac']
    >>> for test in tests:
    ...     print repr(test), repr(urllib.urlencode({'a':test}))
    ...
    '\x80' 'a=%80'
    '\xe2\x82\xac' 'a=%E2%82%AC'
    1 'a=1'
    '1' 'a=1'
    u'1' 'a=1'
    u'\x80'
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\python27\lib\urllib.py", line 1282, in urlencode
        v = quote_plus(str(v))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)
    

    The last two tests demonstrate the problem with urlencode(). Now let's look at the str tests.

    If you insist on having a mixture, then you should at the very least ensure that the str objects are encoded in UTF-8.

    '\x80' is suspicious -- it is not the result of any_valid_unicode_string.encode('utf8').
    '\xe2\x82\xac' is OK; it's the result of u'\u20ac'.encode('utf8').
    '1' is OK -- all ASCII characters are OK on input to urlencode(), which will percent-encode characters such as '%' where necessary.
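
    The three cases above can be checked directly. This is a small sketch; is_valid_utf8 is a hypothetical helper name, not part of urllib:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b'\x80'))          # False: a lone continuation byte
print(is_valid_utf8(b'\xe2\x82\xac'))  # True: UTF-8 for U+20AC
print(is_valid_utf8(b'1'))             # True: plain ASCII
```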

    Here's a suggested converter function. Unlike yours, it doesn't mutate the input dict and then return it; it builds and returns a new dict. It raises an exception if a value is a str object that is not valid UTF-8. By the way, your concern about it not handling nested objects is a little misdirected -- your code works only with dicts, and the concept of nested dicts doesn't really fly.

    def encoded_dict(in_dict):
        out_dict = {}
        for k, v in in_dict.iteritems():
            if isinstance(v, unicode):
                v = v.encode('utf8')
            elif isinstance(v, str):
                # Must be valid UTF-8; decode() raises UnicodeDecodeError if not
                v.decode('utf8')
            out_dict[k] = v
        return out_dict
    

    and here's the output, using the same tests in reverse order (because the nasty one is at the front this time):

    >>> for test in tests[::-1]:
    ...     print repr(test), repr(urllib.urlencode(encoded_dict({'a':test})))
    ...
    u'\u20ac' 'a=%E2%82%AC'
    u'\x80' 'a=%C2%80'
    u'1' 'a=1'
    '1' 'a=1'
    1 'a=1'
    '\xe2\x82\xac' 'a=%E2%82%AC'
    '\x80'
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "<stdin>", line 8, in encoded_dict
      File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
    >>>
    

    Does that help?

  • 2020-12-12 19:57

    I solved it with this add_get_to_url() method:

    import urllib
    
    def add_get_to_url(url, get):
        # Append the encoded query string to the URL.
        return '%s?%s' % (url, urllib.urlencode(list(encode_dict_to_bytes(get))))
    
    def encode_dict_to_bytes(query):
        # Accept either a dict or an iterable of (key, value) pairs.
        if hasattr(query, 'items'):
            query = query.items()
        for key, value in query:
            yield (encode_value_to_bytes(key), encode_value_to_bytes(value))
    
    def encode_value_to_bytes(value):
        # unicode is encoded as UTF-8; anything else goes through str().
        if not isinstance(value, unicode):
            return str(value)
        return value.encode('utf8')
    

    Features:

    • "get" can be a dict or a list of (key, value) pairs.
    • Order is not lost (pass a list of pairs if order matters).
    • Values can be integers or other simple datatypes.
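
    To see those features in action, here is a self-contained sketch that restates the same functions so that it runs on both Python 2 and Python 3. The text_type alias, the list-returning encode_dict_to_bytes, the example.com URLs and the sample data are assumptions made for illustration:

```python
try:                                    # Python 2
    from urllib import urlencode
    text_type = unicode
except ImportError:                     # Python 3
    from urllib.parse import urlencode
    text_type = str

def encode_value_to_bytes(value):
    # Byte strings pass through; everything else becomes text, then UTF-8.
    if isinstance(value, bytes):
        return value
    if not isinstance(value, text_type):
        value = text_type(value)
    return value.encode('utf8')

def encode_dict_to_bytes(query):
    # Accept either a dict or an iterable of (key, value) pairs.
    if hasattr(query, 'items'):
        query = query.items()
    return [(encode_value_to_bytes(k), encode_value_to_bytes(v))
            for k, v in query]

def add_get_to_url(url, get):
    return '%s?%s' % (url, urlencode(encode_dict_to_bytes(get)))

# A list of pairs keeps its order; the integer value is handled too.
print(add_get_to_url('http://example.com/search',
                     [(u'q', u'\u20ac'), (u'page', 2)]))
# http://example.com/search?q=%E2%82%AC&page=2

# A dict works as well.
print(add_get_to_url('http://example.com/', {u'en': u'hello'}))
# http://example.com/?en=hello
```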

    Feedback welcome.
