If I have an object like:
d = {'a': 1, 'en': 'hello'}
...then I can pass it to urllib.urlencode, no problem:
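For context, a minimal Python 2 sketch of the working call, plus a dict with a unicode value (my assumption of the failing case the answers below address):
import urllib

d = {'a': 1, 'en': 'hello'}
print urllib.urlencode(d)    # e.g. 'a=1&en=hello' (key order may vary)

d2 = {'a': 1, 'en': 'hello', 'pt': u'ol\xe1'}
urllib.urlencode(d2)         # raises UnicodeEncodeError: str() on the unicode value uses ASCII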
Nothing new to add except to point out that the urlencode algorithm is nothing tricky. Rather than processing your data once and then calling urlencode on it, it would be perfectly fine to do something like:
from urllib import quote_plus

def urlencode_utf8(params):
    # Accept either a dict or an iterable of (key, value) pairs
    if hasattr(params, 'items'):
        params = params.items()
    return '&'.join(
        quote_plus(k.encode('utf8'), safe='/') + '=' + quote_plus(v.encode('utf8'), safe='/')
        for k, v in params)
Looking at the source code for the urllib module (Python 2.6), their implementation does not do much more. There is an optional feature where values in the parameters that are themselves 2-tuples are turned into separate key-value pairs, which is sometimes useful, but if you know you won't need that, the above will do.
You can even get rid of the if hasattr(params, 'items'): check if you know you won't need to handle lists of 2-tuples as well as dicts.
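A quick usage sketch (the sample values are my own, and both keys and values are assumed to be strings):
print urlencode_utf8({u'pt': u'ol\xe1', u'en': u'hello'})
# e.g. 'pt=ol%C3%A1&en=hello' (key order may vary)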
This one line works fine in my case:
urllib.quote(unicode_string.encode('utf-8'))
Thanks @IanCleland and @PavelVlasov.
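For example (the sample string is my own), assuming the standard Python 2 urllib:
import urllib
print urllib.quote(u'ol\xe1'.encode('utf-8'))   # 'ol%C3%A1'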
Seems like it is a wider topic than it looks, especially when you have to deal with more complex dictionary values. I found three ways of solving the problem:
1. Patch urllib.py to accept an encoding parameter:
def urlencode(query, doseq=0, encoding='ascii'):
and replace all str(v) conversions with something like v.encode(encoding). Obviously not good, since it is hardly redistributable and even harder to maintain.
2. Change the default Python encoding as described here. The author of that blog post describes some problems with this solution pretty clearly, and who knows how many more could be lurking in the shadows. So it doesn't look good to me either.
3. I personally ended up with this abomination, which encodes all unicode strings to UTF-8 byte strings in any (reasonably) complex structure:
def encode_obj(in_obj):
    # Recursively encode unicode strings to UTF-8 inside lists, tuples and dicts
    def encode_list(in_list):
        out_list = []
        for el in in_list:
            out_list.append(encode_obj(el))
        return out_list

    def encode_dict(in_dict):
        out_dict = {}
        for k, v in in_dict.iteritems():
            out_dict[k] = encode_obj(v)
        return out_dict

    if isinstance(in_obj, unicode):
        return in_obj.encode('utf-8')
    elif isinstance(in_obj, list):
        return encode_list(in_obj)
    elif isinstance(in_obj, tuple):
        return tuple(encode_list(in_obj))
    elif isinstance(in_obj, dict):
        return encode_dict(in_obj)
    return in_obj
You can use it like this: urllib.urlencode(encode_obj(complex_dictionary))
To encode the keys as well, out_dict[k] can be replaced with out_dict[k.encode('utf-8')], but that was a bit too much for me.
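A quick sketch of how that plays out, with sample data of my own (doseq=1 so the list value becomes repeated key=value pairs):
import urllib

complex_dictionary = {'name': u'caf\xe9', 'tags': [u'ol\xe1', u'tsch\xfcss']}
print urllib.urlencode(encode_obj(complex_dictionary), doseq=1)
# e.g. 'name=caf%C3%A9&tags=ol%C3%A1&tags=tsch%C3%BCss' (key order may vary)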
It seems that you can't pass a Unicode object to urlencode, so before calling it you should encode every unicode parameter. How to do this properly seems very dependent on the context, but in your code you should always be aware of when to use the unicode Python object (the unicode representation) and when to use the encoded object (the bytestring).
Also, encoding values that are already str is superfluous: What is the difference between encode/decode?
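A minimal sketch of that idea, assuming a flat dict and Python 2 (the helper name encode_params is my own):
import urllib

def encode_params(params):
    # Encode only unicode keys/values; leave ints and existing bytestrings alone
    def enc(x):
        return x.encode('utf-8') if isinstance(x, unicode) else x
    return dict((enc(k), enc(v)) for k, v in params.items())

print urllib.urlencode(encode_params({'a': 1, 'en': 'hello', 'pt': u'ol\xe1'}))
# e.g. 'a=1&en=hello&pt=ol%C3%A1' (key order may vary)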
Why such long answers? Simply encode the unicode values to UTF-8 before they go into urlencode:
urlencode({'pt': unicode_string.encode('utf-8')})
I had the same problem with German umlauts. The solution is pretty simple:
In Python 3+, urlencode lets you specify the encoding:
>>> from urllib.parse import urlencode
>>> args = {'a': 1, 'en': 'hello', 'pt': 'olá'}
>>> urlencode(args, encoding='utf-8')
'a=1&en=hello&pt=ol%C3%A1'