Unicode String in urllib.request [duplicate]

Question

The short version: I have a variable s = 'bär'. I need to convert s to ASCII so that s = 'b%C3%A4r'.

Long version:

I'm using urllib.request.urlopen() to read an mp3 pronunciation file from a URL. This has worked very well, except that the URLs often contain non-ASCII (Unicode) characters, as in the German word "Bär". The full URL is https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär. Typing this into Chrome as a URL works and takes me to the mp3 file without problems. However, passing the same URL to urllib raises an error.

I determined this was a unicode problem because the stack-trace reads:

Traceback (most recent call last):
  File "importer.py", line 145, in <module>
    download_file(tuple[1], tuple[0], ".mp3")
  File "importer.py", line 81, in download_file
    with urllib.request.urlopen(url) as in_stream, open(to_fname+ext, 'wb') as out_file: #`with object as name:` safely __enter__() and __exit__() the runtime of object. `as` assigns `name` as referring to the object `object`.
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1283, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 19: ordinal not in range(128)

Beyond the obvious UnicodeEncodeError, I can see that it is trying to encode() the request to ASCII.

Interestingly, when I copied the URL from Chrome (instead of simply typing it into the Python interpreter), it translated the bär to b%C3%A4r. When I feed this to urllib.request.urlopen(), it is processed without error, because all of these characters are ASCII. So my goal is to make this conversion within my program. I tried to convert my original string to this form, but unicodedata.normalize() in all of its variants didn't work. Further, I'm not sure how to store the Unicode text as ASCII, given that Python 3 stores all strings as Unicode and thus makes no attempt to convert the text.

Answer 1:

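Use urllib.parse.quote, which percent-encodes the string (non-ASCII characters are encoded as UTF-8 first):

>>> import urllib.parse
>>> urllib.parse.quote('bär')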
'b%C3%A4r'
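
To see where the %C3%A4 comes from, here is a rough sketch that reproduces the result by hand for this particular word. The names word, utf8_bytes and manual are only illustrative, and the manual escaping is a simplification: the real quote() also escapes reserved ASCII characters such as spaces, which this sketch does not.

```python
from urllib.parse import quote, unquote

word = "bär"

# Step 1: encode the text as UTF-8. The 'ä' becomes the two bytes 0xC3 0xA4.
utf8_bytes = word.encode("utf-8")

# Step 2: write every non-ASCII byte as %XX and keep ASCII bytes as they are.
# (Simplified: quote() additionally escapes reserved ASCII characters.)
manual = "".join(
    chr(b) if b < 0x80 else "%{:02X}".format(b)
    for b in utf8_bytes
)

print(manual)           # same text that quote(word) produces for this word
print(unquote(manual))  # decoding the escapes gives the original word back
```

So the conversion is not a Unicode normalization at all, which would explain why unicodedata.normalize() did not help: it is a UTF-8 encode followed by percent-escaping.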

>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
...                      urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
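
If you already have the complete URL as one string rather than a base plus a word, one possible approach is to split it, quote only the path and query, and put it back together. This is a minimal sketch, not the only way to do it: ascii_url is a hypothetical helper name, the choice of safe characters is an assumption, and it does not handle non-ASCII host names (those would need IDNA encoding).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_url(url):
    """Return url with the path and query percent-encoded (hypothetical helper).

    '%' is treated as safe so that a URL which is already encoded is not
    encoded a second time. The host name is left untouched.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = ascii_url("https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär")
print(url)
```

The resulting string contains only ASCII characters, so it can be passed to urllib.request.urlopen(url) as in the question.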

Source: https://stackoverflow.com/questions/36395705/unicode-string-in-urllib-request

Tags

python

python-3.x

unicode

encoding