Question
The short version: I have a variable s = 'bär'. I need to convert s to ASCII so that s = 'b%C3%A4r'.
Long version:
I'm using urllib.request.urlopen() to read an mp3 pronunciation file from a URL. This has worked very well, except that I ran into a problem because the URLs often contain non-ASCII characters, for example the German "Bär". The full URL is https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär. Typing this into Chrome as a URL works and takes me to the mp3 file without problems. However, feeding this same URL to urllib raises an error.
I determined this was a unicode problem because the stack-trace reads:
Traceback (most recent call last):
  File "importer.py", line 145, in <module>
    download_file(tuple[1], tuple[0], ".mp3")
  File "importer.py", line 81, in download_file
    with urllib.request.urlopen(url) as in_stream, open(to_fname+ext, 'wb') as out_file: #`with object as name:` safely __enter__() and __exit__() the runtime of object. `as` assigns `name` as referring to the object `object`.
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1283, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 19: ordinal not in range(128)
... and other than the obvious UnicodeEncodeError, I can see it's trying to encode() to ASCII.
Interestingly, when I copied the URL from Chrome (instead of simply typing it into the Python interpreter), it translated the bär to b%C3%A4r. When I feed this to urllib.request.urlopen(), it processes fine, because all of these characters are ASCII. So my goal is to make this conversion within my program. I tried to get my original string to the unicode equivalent, but unicodedata.normalize() in all of its variants didn't work; further, I'm not sure how to store the Unicode as ASCII, given that Python 3 stores all strings as Unicode and thus makes no attempt to convert the text.
Answer 1:
Use urllib.parse.quote:
>>> import urllib.parse
>>> urllib.parse.quote('bär')
'b%C3%A4r'
>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
... urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
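If you already have the full URL as one string rather than a base plus a word, one possible approach is to split it, quote only the path, and reassemble it. The sketch below is an illustration, not part of the original answer: ascii_url is a hypothetical helper name, and it assumes the host name is already ASCII and that only the path contains non-ASCII characters.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_url(url):
    """Percent-encode the path of url as UTF-8 (hypothetical helper).

    Assumes the scheme and host are already ASCII; the query string and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # '/' is kept so the path separators survive; '%' is kept so a path
    # that is already percent-encoded is not encoded a second time.
    path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


url = ascii_url("https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär")
print(url)
# The result is plain ASCII and can be passed to urllib.request.urlopen(url).
```

Quoting only the path matters because calling quote() on the whole URL would also escape the ':' after the scheme. A non-ASCII host name would need separate handling (IDNA encoding), which this sketch does not attempt.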
Source: https://stackoverflow.com/questions/36395705/unicode-string-in-urllib-request