Why does Python's urllib2.urlopen() raise an HTTPError for successful status codes?

Question

According to the urllib2 documentation,

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

And yet the following code

request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)

raises an HTTPError with code 201 (created):

ERROR    2011-08-11 20:40:17,318 __init__.py:463] HTTP Error 201: Created

So why is urllib2 throwing HTTPErrors on this successful request?

It's not too much of a pain; I can easily extend the code to:

try:
    request = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    if e.code == 201:
        pass  # success! :)
    else:
        pass  # fail! :(
else:
    pass  # when will this happen...?

But this doesn't seem like the intended behavior, based on the documentation and the fact that I can't find similar questions about this odd behavior.

Also, what should the else block be expecting? If successful status codes are all interpreted as HTTPErrors, then when does urllib2.urlopen() just return a normal file-like response object like all the urllib2 documentation refers to?

Answer 1:

As the actual library documentation mentions:

For 200 error codes, the response object is returned immediately.

For non-200 error codes, this simply passes the job on to the protocol_error_code handler methods, via OpenerDirector.error(). Eventually, urllib2.HTTPDefaultErrorHandler will raise an HTTPError if no other handler handles the error.

http://docs.python.org/library/urllib2.html#httperrorprocessor-objects
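
The dispatch described in that passage can be modeled in a few lines. This is a simplified sketch, not urllib2's actual source: toy_process_response, ToyHTTPError, and the handlers dictionary are hypothetical stand-ins for HTTPErrorProcessor, HTTPError, and the lookup of http_error_<code> methods. It is written for Python 3 so that it runs as-is.

```python
class ToyHTTPError(Exception):
    """Hypothetical stand-in for urllib2.HTTPError."""

    def __init__(self, code):
        super().__init__("HTTP Error %d" % code)
        self.code = code


def toy_process_response(code, handlers=None):
    """Simplified model of the dispatch quoted from the documentation."""
    handlers = handlers or {}
    if code == 200:
        # "For 200 error codes, the response object is returned immediately."
        return "response"
    # Stand-in for the search for an http_error_<code> handler method.
    handler = handlers.get(code)
    if handler is not None:
        return handler(code)
    # Stand-in for HTTPDefaultErrorHandler: nothing handled the code, so raise.
    raise ToyHTTPError(code)
```

With no handler registered for 201 the model raises, which mirrors the behavior in the question; passing {201: some_function} makes the same call return normally.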

Answer 2:

You can write a custom handler class for use with urllib2 to prevent specific status codes from being raised as HTTPError. Here's one I've used before:

class BetterHTTPErrorProcessor(urllib2.BaseHandler):
    # a substitute/supplement to urllib2.HTTPErrorProcessor
    # that doesn't raise exceptions on status codes 201,204,206
    def http_error_201(self, request, response, code, msg, hdrs):
        return response
    def http_error_204(self, request, response, code, msg, hdrs):
        return response
    def http_error_206(self, request, response, code, msg, hdrs):
        return response

Then you can use it like:

opener = urllib2.build_opener(BetterHTTPErrorProcessor)
urllib2.install_opener(opener)

req = urllib2.Request(url, data, headers)
urllib2.urlopen(req)
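
For readers on Python 3, the same idea can be sketched against urllib.request. The class name PassThroughHandler and the fake response object are hypothetical, and the demonstration calls OpenerDirector.error() directly instead of contacting a server. Python 3's HTTPErrorProcessor already lets any status in the 200-299 range through, so there a handler like this only matters for codes outside that range; the three codes are kept purely to match the handler above.

```python
import urllib.request


class PassThroughHandler(urllib.request.BaseHandler):
    """Hypothetical Python 3 port of the handler shown above."""

    def _pass_through(self, request, response, code, msg, hdrs):
        # Returning the response marks the status as handled, so the
        # default error handler never gets a chance to raise HTTPError.
        return response

    # One http_error_<code> method per status that should not raise.
    http_error_201 = _pass_through
    http_error_204 = _pass_through
    http_error_206 = _pass_through


opener = urllib.request.build_opener(PassThroughHandler)

# Offline demonstration: feed a fake 201 into the error dispatch.
fake_response = object()  # stand-in for a real response; no network traffic
result = opener.error("http", None, fake_response, 201, "Created", {})
```

Here result is the same object that was passed in, which shows that the handler, not HTTPDefaultErrorHandler, answered the 201.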

Answer 3:

I personally think it was a mistake, and very unintuitive, for this to be the default behavior. It's true that non-2XX codes imply a protocol-level error, but turning that into an exception goes too far (in my opinion at least).

In any case, I think the most elegant way to avoid this is:

import urllib.request

opener = urllib.request.build_opener()
for processor in opener.process_response['https']:  # or 'http', depending on what you're using
    if isinstance(processor, urllib.request.HTTPErrorProcessor):  # HTTPErrorProcessor also handles https
        opener.process_response['https'].remove(processor)
        break  # there's only one such handler by default
response = opener.open('https://www.google.com')

Now you have the response object. You can check its status code, headers, body, etc.
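
A variation that avoids editing opener.process_response by hand is to subclass HTTPErrorProcessor and let build_opener() substitute it for the default. This is only a sketch under that assumption; NoRaiseProcessor and build_no_raise_opener are hypothetical names, not part of urllib.

```python
import urllib.request


class NoRaiseProcessor(urllib.request.HTTPErrorProcessor):
    """Hypothetical processor that never turns a status into HTTPError."""

    def http_response(self, request, response):
        # Hand every response back unchanged, whatever its status code.
        return response

    # Apply the same behavior to HTTPS responses.
    https_response = http_response


def build_no_raise_opener():
    # build_opener() leaves out a default handler when a subclass of it
    # is supplied, so this replaces the stock HTTPErrorProcessor.
    return urllib.request.build_opener(NoRaiseProcessor)
```

Calling build_no_raise_opener().open(url) then returns a response for any status, and the caller inspects response.status. One caveat that applies to both variants: the redirect handler is reached through the same error dispatch, so with the processor bypassed, 3xx responses are returned as-is and are no longer followed automatically.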

Source: https://stackoverflow.com/questions/7032890/why-does-pythons-urllib2-urlopen-raise-an-httperror-for-successful-status-cod

Tags

python

urllib2

http-status-codes