python get headers only using urllib2

Question

I have to implement a function to get headers only (without doing a GET or POST) using urllib2. Here is my function:

def getheadersonly(url, redirections = True):
    if not redirections:
        class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
            def http_error_302(self, req, fp, code, msg, headers):
                return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
            http_error_301 = http_error_303 = http_error_307 = http_error_302
        cookieprocessor = urllib2.HTTPCookieProcessor()
        opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
        urllib2.install_opener(opener)

    class HeadRequest(urllib2.Request):
        def get_method(self):
            return "HEAD"

    info = {}
    info['headers'] = dict(urllib2.urlopen(HeadRequest(url)).info()) 
    info['finalurl'] = urllib2.urlopen(HeadRequest(url)).geturl() 
    return info

This uses code from two other answers. However, it follows redirects even when the flag is False. I tried the code with:

print getheadersonly("http://ms.com", redirections = False)['finalurl']
print getheadersonly("http://ms.com")['finalurl']

It's giving morganstanley.com in both cases. What is wrong here?

Answer 1:

Firstly, your code contains several bugs:

1. On each call to getheadersonly with redirections = False you install a new global opener, which is then used by every subsequent call to urllib2.urlopen.
2. You make two HTTP requests to read two different attributes of what should be a single response.
3. The implementation of urllib2.HTTPRedirectHandler.http_error_302 is not trivial, and your override simply delegates to it, so I do not see how it could prevent redirections in the first place.

Basically, you should understand that each handler is installed in an opener to handle a certain kind of response. urllib2.HTTPRedirectHandler is there to turn certain HTTP status codes into redirections. If you do not want redirections, do not add a redirection handler to the opener. If you do not want to open FTP links, do not add FTPHandler, and so on.
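To make the point about handlers concrete, here is a minimal sketch that lists which handlers an opener actually contains. The alias urlreq and the helper handler_names are names made up for this illustration, and the handlers attribute it reads is an undocumented CPython detail, so treat it as a debugging aid rather than a supported API.

```python
try:
    import urllib2 as urlreq            # Python 2, as used in this thread
except ImportError:
    import urllib.request as urlreq     # Python 3 home of the same classes

def handler_names(opener):
    # `handlers` is an undocumented attribute of OpenerDirector in CPython;
    # it is read here for illustration only.
    return sorted(h.__class__.__name__ for h in opener.handlers)

# build_opener() pre-installs a default set of handlers,
# including HTTPRedirectHandler.
default_opener = urlreq.build_opener()

# A bare OpenerDirector only contains what is added explicitly.
bare_opener = urlreq.OpenerDirector()
bare_opener.add_handler(urlreq.HTTPHandler())

print(handler_names(default_opener))
print(handler_names(bare_opener))
```

The first list should include HTTPRedirectHandler and the second should not, which is why an opener assembled by hand does not follow redirects unless that handler is added.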

So all you need to do is create a new opener, add urllib2.HTTPHandler() to it, customize the request to be a 'HEAD' request, pass an instance of that request to the opener, read the attributes, and close the response.

import urllib2

class HeadRequest(urllib2.Request):
    # Make urllib2 issue a HEAD request instead of the default GET.
    def get_method(self):
        return 'HEAD'

def getheadersonly(url, redirections=True):
    # Build a private opener instead of installing a global one.
    opener = urllib2.OpenerDirector()
    opener.add_handler(urllib2.HTTPHandler())
    opener.add_handler(urllib2.HTTPDefaultErrorHandler())
    if redirections:
        # HTTPErrorProcessor makes HTTPRedirectHandler work
        opener.add_handler(urllib2.HTTPErrorProcessor())
        opener.add_handler(urllib2.HTTPRedirectHandler())
    try:
        res = opener.open(HeadRequest(url))
    except urllib2.HTTPError as res:
        # An HTTPError is itself a response object, so its headers can be read too.
        pass
    res.close()
    return dict(code=res.code, headers=res.info(), finalurl=res.geturl())

Answer 2:

You can send a HEAD request using httplib. A HEAD request is the same as a GET request, except that the server doesn't send the message body.
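The answer gives no code, so the following is only a rough sketch of what a HEAD request through httplib might look like. The function name head_request and its parameters are invented for this example; the import fallback is there because the module is called http.client on Python 3. Note that httplib does not follow redirects on its own, so a 301 or 302 comes back as-is.

```python
try:
    import httplib                      # Python 2, as named in the answer
except ImportError:
    import http.client as httplib       # Python 3 name for the same module

def head_request(host, path='/', port=80, timeout=10):
    """Send one HEAD request and return (status, headers, body)."""
    conn = httplib.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request('HEAD', path)
        res = conn.getresponse()
        # For a HEAD response there is no message body to read.
        body = res.read()
        # Header names are lower-cased so lookups do not depend on casing.
        headers = dict((k.lower(), v) for k, v in res.getheaders())
        return res.status, headers, body
    finally:
        conn.close()
```

A call such as head_request('example.com') would return the status code and a dict of headers without downloading the page; the host name here is just a placeholder.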

Source: https://stackoverflow.com/questions/9890815/python-get-headers-only-using-urllib2

Tags

python

urllib2