Question
I have to implement a function to get headers only (without doing a GET or POST) using urllib2. Here is my function:
import urllib2

def getheadersonly(url, redirections = True):
    if not redirections:
        class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
            def http_error_302(self, req, fp, code, msg, headers):
                return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
            http_error_301 = http_error_303 = http_error_307 = http_error_302
        cookieprocessor = urllib2.HTTPCookieProcessor()
        opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
        urllib2.install_opener(opener)

    class HeadRequest(urllib2.Request):
        def get_method(self):
            return "HEAD"

    info = {}
    info['headers'] = dict(urllib2.urlopen(HeadRequest(url)).info())
    info['finalurl'] = urllib2.urlopen(HeadRequest(url)).geturl()
    return info
It uses code from two other answers (this and this). However, it follows redirects even when the flag is False. I tried the code with:
print getheadersonly("http://ms.com", redirections = False)['finalurl']
print getheadersonly("http://ms.com")['finalurl']
It's giving morganstanley.com in both cases. What is wrong here?
Answer 1:
Firstly, your code contains several bugs:
- On each call of getheadersonly you install a new global opener, which is then used in subsequent calls of urllib2.urlopen.
- You make two HTTP requests to get two different attributes of a response.
- The implementation of urllib2.HTTPRedirectHandler.http_error_302 is not so trivial, and I do not understand how it could prevent redirections in the first place.
Basically, you should understand that each handler is installed in an opener to handle a certain kind of response. urllib2.HTTPRedirectHandler is there to turn certain HTTP status codes into redirections. If you do not want redirections, do not add a redirection handler to the opener. If you do not want to open FTP links, do not add FTPHandler, and so on.
So all you need to do is create a new opener, add urllib2.HTTPHandler() to it, customize the request to be a 'HEAD' request, pass an instance of that request to the opener, read the attributes, and close the response.
import urllib2

class HeadRequest(urllib2.Request):
    # Make urllib2 send HEAD instead of GET
    def get_method(self):
        return 'HEAD'

def getheadersonly(url, redirections=True):
    # Start from an empty opener so that only the handlers added here are used
    opener = urllib2.OpenerDirector()
    opener.add_handler(urllib2.HTTPHandler())
    opener.add_handler(urllib2.HTTPDefaultErrorHandler())
    if redirections:
        # HTTPErrorProcessor makes HTTPRedirectHandler work
        opener.add_handler(urllib2.HTTPErrorProcessor())
        opener.add_handler(urllib2.HTTPRedirectHandler())
    try:
        res = opener.open(HeadRequest(url))
    except urllib2.HTTPError as err:
        # HTTPError doubles as a response object, so it can be read the same way
        res = err
    res.close()
    return dict(code=res.code, headers=res.info(), finalurl=res.geturl())
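To check which handlers an opener actually carries, one can list them. The following is a minimal sketch and not part of the original answer: handler_names is a hypothetical helper, and it reads the opener's handlers list, which is a CPython implementation detail rather than a documented API. The import fallback is an assumption added only so the sketch also runs on Python 3, where the same classes live in urllib.request.

```python
try:
    import urllib2                        # Python 2
except ImportError:
    import urllib.request as urllib2      # Python 3 home of the same classes


def handler_names(opener):
    """Return the sorted class names of the handlers installed in an opener.

    Hypothetical helper; relies on the `handlers` list attribute of
    OpenerDirector, which is an implementation detail.
    """
    return sorted(type(h).__name__ for h in opener.handlers)


# build_opener() pre-installs a set of default handlers
default_opener = urllib2.build_opener()

# An empty OpenerDirector only contains what is added explicitly
bare_opener = urllib2.OpenerDirector()
bare_opener.add_handler(urllib2.HTTPHandler())
bare_opener.add_handler(urllib2.HTTPDefaultErrorHandler())

print(handler_names(default_opener))
print(handler_names(bare_opener))
```

The default opener's list includes HTTPRedirectHandler, while the bare one does not. Passing a subclass of HTTPRedirectHandler to build_opener, as the question does, replaces the default handler with that subclass instead of removing it, so redirects are still followed as long as the subclass delegates to the parent method.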
Answer 2:
You can send a HEAD request using httplib. A HEAD request is the same as a GET request, except that the server doesn't send the message body.
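The answer gives no code, so here is a rough sketch of what that could look like. The function name head_with_httplib and the timeout default are assumptions made for illustration, and the import fallback is only there so the sketch also runs on Python 3, where httplib is called http.client. httplib does not follow redirects by itself, so a 301 or 302 comes back as-is together with its Location header.

```python
try:
    import httplib                        # Python 2
    from urlparse import urlsplit
except ImportError:
    import http.client as httplib         # Python 3 name of the same module
    from urllib.parse import urlsplit


def head_with_httplib(url, timeout=10):
    """Send a single HEAD request and return (status, headers).

    Hypothetical helper: redirects are not followed, the caller sees
    the first response only.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = httplib.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = httplib.HTTPConnection(parts.hostname, parts.port, timeout=timeout)

    # Rebuild the request target from path and query string
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        # getheaders() returns a list of (name, value) pairs
        return res.status, dict(res.getheaders())
    finally:
        conn.close()
```

A call such as head_with_httplib("http://ms.com") would then return the status code and headers of the first response. The casing of the header names differs between Python versions, so it is safer to compare them case-insensitively.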
Source: https://stackoverflow.com/questions/9890815/python-get-headers-only-using-urllib2