I have a simple website crawler, it works fine, but sometime it stuck because of large content such as ISO images, .exe files and other large stuff. Guessing content-type using
Yes.
You can use the Session.head
method to create HEAD
requests:
response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']
A HEAD
request similar to GET
request, except that the message body would not be sent.
Here is a quote from Wikipedia:
HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.