I have a simple website crawler. It works fine, but sometimes it gets stuck on large content such as ISO images, .exe files, and other large files. Guessing content-type using
Sorry, my mistake, I should have read the documentation more carefully. Here is the answer: http://docs.python-requests.org/en/latest/user/advanced/#advanced (Body Content Workflow)
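import requests

TOO_LONG = 10 * 1024 * 1024  # size limit in bytes (example value, pick your own)
tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)  # stream=True defers downloading the body
if int(r.headers.get('content-length', 0)) > TOO_LONG:
    r.close()  # release the connection without downloading the body
    # log request too long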
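One caveat with the approach above: a server may omit Content-Length (for example with chunked responses), so the header check alone may not catch every large download. Here is a rough sketch that also counts bytes while streaming. The function name `fetch_capped` and its parameters are my own, not part of requests, and `session` is assumed to be a `requests.Session()` (or anything with a compatible `get`).

```python
def fetch_capped(session, url, max_bytes, timeout=10):
    """Return the body as bytes, or None if it is larger than max_bytes.

    `session` is assumed to behave like requests.Session (hypothetical helper).
    """
    with session.get(url, stream=True, timeout=timeout) as r:
        # First check: trust the declared size if the server sent one.
        declared = r.headers.get('Content-Length')
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            return None
        # Second check: count bytes as they arrive, in case the header
        # is missing or wrong.
        chunks, total = [], 0
        for chunk in r.iter_content(chunk_size=8192):
            total += len(chunk)
            if total > max_bytes:
                return None
            chunks.append(chunk)
        return b''.join(chunks)
```

You would call it as `fetch_capped(requests.Session(), url, 10 * 1024 * 1024)`. Leaving the `with` block closes the response, so an oversized body is abandoned rather than read to the end.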
Use requests.head() for this. It will not return the message body. You should use the HEAD method if you are interested only in the headers. Check this link for details.
h = requests.head(some_link)
header = h.headers
content_type = header.get('content-type')
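Note that the Content-Type value often carries parameters, e.g. `text/html; charset=utf-8`, so comparing it directly to `'text/html'` may fail. A small sketch of how a crawler might normalize it; the helper names and the default list of wanted types are my own assumptions, not something requests provides.

```python
def media_type(headers):
    """Return the bare media type, e.g. 'text/html', or '' if the header is absent."""
    value = headers.get('content-type', '')
    # Drop parameters such as '; charset=utf-8' and normalize case.
    return value.split(';', 1)[0].strip().lower()


def looks_like_page(headers, wanted=('text/html', 'application/xhtml+xml')):
    """True if the response appears to be an HTML page worth crawling."""
    return media_type(headers) in wanted
```

With requests you would pass `h.headers`, which is case-insensitive; with a plain dict the key has to be lowercase `'content-type'` for this sketch to find it.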
requests.head() does NOT follow redirects by default, so if a URL is redirected, you get the headers of the redirect response itself and Content-Length will be 0. Make sure allow_redirects=True is added.
r = requests.head(url, allow_redirects=True)
length = r.headers['Content-Length']
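Keep in mind that `r.headers['Content-Length']` is a string and raises KeyError when the server does not send the header. A defensive sketch, assuming `session` is a `requests.Session()`; the function names `content_length` and `head_size` are hypothetical.

```python
def content_length(headers):
    """Return the declared size in bytes, or None if missing or malformed."""
    raw = headers.get('Content-Length')
    if raw is None:
        return None
    try:
        size = int(raw)
    except ValueError:
        return None
    return size if size >= 0 else None


def head_size(session, url, timeout=10):
    """HEAD the URL, following redirects, and return the declared size (or None)."""
    r = session.head(url, allow_redirects=True, timeout=timeout)
    return content_length(r.headers)
```

A result of `None` means the size is unknown, not that the file is small, so the caller still has to decide what to do in that case.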
Refer to Requests Redirection And History
Yes.
You can use the Session.head method to create HEAD requests:
response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']
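Some servers do not implement HEAD and answer with 405 or 501. In that case one possible fallback is a streamed GET that is closed before the body is read. This is only a sketch: `probe_headers` is a made-up name, and `session` is assumed to be a `requests.Session()`.

```python
def probe_headers(session, url, timeout=10):
    """Return the response headers for url without downloading the body."""
    r = session.head(url, allow_redirects=True, timeout=timeout)
    if r.status_code in (405, 501):  # server does not support HEAD
        # With stream=True only the headers are fetched up front.
        r = session.get(url, stream=True, timeout=timeout)
        r.close()  # headers stay available; the body is never read
    return r.headers
```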
A HEAD request is similar to a GET request, except that the message body is not sent.
Here is a quote from Wikipedia:
HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.