Requests - get content-type/size without fetching the whole page/content

╄→гoц情女王★ 提交于 2019-12-21 04:50:32

问题


I have a simple website crawler, it works fine, but sometime it stuck because of large content such as ISO images, .exe files and other large stuff. Guessing content-type using file extension is probably not the best idea.

Is it possible to get content-type and content length/size without fetching the whole content/page?

Here is my code:

requests.adapters.DEFAULT_RETRIES = 2
url = url.decode('utf8', 'ignore')
urlData = urlparse.urlparse(url)
urlDomain = urlData.netloc
session = requests.Session()
customHeaders = {}
if maxRedirects == None:
    session.max_redirects = self.maxRedirects
else:
    session.max_redirects = maxRedirects
self.currentUserAgent = self.userAgents[random.randrange(len(self.userAgents))]
customHeaders['User-agent'] = self.currentUserAgent
try:
    response = session.get(url, timeout=self.pageOpenTimeout, headers=customHeaders)
    currentUrl = response.url
    currentUrlData = urlparse.urlparse(currentUrl)
    currentUrlDomain = currentUrlData.netloc
    domainWWW = 'www.' + str(urlDomain)
    headers = response.headers
    contentType = str(headers['content-type'])
except:
    logging.basicConfig(level=logging.DEBUG, filename=self.exceptionsFile)
    logging.exception("Get page exception:")
    response = None

回答1:


Yes.

You can use the Session.head method to create HEAD requests:

response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']

A HEAD request similar to GET request, except that the message body would not be sent.

Here is a quote from Wikipedia:

HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.




回答2:


Use requests.head() for this. It will not return the message body. You should use head method if you are interested only in the headers. Check this link for detail.

h = requests.head(some_link)
header = h.headers
content_type = header.get('content-type')



回答3:


Sorry, my mistake, I should read documentation better. Here is the answer: http://docs.python-requests.org/en/latest/user/advanced/#advanced (Body Content Workflow)

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)
if int(r.headers['content-length']) > TOO_LONG:
  r.connection.close()
  # log request too long



回答4:


Because requests.head() does NOT auto redirect, so a URL is redirected, requests.head() will get 0 for Content-Length. So make sure allow_redirects=True is added.

r = requests.head(url, allow_redirects=True)
length = r.headers['Content-Length']

Refer to Requests Redirection And History



来源:https://stackoverflow.com/questions/23718424/requests-get-content-type-size-without-fetching-the-whole-page-content

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!