How to handle IncompleteRead in Python


Question:

I am trying to fetch some data from a website. However, it returns an incomplete read. The data I am trying to get is a large set of nested links. I did some research online and found that this might be due to a server error (a chunked transfer encoding finishing before reaching the expected size). I also found a workaround for this on this link.

However, I am not sure how to use it in my case. The following is the code I am working on:

import mechanize
import urllib2
import urlparse
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)

for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage

Please help me with this. Thanks.

Answer 1:

The link you included in your question is simply a wrapper that executes urllib's read() function and catches any incomplete-read exceptions for you. If you don't want to implement that entire patch, you can just wrap the read in a try/except block wherever you fetch your links. For example:

import httplib
import urllib2

try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead, e:
    page = e.partial

For Python 3:

import http.client
from urllib import request

try:
    page = request.urlopen(urls).read()
except http.client.IncompleteRead as e:
    page = e.partial
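As a self-contained illustration, here is a minimal Python 3 sketch of the same pattern applied to the URL from the question (the bs4 import and the "html.parser" argument are assumptions for illustration, not part of the original answer):

import http.client
import urllib.request
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

url = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"

try:
    page = urllib.request.urlopen(url).read()
except http.client.IncompleteRead as e:
    page = e.partial  # keep whatever the server managed to send

soup = BeautifulSoup(page, "html.parser")
print(len(page), "bytes read")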


Answer 2:

In my case, sending the request as HTTP/1.0 fixed the problem. Adding this:

import httplib

httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

then I make the request:

req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()

and afterwards I switch back to HTTP/1.1 (for connections that support 1.1):

httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'

The trick is to use HTTP/1.0 instead of the default HTTP/1.1. HTTP/1.1 can handle chunked transfer encoding, but for some reason the web server doesn't, so we make the request over HTTP/1.0.
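For Python 3, where httplib has become http.client, a sketch of the same workaround might look like this (note that _http_vsn and _http_vsn_str are undocumented internals, so treat this as a hack rather than a stable API):

import http.client
import urllib.request

url = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"

# Force HTTP/1.0 so the server does not reply with chunked transfer encoding.
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

page = urllib.request.urlopen(url).read()

# Restore the default HTTP/1.1 for later requests.
http.client.HTTPConnection._http_vsn = 11
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.1'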



Answer 3:

What worked for me was catching IncompleteRead as an exception and harvesting the data I managed to read in each iteration by putting it into a loop like the one below. (Note: I am using Python 3.4.1, and the urllib library changed between 2.7 and 3.4.)

import http.client
import json
import urllib.request

# This snippet lives inside a function, hence the return statement.
try:
    requestObj = urllib.request.urlopen(url, data)
    responseJSON = ""
    while True:
        try:
            responseJSONpart = requestObj.read()
        except http.client.IncompleteRead as icread:
            # Keep the partial data and try reading again.
            responseJSON = responseJSON + icread.partial.decode('utf-8')
            continue
        else:
            responseJSON = responseJSON + responseJSONpart.decode('utf-8')
            break

    return json.loads(responseJSON)

except Exception as RESTex:
    print("Exception occurred making REST call: " + RESTex.__str__())


Answer 4:

I found that my virus scanner/firewall was causing this problem: the "Online Shield" component of AVG.



Answer 5:

You can use requests instead of urllib2. requests is based on urllib3, so it rarely runs into this kind of problem.
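A minimal sketch of what that could look like for the URL from the question (the retry loop, the timeout value, and the exception handling are illustrative additions, not part of the original answer):

import requests

url = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"

page = None
for attempt in range(3):  # retry a few times in case the server closes the connection early
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        page = resp.text
        break
    except requests.exceptions.RequestException as e:
        print("attempt {} failed: {}".format(attempt + 1, e))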
