How to handle IncompleteRead in Python

Kyle

The link you included in your question is simply a wrapper that executes urllib's read() function and catches any incomplete-read exceptions for you. If you don't want to implement that entire patch, you can just wrap the read in a try/except block wherever you read your links. For example:

import urllib2
import httplib

try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead, e:
    # e.partial holds whatever data was received before the failure
    page = e.partial

For Python 3:

import http.client
from urllib import request

try:
    page = request.urlopen(urls).read()
except http.client.IncompleteRead as e:
    page = e.partial
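
If you need this in more than one place, the same idea can be packaged as a small helper along the lines of the wrapper mentioned above; here is a minimal Python 3 sketch (the name read_or_partial is just illustrative):

import http.client
from urllib import request

def read_or_partial(url):
    """Read a URL, falling back to whatever partial data arrived."""
    try:
        return request.urlopen(url).read()
    except http.client.IncompleteRead as e:
        # e.partial holds the bytes received before the connection broke
        return e.partial

page = read_or_partial('http://example.com/')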

I found that in my case, sending the request as HTTP/1.0 fixed the problem. Adding this:

import httplib

# force HTTP/1.0 for every httplib-based request
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

After that I make the request:

req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()

Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):

httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'

The trick is to use HTTP/1.0 instead of the default HTTP/1.1. HTTP/1.1 can handle chunked responses, but for some reason the web server doesn't, so we make the request with HTTP/1.0.
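
If you only want the downgrade to apply to a single request, the same trick can be wrapped in a context manager that restores HTTP/1.1 afterwards; a minimal sketch for Python 2 (the name http_1_0 is just illustrative; in Python 3 the same attributes live on http.client.HTTPConnection):

import contextlib
import httplib
import urllib2

@contextlib.contextmanager
def http_1_0():
    # temporarily force HTTP/1.0 for all httplib-based requests
    httplib.HTTPConnection._http_vsn = 10
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
    try:
        yield
    finally:
        # restore the default HTTP/1.1
        httplib.HTTPConnection._http_vsn = 11
        httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'

with http_1_0():
    img = urllib2.urlopen(url).read()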

What worked for me is catching IncompleteRead as an exception and harvesting the data you managed to read in each iteration by putting the reads into a loop like below. (Note: I am using Python 3.4.1, and the urllib library changed between 2.7 and 3.4.)

import http.client
import json
import urllib.request

try:
    requestObj = urllib.request.urlopen(url, data)
    responseJSON = ""
    while True:
        try:
            responseJSONpart = requestObj.read()
        except http.client.IncompleteRead as icread:
            # keep the partial data and try reading again
            responseJSON = responseJSON + icread.partial.decode('utf-8')
            continue
        else:
            responseJSON = responseJSON + responseJSONpart.decode('utf-8')
            break

    return json.loads(responseJSON)

except Exception as RESTex:
    print("Exception occurred making REST call: " + RESTex.__str__())

I found that my virus scanner/firewall was causing this problem, specifically the "Online Shield" component of AVG.

You can use requests instead of urllib2. requests is based on urllib3, so it rarely has any problems. Put it in a loop to try it 3 times, and it will be much more robust. You can use it this way:

import inspect
import sys
import time

import requests

msg = None
for i in [1, 2, 3]:
    try:
        r = requests.get(self.crawling, timeout=30)
        msg = r.text
        if msg: break
    except Exception as e:
        sys.stderr.write('Got error when requesting URL "' + self.crawling + '": ' + str(e) + '\n')
        if i == 3:
            sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL "{1}" ==> {2}\n'.format(
                inspect.getframeinfo(inspect.currentframe()), self.crawling, e))
            raise e
        time.sleep(10 * (i - 1))
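
Alternatively, requests can delegate the retrying to urllib3's Retry class instead of a hand-written loop; a sketch (the retry parameters are only examples, and this covers failed connections and retryable status codes rather than a body that gets cut off mid-download):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])
# apply the retry policy to both http:// and https:// URLs
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

r = session.get('http://example.com/', timeout=30)
msg = r.text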

I tried all of these solutions and none of them worked for me. What actually did work was to use http.client (Python 3) directly instead of urllib:

import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
r1 = conn.getresponse()
page = r1.read().decode('utf-8')

This works perfectly every time, whereas with urllib it was raising an IncompleteRead exception every time.
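
If the server still cuts the connection short, http.client also lets you read the body in small chunks and keep whatever did arrive; a sketch (the chunk size is arbitrary):

import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
resp = conn.getresponse()

parts = []
try:
    while True:
        chunk = resp.read(8192)  # read in pieces instead of all at once
        if not chunk:
            break
        parts.append(chunk)
except http.client.IncompleteRead as e:
    parts.append(e.partial)  # keep the bytes that did arrive

page = b"".join(parts).decode('utf-8', errors='replace')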

I just add one more exception to get past this problem, like this:

import logging

import requests

try:
    r = requests.get(url, timeout=timeout)
except (requests.exceptions.ChunkedEncodingError, requests.ConnectionError) as e:
    logging.error("There is an error: %s" % e)