Question
I have an iterator which is supposed to run for several days. I want errors to be caught and reported, and then I want the iterator to continue. Or the whole process can start over.
Here's the function:
def get_units(self, scraper):
    units = scraper.get_units()
    i = 0
    while True:
        try:
            unit = units.next()
        except StopIteration:
            if i == 0:
                log.error("Scraper returned 0 units", {'scraper': scraper})
            break
        except:
            traceback.print_exc()
            log.warning("Exception occurred in get_units", extra={'scraper': scraper, 'iteration': i})
        else:
            yield unit
            i += 1
Because scraper could be one of many variants of code, it can't be trusted and I don't want to handle the errors there. But when an error occurs in units.next(), the whole thing stops. I suspect this is because an iterator throws a StopIteration when one of its iterations fails.
Here's the output (only the last lines):
[2012-11-29 14:11:12 /home/amcat/amcat/scraping/scraper.py:135 DEBUG] Scraping unit <Element div at 0x4258c710>
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article Counter-Strike: Global Offensive Update Released
Traceback (most recent call last):
  File "/home/amcat/amcat/scraping/controller.py", line 101, in get_units
    unit = units.next()
  File "/home/amcat/amcat/scraping/scraper.py", line 114, in get_units
    for unit in self._get_units():
  File "/home/amcat/scraping/games/steamcommunity.py", line 90, in _get_units
    app_doc = self.getdoc(url,urlencode(form))
  File "/home/amcat/amcat/scraping/scraper.py", line 231, in getdoc
    return self.opener.getdoc(url, encoding)
  File "/home/amcat/amcat/scraping/htmltools.py", line 54, in getdoc
    response = self.opener.open(url, encoding)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
[2012-11-29 14:11:14 /home/amcat/amcat/scraping/controller.py:110 WARNING] Exception occurred in get_units
...code ends...
So how can I prevent the iteration from stopping when an error occurs?
EDIT: here's the code within get_units():
def get_units(self):
    """
    Split the scraping job into a number of 'units' that can be processed independently
    of each other.

    @return: a sequence of arbitrary objects to be passed to scrape_unit
    """
    self._initialize()
    for unit in self._get_units():
        yield unit
And here's a simplified _get_units():
INDEX_URL = "http://www.steamcommunity.com"

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # returns an lxml.etree document
    for a in doc.cssselect("div.discussion a"):
        link = a.get('href')
        yield link
EDIT: follow-up question: Alter each for-loop in a function to have error handling executed automatically after each failed iteration
Answer 1:
StopIteration is raised by the next() method of a generator when there are no more items. It has nothing to do with errors inside the generator/iterator.

Another thing to note is that, depending on the type of your iterator, it might not be able to resume after an exception. If the iterator is an object with a next method, it will work. However, if it's actually a generator, it won't.

As far as I can tell, this is the only reason why your iteration doesn't continue after an error from units.next(). That is, units.next() fails, and the next time you call it, it's not able to resume, so it signals that it's done by throwing a StopIteration exception.

Basically, you'd have to show us the code inside scraper.get_units() for us to understand why the loop is not able to continue after an error inside a single iteration. If get_units() is implemented as a generator function, it's clear. If not, something else might be preventing it from resuming.
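For example, a class-based iterator keeps its state between calls, so it can carry on after one next() call raises (a toy sketch in the Python 2 style of the question's code, not taken from the question itself):

class Numbers(object):
    """Iterator implemented as a class with an explicit next() method."""
    def __init__(self):
        self.i = 0

    def __iter__(self):
        return self

    def next(self):
        self.i += 1
        if self.i > 5:
            raise StopIteration
        if self.i == 3:
            raise ValueError("unit 3 is broken")  # simulate one bad unit
        return self.i

numbers = Numbers()
while True:
    try:
        print numbers.next()
    except StopIteration:
        break
    except ValueError:
        pass  # the state (self.i) survives, so iteration resumes at 4

This prints 1, 2, 4 and 5: the failing unit is simply skipped, which is exactly what a generator cannot do.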
UPDATE: explaining what a generator function is:
class Scraper(object):
    def get_units(self):
        for i in some_stuff:
            bla = do_some_processing()
            bla *= 2  # random stuff
            yield bla
Now, when you call Scraper().get_units(), instead of running the entire function, it returns a generator object. Calling next() on it runs the function up to the first yield, and so on. Now if an error occurs ANYWHERE inside get_units, the generator is tainted, so to speak, and the next time you call next(), it will raise StopIteration, just as if it had run out of items to give you.
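A quick demonstration of that behaviour (toy generator, not from the question):

def broken_gen():
    for i in [1, 2, 3]:
        if i == 2:
            raise ValueError("unit 2 is broken")
        yield i

g = broken_gen()
print g.next()   # 1
try:
    g.next()     # the ValueError propagates out of the generator...
except ValueError:
    pass
print list(g)    # []: ...and the generator is now exhausted for good;
                 # every further next() call raises StopIteration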
Reading http://www.dabeaz.com/generators/ (and http://www.dabeaz.com/coroutines/) is strongly recommended.
UPDATE 2: A possible solution: https://gist.github.com/4175802
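I can't speak for the exact contents of the gist, but a common pattern for this situation is to catch exceptions inside the generator itself, around the fallible per-unit work, so they never escape and kill it. A minimal sketch based on the simplified _get_units above (process_link and the logging setup are hypothetical stand-ins, and the function would live on the scraper class):

import logging

log = logging.getLogger(__name__)

INDEX_URL = "http://www.steamcommunity.com"

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # if the index itself fails, the run still aborts
    for a in doc.cssselect("div.discussion a"):
        try:
            # do all fallible per-link work here, before the yield...
            unit = self.process_link(a.get('href'))  # hypothetical helper
        except Exception:
            # ...so the error is logged and swallowed, and the generator
            # stays alive for the next link
            log.exception("Skipping link %r", a.get('href'))
            continue
        yield unit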
Source: https://stackoverflow.com/questions/13645112/catch-errors-within-generator-and-continue-afterwards