Question
I have an iterator which is supposed to run for several days. I want errors to be caught and reported, and then I want the iterator to continue. Or the whole process can start over.
Here's the function:
def get_units(self, scraper):
    units = scraper.get_units()
    i = 0
    while True:
        try:
            unit = units.next()
        except StopIteration:
            if i == 0:
                log.error("Scraper returned 0 units", {'scraper': scraper})
            break
        except:
            traceback.print_exc()
            log.warning("Exception occurred in get_units", extra={'scraper': scraper, 'iteration': i})
        else:
            yield unit
            i += 1
Because scraper could be one of many variants of code, it can't be trusted and I don't want to handle the errors there. But when an error occurs in units.next(), the whole thing stops. I suspect this is because an iterator throws a StopIteration when one of its iterations fails.
Here's the output (only the last lines):
[2012-11-29 14:11:12 /home/amcat/amcat/scraping/scraper.py:135 DEBUG] Scraping unit <Element div at 0x4258c710>
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article
[2012-11-29 14:11:13 /home/amcat/amcat/scraping/scraper.py:138 DEBUG] .. yields article Counter-Strike: Global Offensive Update Released
Traceback (most recent call last):
  File "/home/amcat/amcat/scraping/controller.py", line 101, in get_units
    unit = units.next()
  File "/home/amcat/amcat/scraping/scraper.py", line 114, in get_units
    for unit in self._get_units():
  File "/home/amcat/scraping/games/steamcommunity.py", line 90, in _get_units
    app_doc = self.getdoc(url,urlencode(form))
  File "/home/amcat/amcat/scraping/scraper.py", line 231, in getdoc
    return self.opener.getdoc(url, encoding)
  File "/home/amcat/amcat/scraping/htmltools.py", line 54, in getdoc
    response = self.opener.open(url, encoding)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
[2012-11-29 14:11:14 /home/amcat/amcat/scraping/controller.py:110 WARNING] Exception occurred in get_units
...code ends...
So how can I prevent the iteration from stopping when an error occurs?
EDIT: here's the code within get_units():
def get_units(self):
    """
    Split the scraping job into a number of 'units' that can be processed independently
    of each other.

    @return: a sequence of arbitrary objects to be passed to scrape_unit
    """
    self._initialize()
    for unit in self._get_units():
        yield unit
And here's a simplified _get_units():
INDEX_URL = "http://www.steamcommunity.com"

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # returns an lxml.etree document
    for a in doc.cssselect("div.discussion a"):
        link = a.get('href')
        yield link
EDIT: follow-up question: Alter each for-loop in a function to have error handling executed automatically after each failed iteration
Answer 1:
StopIteration is raised by the next() method of a generator when there are no more items. It has nothing to do with errors inside the generator/iterator.

Another thing to note is that, depending on the type of your iterator, it might not be able to resume after an exception. If the iterator is an object with a next method, it will work. However, if it's actually a generator, it won't.

As far as I can tell, this is the only reason why your iteration doesn't continue after an error from units.next(). That is, units.next() fails, and the next time you call it, it's not able to resume, so it signals that it's done by throwing a StopIteration exception.

Basically, you'd have to show us the code inside scraper.get_units() for us to understand why the loop is not able to continue after an error inside a single iteration. If get_units() is implemented as a generator function, it's clear. If not, something else might be preventing it from resuming.
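For example, a class-based iterator keeps its state between calls, so it can carry on after one next() call raises (a toy sketch in the Python 2 style of the question's code, not taken from the question itself):

class Numbers(object):
    """Iterator implemented as a class with an explicit next() method."""
    def __init__(self):
        self.i = 0

    def __iter__(self):
        return self

    def next(self):
        self.i += 1
        if self.i > 5:
            raise StopIteration
        if self.i == 3:
            raise ValueError("unit 3 is broken")  # simulate one bad unit
        return self.i

numbers = Numbers()
while True:
    try:
        print numbers.next()
    except StopIteration:
        break
    except ValueError:
        pass  # the state (self.i) survives, so iteration resumes at 4

This prints 1, 2, 4 and 5: the failing unit is simply skipped, which is exactly what a generator cannot do.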
UPDATE: explaining what a generator function is:
class Scraper(object):
    def get_units(self):
        for i in some_stuff:
            bla = do_some_processing()
            bla *= 2  # random stuff
            yield bla
Now, when you call Scraper().get_units(), instead of running the entire function, it returns a generator object. Calling next() on it runs the function up to the first yield, and so on. Now if an error occurs ANYWHERE inside get_units, the generator is tainted, so to speak, and the next time you call next(), it will raise StopIteration, just as if it had run out of items to give you.
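A quick demonstration of that behaviour (toy generator, not from the question):

def broken_gen():
    for i in [1, 2, 3]:
        if i == 2:
            raise ValueError("unit 2 is broken")
        yield i

g = broken_gen()
print g.next()   # 1
try:
    g.next()     # the ValueError propagates out of the generator...
except ValueError:
    pass
print list(g)    # []: ...and the generator is now exhausted for good;
                 # every further next() call raises StopIteration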
Reading http://www.dabeaz.com/generators/ (and http://www.dabeaz.com/coroutines/) is strongly recommended.
UPDATE 2: A possible solution: https://gist.github.com/4175802
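I can't speak for the exact contents of the gist, but a common pattern for this situation is to catch exceptions inside the generator itself, around the fallible per-unit work, so they never escape and kill it. A minimal sketch based on the simplified _get_units above (process_link and the logging setup are hypothetical stand-ins, and the function would live on the scraper class):

import logging

log = logging.getLogger(__name__)

INDEX_URL = "http://www.steamcommunity.com"

def _get_units(self):
    doc = self.getdoc(INDEX_URL)  # if the index itself fails, the run still aborts
    for a in doc.cssselect("div.discussion a"):
        try:
            # do all fallible per-link work here, before the yield...
            unit = self.process_link(a.get('href'))  # hypothetical helper
        except Exception:
            # ...so the error is logged and swallowed, and the generator
            # stays alive for the next link
            log.exception("Skipping link %r", a.get('href'))
            continue
        yield unit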
Source: https://stackoverflow.com/questions/13645112/catch-errors-within-generator-and-continue-afterwards