Scrapy gets stuck crawling a long list of urls

问题

I am scraping a large list of urls (1000-ish) and after a set time the crawler gets stuck with crawling 0 pages/min. The problem always occurs at the same spot when crawling. The list of urls is retrieved from a MySQL database. I am fairly new to python and scrapy so I don't know where to start debugging, and I fear that due to my inexperience the code itself is also a bit of a mess. Any pointers to where the issue lies are appreciated.

I used to retrieve the entire list of urls in one go, and the crawler worked fine. However I had problems with writing the results back into the database and I didn't want to read the whole large list of urls into the memory, so I changed it to iterate through the database one url at a time, where the problem occurred. I am fairly certain the url itself isn't the issue, because when I try to start the crawling from the problem url, it works without issue, getting stuck further down the line in a different, but consistent spot.

The relevant parts of the code are as follow. Note that the script is supposed to be run as a standalone script, which is why I define the necessary settings in the spider itself.

class MySpider(CrawlSpider):
    name = "mySpider"
    item = []
    #spider settings
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DEPTH_LIMIT': 1,
        'DNS_TIMEOUT': 5,
        'DOWNLOAD_TIMEOUT':5,
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1
    }


    def start_requests(self):

        while i < n_urls:
            urllist = "SELECT url FROM database WHERE id=" + i
            cursor = db.cursor()
            cursor.execute(urllist)
            urls = cursor.fetchall()
            urls = [i[0] for i in urls] #fetch url from inside list of tuples
            urls = str(urls[0]) #transform url into string from list
            yield Request(urls, callback=self.parse, errback=self.errback)

    def errback(self, failure):
        global i
        sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
        val = ('Error', str(j))
        cursor.execute(sql, val)
        db.commit()
        i += 1


    def parse(self, response):
        global i
        item = myItem()
        item["result"] = response.xpath("//item to search")
        if item["result"] is None or len(item["result"]) == 0:
            sql = "UPDATE db SET, item = %s, scrape_time = now() WHERE id = %s"
            val = ('None', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
        else:
            sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
            val = ('Item', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1

The scraper gets stuck showing the following message:

2019-01-14 15:10:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET someUrl> from <GET anotherUrl>
2019-01-14 15:11:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 9 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:12:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:13:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:14:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:15:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:16:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Everything works fine up until this point. Any help you could give me is appreciated!

回答1:

The reason scrapy syas 0 item is that it counts the yielded data while you are not yielding anything but inserting in your database.

回答2:

I just had this happen to me, so I wanted to share what caused the bug, in case someone encounters the exact same issue.

Apparently, if you don't specify a callback for a Request, it defaults to the spider's parse method as a callback (my intention was to not have a callback at all for those requests).

In my spider, I used the parse method to make most of the Requests, so this behavior caused many unnecessary requests that eventually led to Scrapy crashing. Simply adding an empty callback function (lambda a: None) for those requests solved my issue.

来源：https://stackoverflow.com/questions/54184403/scrapy-gets-stuck-crawling-a-long-list-of-urls

标签

python

scrapy