Using threads within a scrapy spider

Submitted by 半腔热情 on 2020-08-08 04:41:32

Question


Is it possible to use multiple threads within a Scrapy spider? For example, let's say I have built a spider which crawls blog topics and saves all the messages within each. I would like to couple every topic to a thread from a pool, and have each thread crawl all the needed information. Each thread would crawl a different topic that way.


Answer 1:


Scrapy itself is single-threaded, so you cannot use multiple threads within a spider. You can, however, run multiple spiders at the same time and tune concurrency settings such as CONCURRENT_REQUESTS, which may help you (see Common Practices).

Scrapy does not use multithreading because it is built on Twisted, an asynchronous networking framework.




Answer 2:


The accepted answer is not 100% correct.

Scrapy runs on Twisted, and it supports returning Deferreds from the pipeline's process_item method.

This means you can create a Deferred in the pipeline, for example with threads.deferToThread, which runs your CPU-bound code in the reactor thread pool. Be careful to make correct use of reactor.callFromThread where appropriate. I use a DeferredSemaphore to avoid exhausting all threads from the thread pool, but setting good values for the settings mentioned below might also work.

http://twistedmatrix.com/documents/13.2.0/core/howto/threading.html

Here is a method from one of my item pipelines:

def process_item(self, item, spider):
    def handle_error(failure):
        raise DropItem("error processing %s" % item)

    # Run the CPU-intense work in the reactor thread pool, limited by a
    # DeferredSemaphore (self.sem) so the pool is not exhausted.
    d = self.sem.run(threads.deferToThread, self.do_cpu_intense_work, item)
    d.addCallback(lambda _: item)
    d.addErrback(handle_error)
    return d

You may want to keep an eye on

REACTOR_THREADPOOL_MAXSIZE, as described here: http://doc.scrapy.org/en/latest/topics/settings.html#reactor-threadpool-maxsize

CONCURRENT_ITEMS, as described here: http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items
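Both settings go in the project's settings.py; a sketch with illustrative values (these are examples, not recommendations):

```python
# settings.py (illustrative values, tune for your workload)
REACTOR_THREADPOOL_MAXSIZE = 20  # size of the Twisted reactor thread pool
CONCURRENT_ITEMS = 50            # max items processed concurrently per response
```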

You are still subject to the Python GIL, though, which means CPU-intense tasks will not really run in parallel on multiple CPUs; they will only appear to. The GIL is only released for I/O. But you can use this method to call an I/O-blocking third-party lib (e.g. web-service calls) inside your item pipeline without blocking the reactor thread.
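The I/O point can be illustrated with the standard library alone: the GIL is released while a thread waits on I/O (simulated here with time.sleep), so blocking calls overlap even though CPU-bound Python bytecode cannot run in parallel.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(i):
    # Stands in for a blocking call such as a web-service request.
    time.sleep(0.2)
    return i


start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(blocking_call, range(4)))
elapsed = time.monotonic() - start

# The four 0.2 s waits overlap, so total wall time stays well
# under the 0.8 s a sequential run would take.
print(results, elapsed)
```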



Source: https://stackoverflow.com/questions/29474659/using-threads-within-a-scrapy-spider
