Question
Is it possible to use multiple threads within a Scrapy spider? For example, let's say I have built a spider that crawls blog topics and saves all the messages within each one. I would like to couple every topic to a thread from a pool, and have that thread crawl all the needed information. That way, each thread would crawl a different topic.
Answer 1:
Scrapy itself is single-threaded, and as a result you cannot use multiple threads within a spider. You can, however, make use of multiple spiders at the same time (CONCURRENT_REQUESTS), which may help you (see Common Practices).
Scrapy does not use multithreading because it is built on Twisted, an asynchronous HTTP framework.
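As a rough illustration of that suggestion (my own sketch, not part of the answer; TopicSpider and the example URLs are hypothetical), the Common Practices page describes running several crawls in one process with CrawlerProcess, which maps nicely onto "one crawl per topic":

import scrapy
from scrapy.crawler import CrawlerProcess

class TopicSpider(scrapy.Spider):
    # hypothetical spider that crawls a single blog topic
    name = "topic"

    def __init__(self, topic_url=None, *args, **kwargs):
        super(TopicSpider, self).__init__(*args, **kwargs)
        self.start_urls = [topic_url]

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").extract_first()}

process = CrawlerProcess(settings={"CONCURRENT_REQUESTS": 16})
# schedule one crawl per topic; they all run concurrently in the same Twisted reactor
for url in ["http://example.com/topic/1", "http://example.com/topic/2"]:
    process.crawl(TopicSpider, topic_url=url)
process.start()  # blocks until every scheduled crawl has finished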
Answer 2:
The accepted answer is not 100% correct.
Scrapy runs on Twisted and supports returning Deferreds from the pipeline's process_item method.
This means you can create a Deferred in the pipeline, for example via threads.deferToThread, which will run your CPU-bound code inside the reactor thread pool. Be careful to make correct use of callFromThread where appropriate. I use a semaphore to avoid exhausting the thread pool, but setting good values for the settings mentioned below might also work.
http://twistedmatrix.com/documents/13.2.0/core/howto/threading.html
Here is a method from one of my Item pipelines:
# imports used by this method (at the top of the pipeline module)
from scrapy.exceptions import DropItem
from twisted.internet import threads

def process_item(self, item, spider):
    def handle_error(item):
        raise DropItem("error processing %s" % item)

    # run the CPU-intensive work in the reactor thread pool,
    # throttled by a semaphore (self.sem)
    d = self.sem.run(threads.deferToThread, self.do_cpu_intense_work, item)
    d.addCallback(lambda _: item)
    d.addErrback(lambda _: handle_error(item))
    return d
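For completeness, here is a minimal sketch (my assumption about the surrounding class, not code from the original answer) of how self.sem might be set up: a Twisted DeferredSemaphore that caps how many items are handed to the reactor thread pool at once, with do_cpu_intense_work as a placeholder for the blocking step.

from twisted.internet import defer

class CpuIntensePipeline(object):
    def __init__(self):
        # allow at most 4 items in the reactor thread pool concurrently;
        # tune this together with the settings mentioned below
        self.sem = defer.DeferredSemaphore(4)

    def do_cpu_intense_work(self, item):
        # placeholder for the CPU-bound or I/O-blocking work
        return item

    # process_item as shown above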
You may want to keep an eye on these settings:
REACTOR_THREADPOOL_MAXSIZE, as described here: http://doc.scrapy.org/en/latest/topics/settings.html#reactor-threadpool-maxsize
CONCURRENT_ITEMS, as described here: http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items
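As an illustration (the values are only examples, not recommendations from the answer), both settings live in your project's settings.py:

# settings.py
REACTOR_THREADPOOL_MAXSIZE = 20  # size of the Twisted reactor thread pool
CONCURRENT_ITEMS = 100           # max items processed in parallel per response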
You are still facing the Python GIL, though, which means CPU-intensive tasks will not really run in parallel on multiple CPUs; they will only appear to. The GIL is released only for I/O. But you can use this method to call an I/O-blocking third-party library (e.g. webservice calls) inside your item pipeline without blocking the reactor thread.
Source: https://stackoverflow.com/questions/29474659/using-threads-within-a-scrapy-spider