Scrapy: use both cores in the system

Question

I am running Scrapy through its internal API, and everything has worked well so far. However, I noticed that it is not fully using the concurrency of 16 set in the settings. I have set the delay to 0 and changed everything else I could. Looking at the HTTP requests being sent, it is clear that Scrapy is not downloading 16 sites at all times. At some points it is downloading only 3 to 4 links, even though the queue is not empty.

When I checked core usage, I found that of the 2 cores, one is at 100% and the other is mostly idle.

That is when I learned that the Twisted library, on top of which Scrapy is built, is single-threaded, and that is why it uses only a single core.

Is there any workaround to convince Scrapy to use all the cores?

Answer 1:

Scrapy is based on the Twisted framework. Twisted is an event-loop-based framework, so it does scheduled processing rather than multiprocessing. That is why your Scrapy crawl runs in just one process. You can technically start two spiders from one script using the code below (they still share a single process and event loop):

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

Nothing stops you from using the same class for both spiders.

The process.crawl method takes *args and **kwargs to pass to your spider, so you can parametrize your spiders this way. Say your spider is supposed to crawl 100 pages: you can add start and end parameters to your spider class and do something like the following:

process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
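The start/end split above can be computed instead of hard-coded. The following is a minimal sketch, not part of the original answer: split_range and schedule_slices are hypothetical helper names, and the only Scrapy call assumed is process.crawl(spider_cls, **kwargs) as used above.

```python
def split_range(first, last, parts):
    """Split the inclusive range [first, last] into contiguous (start, end) slices."""
    total = last - first + 1
    if parts < 1 or total < parts:
        raise ValueError("need at least one page per slice")
    base, extra = divmod(total, parts)
    slices = []
    start = first
    for index in range(parts):
        # The first `extra` slices take one additional page each.
        size = base + (1 if index < extra else 0)
        slices.append((start, start + size - 1))
        start += size
    return slices


def schedule_slices(process, spider_cls, first, last, parts):
    """Queue one crawl per slice on a CrawlerProcess-like object."""
    for start, end in split_range(first, last, parts):
        # The keyword arguments are forwarded to the spider's constructor.
        process.crawl(spider_cls, start=start, end=end)
```

With these helpers, schedule_slices(process, YourSpider, 0, 100, 2) queues the same two crawls as the snippet above. One caveat, as far as I know: recent Scrapy releases define a start() method on the base Spider, so a keyword named start may clash there, and a name such as first_page could be safer.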

Note that each crawler has its own settings, so if your spider is set to 16 concurrent requests, the two combined will effectively have 32.

In most cases scraping is less about CPU and more about network access, which is non-blocking in Twisted, so I am not sure this would give you much of an advantage over setting CONCURRENT_REQUESTS to 32 in a single spider.
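For the single-spider route, a sketch of the overrides might look like the dictionary below. The setting names are real Scrapy settings, but the values are illustrative, and the name HIGH_CONCURRENCY_SETTINGS is made up for this example. The per-domain cap is included because, as far as I know, it defaults to a lower value than the global one and could be one reason fewer than 16 requests are in flight; the question does not confirm that this is the cause.

```python
# Illustrative overrides for a single spider; tune the numbers for your targets.
HIGH_CONCURRENCY_SETTINGS = {
    # Global cap on requests the downloader handles at once.
    "CONCURRENT_REQUESTS": 32,
    # Per-domain cap; if most URLs share one domain, this is the limit that bites.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 32,
    # No artificial pause between requests to the same site.
    "DOWNLOAD_DELAY": 0,
    # AutoThrottle adjusts delays dynamically, so keep it off for this experiment.
    "AUTOTHROTTLE_ENABLED": False,
}
```

The dictionary can be assigned to the spider's custom_settings class attribute or passed as CrawlerProcess(settings=HIGH_CONCURRENCY_SETTINGS).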

PS: Consider reading this page to understand more https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
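If the goal really is to keep both cores busy, the crawls have to run in separate OS processes rather than in one CrawlerProcess. The sketch below is a different technique from the code in this answer: it uses the standard multiprocessing module to start one CrawlerProcess per slice. The names crawl_slice and build_workers are hypothetical, and the spider class is assumed to be defined at module level and to accept start and end keyword arguments.

```python
import multiprocessing


def crawl_slice(spider_cls, start, end):
    """Run one crawl to completion; intended as the target of a child process."""
    # Imported inside the function so the import happens in the child process.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(spider_cls, start=start, end=end)
    process.start()  # blocks until this slice is finished


def build_workers(spider_cls, slices):
    """Create (but do not start) one OS process per (start, end) slice."""
    return [
        multiprocessing.Process(target=crawl_slice, args=(spider_cls, start, end))
        for start, end in slices
    ]
```

A caller would build the workers with, for example, build_workers(YourSpider, [(0, 50), (51, 100)]), then call start() on each and join() on each, all under an if __name__ == "__main__": guard. Whether this beats a single process with higher concurrency depends on whether the crawl is actually CPU-bound.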

Answer 2:

Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See the max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.
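To get a feel for how the two options interact, here is a small sketch of the rule as I read it in the Scrapyd documentation: when max_proc is unset or 0, the limit is the number of CPUs multiplied by max_proc_per_cpu, which I believe defaults to 4. The function effective_max_proc is a hypothetical illustration, not part of Scrapyd.

```python
import os


def effective_max_proc(max_proc=0, max_proc_per_cpu=4, cpu_count=None):
    """Estimate how many Scrapy processes Scrapyd would run concurrently."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if max_proc:
        # An explicit, non-zero max_proc takes precedence.
        return max_proc
    return cpu_count * max_proc_per_cpu
```

On the 2-core machine from the question, effective_max_proc(cpu_count=2) gives 8 under these assumed defaults, so several jobs could run side by side without extra configuration.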

Source: https://stackoverflow.com/questions/45660835/scrapy-use-both-the-core-in-the-system

Tags

scrapy

twisted