How do I improve scrapy's download speed?

Submitted by 可紊 on 2019-12-20 09:25:27

Question


I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important.

Unfortunately, when I profiled scrapy's speed, I found I'm only getting about 2 pages per second on average. I've previously written my own multithreaded spiders that did hundreds of pages per second, so I thought for sure scrapy's use of twisted, etc. would be capable of similar magic.

How do I speed scrapy up? I really like the framework, but this performance issue could be a deal-breaker for me.

Here's the relevant part of the settings.py file. Is there some important setting I've missed?

LOG_ENABLED = False
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_IP = 8

A few parameters:

  • Using scrapy version 0.14
  • The project is deployed on an EC2 large instance, so there should be plenty of memory, CPU, and bandwidth to play with.
  • I'm scheduling crawls using the JSON protocol, keeping the crawler topped up with a few dozen concurrent crawls at any given time.
  • As I said at the beginning, I'm downloading pages from many sites, so remote server performance and CONCURRENT_REQUESTS_PER_IP shouldn't be a worry.
  • For the moment, I'm doing very little post-processing. No xpath; no regex; I'm just saving the url and a few basic statistics for each page. (This will change later once I get the basic performance kinks worked out.)

Answer 1:


I had this problem in the past, and I solved a large part of it with a "dirty" old trick.

Set up a local caching DNS server.

When you see high CPU usage while accessing many remote sites simultaneously, it is mostly because scrapy is busy resolving the hostnames in the URLs.

And please remember to change the DNS settings on the host (/etc/resolv.conf) so that it points to your LOCAL caching DNS server.
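Separately from the host-level cache, Scrapy has its own DNS-related settings. The fragment below is only a sketch: the setting names come from the current Scrapy documentation, I have not checked whether all of them exist in version 0.14, and the values are illustrative assumptions rather than tuned recommendations.

```python
# settings.py (sketch; values are illustrative assumptions, not recommendations)

# Keep Scrapy's in-memory DNS cache switched on (the documented default).
DNSCACHE_ENABLED = True

# Maximum number of hostnames kept in that in-memory cache.
DNSCACHE_SIZE = 10000

# Seconds to wait for a DNS answer before giving up on the lookup.
DNS_TIMEOUT = 60

# Size of the Twisted reactor thread pool. Scrapy's docs list the threaded
# DNS resolver among its users, so a broad crawl may benefit from a larger pool.
REACTOR_THREADPOOL_MAXSIZE = 20
```

These settings do not replace the local caching DNS server; they only affect lookups made inside the Scrapy process.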

The first requests will be slow, but as soon as the server starts caching and resolving becomes more efficient, you are going to see HUGE improvements.
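One rough way to check whether the cache is working is to time the same lookup several times: with a working local cache, the first lookup should be the slow one and the repeats should come back much faster. The helper below is a minimal sketch using only the standard library; `time_lookups` is a hypothetical name, and `example.com` in the usage comment is a placeholder host.

```python
import socket
import time


def time_lookups(hostname, resolve=socket.gethostbyname, repeats=3):
    """Resolve `hostname` `repeats` times; return each lookup's duration in seconds.

    `resolve` defaults to the system resolver, which follows /etc/resolv.conf,
    so the timings reflect whatever DNS server the host is configured to use.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        resolve(hostname)
        durations.append(time.perf_counter() - start)
    return durations


# Usage (placeholder host; substitute one from your own crawl list):
#   for i, seconds in enumerate(time_lookups("example.com"), start=1):
#       print(f"lookup {i}: {seconds * 1000:.1f} ms")
```

If every lookup takes roughly the same time, the host is probably still querying a remote resolver instead of the local cache.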

I hope this helps with your problem!



Source: https://stackoverflow.com/questions/12427451/how-do-i-improve-scrapys-download-speed
