How to limit scrapy request objects?


I solved my problem; the answer was really hard to track down, so I am posting it here in case anyone else comes across the same issue.

After sifting through the Scrapy code and referring back to the docs, I could see that Scrapy keeps all pending requests in memory by default (which I had already deduced), but the code in core.scheduler also checks whether a job directory has been configured, in which case pending requests are written to disk instead.
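For anyone curious, the decision is roughly as follows (a simplified sketch of what I remember from core.scheduler, not the actual source; the _dqpush/_mqpush names are only illustrative):

# Simplified sketch of the scheduler's enqueue logic (illustrative, not the real source).
# If a disk queue exists (i.e. a job directory is configured), pending requests are
# serialized to disk; otherwise they stay in the in-memory queue.
def enqueue_request(self, request):
    if self.dqs is not None and self._dqpush(request):
        return True        # request written to the disk queue under the job directory
    self._mqpush(request)  # no job directory (or request not serializable): keep it in memory
    return True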

So, if you run the Scrapy spider with a job directory set, it writes pending requests to disk and retrieves them from there instead of holding them all in memory.

$ scrapy crawl spider -s JOBDIR=somedirname
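If you prefer not to pass it on the command line every time, the same JOBDIR setting can also be set in settings.py or per spider via custom_settings (the spider name and directory below are just placeholders):

# settings.py: persist the pending request queue to disk instead of keeping it in RAM
JOBDIR = 'crawls/myspider-1'

Or on the spider itself:

import scrapy

class MySpider(scrapy.Spider):
    name = 'spider'
    custom_settings = {'JOBDIR': 'crawls/myspider-1'}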

When I do this and open the telnet console, I can see that the number of requests held in memory stays at around 25, while 100,000+ are written to disk, which is exactly how I wanted it to run.
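For reference, this is roughly how I check those numbers from the telnet console (est() is the console's built-in engine-status shortcut; the exact attribute paths can vary between Scrapy versions):

$ telnet localhost 6023
>>> est()                                  # prints the full engine status report
>>> len(engine.slot.scheduler.mqs)         # pending requests held in memory
>>> len(engine.slot.scheduler.dqs or [])   # pending requests serialized to disk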

It seems like this would be a common problem, given that crawling any large site yields multiple extractable links on every page. I am surprised it is not better documented or easier to find.

The Scrapy docs at http://doc.scrapy.org/en/latest/topics/jobs.html state that the main purpose of the job directory is pausing a crawl and resuming it later, but it works this way as well.
