Scrapy not crawling all the pages


Question


I am trying to crawl a site in a very basic manner, but Scrapy isn't crawling all the links. The scenario is as follows:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html

I am using the following rule in my CrawlSpider:

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
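
For context, a minimal spider along these lines, assuming the Scrapy 0.14-era contrib imports (the class and callback names are placeholders), looks roughly like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/main_page.html']

    rules = (
        # Follow every extracted link and pass each response to parse_item.
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder callback; real item extraction would go here.
        self.log('Visited %s' % response.url)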

But the crawl results are as follows:

DEBUG: Crawled (200) <GET http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages.

NB - I have set the crawl order to BFO (breadth-first order), as indicated in the Scrapy docs.

What am I missing?


Answer 1:


I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent identified me as a Scrapy bot.

Try changing your user agent to that of a known browser and try again.

Another thing you might want to try is adding a delay. Some websites prevent scraping if the time between requests is too small. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.

More information about DOWNLOAD_DELAY at http://doc.scrapy.org/en/0.14/topics/settings.html
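
A sketch of both suggestions in settings.py (the user agent string and the delay value are just examples, not requirements):

# settings.py
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36'
DOWNLOAD_DELAY = 2   # seconds to wait between requests to the same website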




Answer 2:


Scrapy will by default filter out all duplicate requests.

You can circumvent this by using (example):

yield Request(url="test.com", callback=self.callback, dont_filter=True)

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

Also see the Request object documentation
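
Since the question uses a CrawlSpider rather than manually yielded requests, a sketch of one way to apply dont_filter there, assuming the Rule's process_request hook (which lets you modify each request the link extractor produces), is:

def set_dont_filter(request):
    # Return a copy of each extracted request with the dupefilter disabled.
    return request.replace(dont_filter=True)

rules = (
    Rule(SgmlLinkExtractor(allow=()), callback='parse_item',
         follow=True, process_request=set_dont_filter),
)

Keep the warning above in mind: with the link structure described in the question, disabling the duplicates filter will make the spider loop unless you add your own stopping condition.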




Answer 3:


Maybe a lot of the URLs are duplicates. Scrapy avoids duplicate requests because re-crawling them is inefficient. From your description, since your rule follows every URL, there are of course a lot of duplicates.

If you want to be sure and see the proof in the log, add this to your settings.py:

DUPEFILTER_DEBUG = True

You'll then see lines like this in the log:

2016-09-20 17:08:47 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.example.org/example.html>



Source: https://stackoverflow.com/questions/8381082/scrapy-not-crawling-all-the-pages
