Using multiple start_urls in CrawlSpider

Submitted by 孤者浪人 on 2020-01-16 18:38:07

Question


I'm using CrawlSpider to crawl a website. I have multiple start URLs, and each page contains a "next" link pointing to another, similar page. I use a rule to follow these next pages.

rules = (
          Rule(SgmlLinkExtractor(allow = ('/',),
             restrict_xpaths=('//span[@class="next"]')),
             callback='parse_item',
             follow=True),
         )

When there is only one URL in start_urls, everything works. However, when there are many URLs in start_urls, I get "Ignoring response <404 a url> : HTTP status code is not handled or not allowed".
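One common cause of such 404/ignored responses is stray lines in the file the start URLs are read from: blank lines, a UTF-8 BOM on the first line, or relative paths. A minimal sketch of a cleaner (the function name `clean_start_urls` is mine, not from the question):

```python
def clean_start_urls(lines):
    """Keep only lines that look like absolute http(s) URLs.

    Blank lines, a BOM, or relative paths read from a category file
    are a frequent source of 404 or ignored responses in Scrapy.
    """
    urls = []
    for line in lines:
        url = line.strip().lstrip("\ufeff")  # drop whitespace and a UTF-8 BOM
        if url.startswith(("http://", "https://")):
            urls.append(url)
    return urls
```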

How can I finish all the "next" pages of the first URL in start_urls before moving on to the second URL?
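Scrapy schedules requests concurrently and does not promise to exhaust one start URL before touching the next, but you can bias the order with request priorities. A sketch of a helper (pure Python; the name `with_priorities` is mine, and the strict "finish first, then second" order is still not guaranteed by the scheduler):

```python
def with_priorities(urls):
    # Earlier start URLs get larger priorities, so Scrapy's scheduler
    # tends to dequeue them (and the "next" pages they spawn) first.
    highest = len(urls)
    return [(url, highest - i) for i, url in enumerate(urls)]

# In a spider this could be used roughly as:
# def start_requests(self):
#     for url, p in with_priorities(self.start_urls):
#         yield scrapy.Request(url, priority=p)
```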

Here is my code:

import codecs

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector


class DoubanSpider(CrawlSpider):
    name = "doubanBook"
    allowed_domains = ["book.douban.com"]

    # Read one start URL per line from category.txt, skipping blank lines.
    category = codecs.open("category.txt", "r", encoding="utf-8")
    start_urls = []
    for line in category:
        line = line.strip()
        if line:
            start_urls.append(line)
    category.close()

    rules = (
        Rule(SgmlLinkExtractor(allow=('/',),
                               restrict_xpaths=('//span[@class="next"]')),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        out = open("alllink.txt", "a")
        sites = sel.xpath('//ul/li/div[@class="info"]/h2')
        for site in sites:
            href = site.xpath('a/@href').extract()[0]
            title = site.xpath('a/@title').extract()[0]
            # Write the extracted link and title (the original wrote a placeholder).
            out.write(href + "\t" + title + "\n")
        out.close()
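Opening `alllink.txt` inside `parse_item` is fragile when many responses arrive concurrently; yielding items and letting a feed export or pipeline do the writing is the usual fix. Separately, the extraction logic itself can be unit-tested without running a crawl. A stdlib-only sketch that mirrors the XPath `//div[@class="info"]/h2/a` (the class name `BookLinkParser` is mine, and this simple depth counting assumes the markup is not deeply nested):

```python
from html.parser import HTMLParser


class BookLinkParser(HTMLParser):
    """Collects (href, title) pairs from <a> tags inside an <h2>
    under a <div class="info">, mimicking the spider's XPath."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_info = 0
        self._in_h2 = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "info":
            self._in_info += 1
        elif tag == "h2" and self._in_info:
            self._in_h2 += 1
        elif tag == "a" and self._in_h2 and "href" in attrs:
            self.links.append((attrs["href"], attrs.get("title", "")))

    def handle_endtag(self, tag):
        if tag == "div" and self._in_info:
            self._in_info -= 1
        elif tag == "h2" and self._in_h2:
            self._in_h2 -= 1
```

Usage: feed the response body to the parser and read `parser.links`; the same pairs the spider writes to `alllink.txt` come back as a list.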

Source: https://stackoverflow.com/questions/26686299/using-multiple-start-urls-in-crawlspider
