Scrapy start_urls

有刺的猬 2020-12-28 23:23

The script (below) from this tutorial contains two start_urls.

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirb         


        
6 Answers
  •  清歌不尽
    2020-12-29 00:06

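    If you use a rule to follow links (something Scrapy already implements), the spider will scrape the linked pages too. Note that rules are only processed by CrawlSpider, so the spider has to subclass that rather than BaseSpider. I hope this helps.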

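        # Legacy (pre-1.0) Scrapy import paths, as used in this thread.
        # CrawlSpider, not BaseSpider, is the class that processes `rules`.
        from scrapy.contrib.spiders import CrawlSpider, Rule
        from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
        from scrapy.selector import HtmlXPathSelector


        class Spider(CrawlSpider):
            name = 'my_spider'
            start_urls = ['http://www.domain.com/']
            allowed_domains = ['domain.com']
            # Empty allow/deny lists match every link; follow=True keeps crawling
            # from each page that is reached.
            rules = [Rule(SgmlLinkExtractor(allow=[], deny=[]), follow=True)]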
    
         ...
    
