Using one Scrapy spider for several websites

南方客 2020-12-08 11:47

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- this will instead be user-configurable.

4 Answers
  •  时光取名叫无心
    2020-12-08 12:19

    It is now straightforward to configure Scrapy for this:

    1. For the first URLs to visit, you can pass them as spider arguments with -a, and override the start_requests method to set up how the spider starts.

    2. You don't need to set the allowed_domains class attribute. If you omit it, the spider will allow every domain.

    It should end up looking something like:

    from scrapy import Spider, Request

    class MySpider(Spider):

        name = "myspider"

        def start_requests(self):
            # start_url is set from the -a command-line argument
            yield Request(self.start_url, callback=self.parse)

        def parse(self, response):
            ...
    

    and you should call it with:

    scrapy crawl myspider -a start_url="http://example.com"
    
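The same -a mechanism can carry several comma-separated values if you need more than one start URL or a configurable domain list. As a sketch (the argument names start_urls and allowed_domains are illustrative choices here, not anything Scrapy mandates), a small helper can split the CLI strings into the lists the spider needs:

```python
# Sketch: parse comma-separated -a arguments into the lists a
# user-configurable spider needs. Argument names are assumptions.

def parse_spider_args(start_urls="", allowed_domains=""):
    """Split comma-separated CLI strings into clean lists."""
    urls = [u.strip() for u in start_urls.split(",") if u.strip()]
    domains = [d.strip() for d in allowed_domains.split(",") if d.strip()]
    return urls, domains

# Inside a spider this could be wired up roughly as:
#
# class ConfigurableSpider(Spider):
#     name = "configurable"
#
#     def __init__(self, start_urls="", allowed_domains="", **kwargs):
#         super().__init__(**kwargs)
#         self.start_urls, self.allowed_domains = parse_spider_args(
#             start_urls, allowed_domains)
#
# and invoked as:
#
# scrapy crawl configurable -a start_urls="http://a.example,http://b.example" \
#     -a allowed_domains="a.example,b.example"
```

Because -a keyword arguments are passed to the spider's __init__ by default, splitting them there before the crawl starts is enough for both start_urls and the offsite filtering that reads allowed_domains.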
