I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- these will instead be configurable.
It is actually quite easy to configure Scrapy for this:
For the first URLs to visit, you can pass them as spider arguments with -a (every -a key=value pair is set as an attribute on the spider instance), and override the start_requests method to control how the spider starts.
You don't need to set the allowed_domains class variable on the spider. If you leave it out, the spider will allow every domain.
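
If you also want the allowed domains to be user-configurable rather than simply unrestricted, one approach (a sketch; the argument name allowed and the comma separator are my own conventions, not anything Scrapy-specific) is to fill allowed_domains from a spider argument in __init__:

    from scrapy import Spider

    class ConfigurableSpider(Spider):
        name = "configurable"

        def __init__(self, allowed=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # -a allowed="example.com,example.org" restricts the crawl;
            # omitting it keeps every domain allowed
            if allowed:
                self.allowed_domains = allowed.split(",")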
It should end up looking something like:
    from scrapy import Spider, Request

    class MySpider(Spider):
        name = "myspider"

        def start_requests(self):
            # start_url is set from the -a command-line argument
            yield Request(self.start_url, callback=self.parse)

        def parse(self, response):
            ...
and you should call it with:
scrapy crawl myspider -a start_url="http://example.com"
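
If users need to supply several seed URLs in one run, a minimal sketch, assuming a single comma-separated argument (the name urls and the separator are my own choices, nothing Scrapy-specific):

    from scrapy import Spider, Request

    class MultiStartSpider(Spider):
        name = "multistart"

        def start_requests(self):
            # "urls" arrives as one string from the -a urls=... argument
            for url in self.urls.split(","):
                yield Request(url, callback=self.parse)

        def parse(self, response):
            ...

which you would invoke as, for example:

    scrapy crawl multistart -a urls="http://example.com,http://example.org"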