I have set up a CrawlSpider that aggregates all outbound links (crawling from the start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
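For reference, the depth cap can be set per spider through the standard custom_settings attribute; a minimal sketch (the spider name is hypothetical):

    import scrapy

    class OutboundSpider(scrapy.Spider):  # hypothetical spider name
        name = 'outbound'
        custom_settings = {'DEPTH_LIMIT': 2}  # enforced by the built-in DepthMiddleware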
I have now achieved it without rules. I attached a meta attribute to every start_url and then simply check myself whether the extracted links belong to the original domain, sending out new requests accordingly.

To do that, override start_requests:
    from scrapy import Request

    def start_requests(self):
        # start_domains is a list parallel to start_urls holding each URL's domain
        return [Request(url, meta={'domain': domain}, callback=self.parse_item)
                for url, domain in zip(self.start_urls, self.start_domains)]
In the subsequent parsing methods we grab the meta attribute via domain = response.request.meta['domain'], compare the domain against each extracted link, and send out new requests ourselves.
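A minimal sketch of such a parsing method, assuming links are extracted with Scrapy's LinkExtractor and outbound links are emitted as plain dict items (the item shape and the endswith domain check are my assumptions):

    from urllib.parse import urlparse

    from scrapy import Request
    from scrapy.linkextractors import LinkExtractor

    def parse_item(self, response):
        domain = response.request.meta['domain']
        for link in LinkExtractor().extract_links(response):
            if urlparse(link.url).netloc.endswith(domain):
                # internal link: keep crawling, carrying the domain along
                yield Request(link.url, meta={'domain': domain},
                              callback=self.parse_item)
            else:
                # outbound link: record it (hypothetical item shape)
                yield {'source_domain': domain, 'outbound_url': link.url}

Since every request re-attaches the same meta dict, the original domain survives across the whole crawl tree rooted at each start URL, while DEPTH_LIMIT still caps how far the internal requests go.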