I have set up a CrawlSpider that aggregates all outbound links (crawling from start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
You would probably need to keep a data structure (e.g. a hash set) of URLs the crawler has already visited. Then it's just a matter of adding each URL to the set as you visit it and skipping any URL that is already in the set (since that means you have already visited it). There are more sophisticated approaches that could give you better performance, but they would also be harder to implement.
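A minimal sketch of that idea, assuming a Scrapy CrawlSpider; the spider name, start URL, and yielded item fields are placeholders you would replace with your own:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class OutboundLinkSpider(CrawlSpider):
    name = "outbound_links"                      # placeholder name
    start_urls = ["https://example.com"]         # placeholder start URL
    custom_settings = {"DEPTH_LIMIT": 2}

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_urls = set()  # hash-based set of URLs already processed

    def parse_item(self, response):
        # Skip URLs we have already seen
        if response.url in self.seen_urls:
            return
        self.seen_urls.add(response.url)
        yield {"url": response.url}
```

Note that Scrapy's scheduler already filters duplicate requests by default (via its dupefilter), so an explicit set like this is mainly useful if you want the visited URLs for your own bookkeeping, or if you need to normalize URLs (strip fragments, query parameters, etc.) before comparing them.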