Scrapy crawl all sitemap links

Front-end · Unresolved · 2 answers · 784 views

一个人的身影 2021-01-07 09:57

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the urls in the sitemap.

2 Answers
  •  [愿得一人]
    2021-01-07 10:24

    Essentially you could create new Request objects for the URLs that the SitemapSpider yields and parse the responses with a new callback:

    import scrapy
    from scrapy.http import Request
    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

        def parse(self, response):
            # Called once for each URL found in the sitemap
            print(response.url)
            # Re-request the same URL with a different callback;
            # dont_filter=True keeps the dupe filter from dropping it
            return Request(response.url, callback=self.parse_sitemap_url,
                           dont_filter=True)

        def parse_sitemap_url(self, response):
            # do stuff with your sitemap links
            pass

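    For context, the URL extraction that SitemapSpider performs for you amounts to reading the `<loc>` entries of the sitemap XML. A minimal standard-library sketch of that step (the sample sitemap content here is hypothetical, using the xyz.nl domain from the question):

    ```python
    import xml.etree.ElementTree as ET

    # Namespace defined by the sitemaps.org protocol
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def extract_sitemap_urls(xml_text):
        """Return the <loc> URLs from a sitemap.xml document."""
        root = ET.fromstring(xml_text)
        return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS)]

    sample = """<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://www.xyz.nl/page-1</loc></url>
      <url><loc>http://www.xyz.nl/page-2</loc></url>
    </urlset>"""

    print(extract_sitemap_urls(sample))
    # ['http://www.xyz.nl/page-1', 'http://www.xyz.nl/page-2']
    ```

    In practice you rarely need this by hand; it only illustrates what the spider does before your `parse` callback ever runs.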
