Scrapy crawl all sitemap links

Front-end · Unresolved · 2 answers · 784 views

一个人的身影 2021-01-07 09:57

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the urls in the sitemap.

2 Answers
  •  [愿得一人]
    2021-01-07 10:24

    Essentially you could create new Request objects for the URLs that the SitemapSpider yields and parse the responses with a new callback:

    import scrapy
    from scrapy.http import Request
    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

        def parse(self, response):
            # Called once for each URL found in the sitemap
            print(response.url)
            # Re-request the same URL with a different callback;
            # dont_filter=True keeps the dupe filter from dropping it
            return Request(response.url, callback=self.parse_sitemap_url,
                           dont_filter=True)

        def parse_sitemap_url(self, response):
            # do stuff with your sitemap links
            pass

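    For context, the URL extraction that SitemapSpider performs for you amounts to reading the `<loc>` entries of the sitemap XML. A minimal standard-library sketch of that step (the sample sitemap content here is hypothetical, using the xyz.nl domain from the question):

    ```python
    import xml.etree.ElementTree as ET

    # Namespace defined by the sitemaps.org protocol
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def extract_sitemap_urls(xml_text):
        """Return the <loc> URLs from a sitemap.xml document."""
        root = ET.fromstring(xml_text)
        return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS)]

    sample = """<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://www.xyz.nl/page-1</loc></url>
      <url><loc>http://www.xyz.nl/page-2</loc></url>
    </urlset>"""

    print(extract_sitemap_urls(sample))
    # ['http://www.xyz.nl/page-1', 'http://www.xyz.nl/page-2']
    ```

    In practice you rarely need this by hand; it only illustrates what the spider does before your `parse` callback ever runs.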
