I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the urls in the sitemap. How can I crawl each of those links?
Essentially, you can create new Request objects for the urls that the SitemapSpider extracts from the sitemap, and parse the responses with a new callback:
from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        # called once for each url found in the sitemap
        print(response.url)
        # re-request the same url with a different callback;
        # dont_filter=True keeps the dupefilter from dropping it
        return Request(response.url, callback=self.parse_sitemap_url, dont_filter=True)

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass
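Alternatively, if you don't need the extra round trip through parse, SitemapSpider can route the sitemap entries straight to your callback via its sitemap_rules attribute. A minimal sketch, reusing the domain from the question (the spider name and callback name here are just placeholders):

from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = "xyz_sitemap"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
    # an empty pattern matches every url in the sitemap, so each
    # entry is downloaded and handed to parse_sitemap_url directly
    sitemap_rules = [("", "parse_sitemap_url")]

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass

With this approach each page is fetched only once: the spider downloads the sitemap, matches the entries against sitemap_rules, and sends the resulting responses to the named callback.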