How to parse a sitemap.xml file using Scrapy's XMLFeedSpider?

Posted by 女生的网名这么多〃 on 2019-12-03 20:40:38

Scrapy uses lxml / libxml2 under the hood, eventually invoking the node.xpath() method to perform the selection. Any elements in your xpath expression which are namespaced must be prefixed, and you must pass a mapping to tell the selector which namespace each prefix resolves to.

Here is an example to illustrate how to map prefixes to namespaces when using the node.xpath() method:

>>> import lxml.etree
>>> doc = '<root xmlns="chaos"><bar /></root>'
>>> tree = lxml.etree.fromstring(doc)
>>> tree.xpath('//bar')  # unprefixed name does not match the namespaced element
[]
>>> tree.xpath('//x:bar', namespaces={'x': 'chaos'})
[<Element {chaos}bar at 7fa40f9c50a8>]
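The same prefix mapping applies to a real sitemap, where every element sits in the sitemaps.org namespace. Here is a minimal sketch of that idea. Note that it swaps in the standard library's xml.etree.ElementTree in place of lxml so it runs without extra dependencies; the sample sitemap content, the `NS` mapping name, and the `extract_locs` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, made up for illustration.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

# Prefix-to-namespace mapping; the prefix 'sm' is arbitrary.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return the text of every <loc> under a <url> in the sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Every step of the path is namespaced, so every step needs the prefix.
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]


# Without a prefix the lookup finds nothing, as in the lxml session above.
unprefixed = ET.fromstring(SITEMAP.encode("utf-8")).findall("url")
locs = extract_locs(SITEMAP)
```

Here `unprefixed` comes back empty while `locs` holds the two URLs, which is the same behaviour the `//bar` versus `//x:bar` queries show.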

I haven't used Scrapy's XMLFeedSpider class myself, but I'm guessing your namespace map and itertag need to follow the same scheme:

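from scrapy.spiders import XMLFeedSpider

class SitemapSpider(XMLFeedSpider):
    # Map the 'sm' prefix to the sitemap namespace.
    namespaces = [
        ('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
    ]
    # Iterate over the namespaced <url> elements.
    itertag = 'sm:url'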

I found it helpful to understand the difference between hxs and xxs (the HTML and XML XPath selector objects). I had trouble locating the xxs object, so I was trying to build the selector by hand like this:

x = XmlXPathSelector(response)

But the ready-made objects worked far better for what I needed:

hxs.select('//p/text()').extract()

or

xxs.select('//title/text()').extract()