Parsing XML containing a default namespace to get an element value using lxml

谁说胖子不能爱 submitted on 2019-11-26 10:00:02

Question


I have an XML string like this:

str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex> """

I want to extract all the URLs inside the <loc> nodes, i.e. http://www.example.org/sitemap_1.xml.gz

I tried this code, but it didn't work:

from lxml import etree
root = etree.fromstring(str1)
urls = root.xpath("//loc/text()")
print(urls)
[]

I checked whether my root node was parsed correctly by serializing it, and I get back the same string as str1:

etree.tostring(root)

'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n<sitemap>\n<loc>http://www.example.org/sitemap_1.xml.gz</loc>\n<lastmod>2015-07-01</lastmod>\n</sitemap>\n</sitemapindex>'

Answer 1:


This is a common error when dealing with XML that has a default namespace. Your XML has a default namespace, that is, a namespace declared without a prefix, here:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

Note that it is not only the element declaring the default namespace that belongs to it: all descendant elements implicitly inherit the ancestor's default namespace unless specified otherwise (via an explicit namespace prefix, or a local default namespace declaration that points to a different namespace URI). In this case, that means all elements, including loc, are in the default namespace.
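
One way to see this inheritance is to print the fully qualified tag of every element. The following is a minimal sketch, assuming lxml is installed; the variable names `xml` and `tags` are only illustrative, and the sample document is a shortened version of the one in the question.

```python
from lxml import etree

# Shortened version of the sitemap document from the question.
xml = b'''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>http://www.example.org/sitemap_1.xml.gz</loc>
</sitemap>
</sitemapindex>'''

root = etree.fromstring(xml)

# lxml stores the default namespace under the key None.
print(root.nsmap)

# Each tag is reported as "{namespace-uri}localname", including the
# descendants that never declare a namespace themselves.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

Every tag printed carries the sitemap namespace URI in braces, so a plain name like loc does not identify any element in this document.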

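To select an element in a namespace, you need to define a prefix-to-namespace mapping and use that prefix in the XPath (an unprefixed name such as loc in XPath 1.0 only matches elements in no namespace, which is why //loc returned an empty list):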

from lxml import etree
str1 = '''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>'''
root = etree.fromstring(str1)

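# Map an arbitrary prefix ("d") to the default namespace URI
ns = {"d": "http://www.sitemaps.org/schemas/sitemap/0.9"}
url = root.xpath("//d:loc", namespaces=ns)[0]
# encoding="unicode" returns text rather than bytes on Python 3
print(etree.tostring(url, encoding="unicode"))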

Output:

<loc xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        http://www.example.org/sitemap_1.xml.gz
    </loc>
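
Since the question asks for the URL values rather than the serialized element, the same prefix mapping can be combined with text(). This is a sketch under the assumption that surrounding whitespace should be stripped; the names `urls` and `urls_alt` are illustrative, and the local-name() variant is an alternative that ignores namespaces entirely rather than something the answer itself proposes.

```python
from lxml import etree

str1 = '''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>'''
root = etree.fromstring(str1)

# The prefix "d" is arbitrary; only the URI has to match the document.
ns = {"d": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Select the text nodes of every <loc> and strip the indentation around them.
urls = [text.strip() for text in root.xpath("//d:loc/text()", namespaces=ns)]
print(urls)

# Alternative: match on the local name only, whatever the namespace is.
urls_alt = [text.strip() for text in root.xpath("//*[local-name()='loc']/text()")]
print(urls_alt)
```

The prefixed form is usually the safer choice, because local-name() would also match a loc element from an unrelated namespace.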


Source: https://stackoverflow.com/questions/31177707/parsing-xml-containing-default-namespace-to-get-an-element-value-using-lxml
