Question
How can I parse the XML below to find, for each GUIDE, its ID and URL, and then, for each PAGE inside that GUIDE, the page ID and any images that appear inside BOXES / BOX / ASSETS / ASSET / DESCRIPTION? The images are embedded as HTML, so I need to grab the src of each image.
<guide>
<id></id>
<url></url>
<group>
<id></id>
<type></type>
<name></name>
</group>
<pages>
<page>
<id></id>
<name></name>
<description></description>
<boxes>
<box>
<id></id>
<name></name>
<type></type>
<map_id></map_id>
<column></column>
<position></position>
<hidden></hidden>
<created></created>
<updated></updated>
<assets>
<asset>
<id></id>
<name></name>
<type></type>
<description></description>
<url/>
<owner>
<id></id>
<email></email>
<first_name></first_name>
<last_name></last_name>
</owner>
</asset>
</assets>
</box>
</boxes>
</page>
</pages>
</guide>
The code below gives me each page's ID and description, but what I actually need are the descriptions inside the asset elements, together with the guide and page they belong to.
from lxml import etree

tree = etree.parse('temp.xml')
for page in tree.xpath('.//page'):
    # Print the page ID and the page-level description (not the asset descriptions)
    print(page.xpath('id')[0].text, page.xpath('description')[0].text)
Answer 1:
Your real file probably needs code along these lines, but I can't verify that because I don't have your full XML.
>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text
...
('---', 'guide 1')
('------', 'page 1')
('---------', 'description')
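In the real export each description.text is an HTML string, so getting the image sources takes one more step. Here is a minimal sketch, assuming every description holds an HTML fragment. It uses the standard library's html.parser rather than lxml, and ImgSrcCollector and image_sources are names made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(description_html):
    """Return the image URLs found in one description string."""
    if not description_html:  # .text is None for an empty element
        return []
    collector = ImgSrcCollector()
    collector.feed(description_html)
    collector.close()
    return collector.sources


# Hypothetical description content; the URL is a placeholder.
sample = '<p>Map: <img src="http://example.com/map.png" alt="map"/></p>'
print(image_sources(sample))  # prints ['http://example.com/map.png']
```

Calling image_sources(description.text) inside the innermost loop would give the list of image URLs for that asset.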
I assumed that your XML would have multiple guide elements. This is what I parsed:
<guides>
  <guide>
    <id>guide 1</id>
    <url></url>
    <group>
      <id></id>
      <type></type>
      <name></name>
    </group>
    <pages>
      <page>
        <id>page 1</id>
        <name></name>
        <description></description>
        <boxes>
          <box>
            <id></id>
            <name></name>
            <type></type>
            <map_id></map_id>
            <column></column>
            <position></position>
            <hidden></hidden>
            <created></created>
            <updated></updated>
            <assets>
              <asset>
                <id></id>
                <name></name>
                <type></type>
                <description>description</description>
                <url/>
                <owner>
                  <id></id>
                  <email></email>
                  <first_name></first_name>
                  <last_name></last_name>
                </owner>
              </asset>
            </assets>
          </box>
        </boxes>
      </page>
    </pages>
  </guide>
</guides>
I made life easier for myself by indenting the XML so that I could discern its structure.
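Outside the interactive prompt, bare expressions like the ones in the loop are not echoed, so a script has to collect or print the values itself. The following is a rough sketch of the same traversal as a function, assuming the element names shown above. The inline XML is a trimmed copy with made-up values, and collect is a hypothetical helper name. Using iter("guide") means it should work whether the root is a single guide or a guides wrapper.

```python
from lxml import etree

# Trimmed copy of the XML above; the URL and description text are made up.
XML = b"""\
<guides>
  <guide>
    <id>guide 1</id>
    <url>http://example.com/guide1</url>
    <pages>
      <page>
        <id>page 1</id>
        <boxes><box><assets>
          <asset><description>first description</description></asset>
          <asset><description/></asset>
        </assets></box></boxes>
      </page>
    </pages>
  </guide>
</guides>
"""


def collect(root):
    """Return (guide id, guide url, page id, description) tuples."""
    rows = []
    # iter() also yields the root itself when the root is a <guide>.
    for guide in root.iter("guide"):
        guide_id = guide.findtext("id")
        guide_url = guide.findtext("url")
        for page in guide.xpath("pages/page"):
            page_id = page.findtext("id")
            for description in page.xpath(".//asset/description"):
                if description.text:  # skip empty descriptions
                    rows.append((guide_id, guide_url, page_id, description.text))
    return rows


rows = collect(etree.fromstring(XML))
for row in rows:
    print(row)
# prints ('guide 1', 'http://example.com/guide1', 'page 1', 'first description')
```

For a file on disk, collect(etree.parse('temp.xml').getroot()) would be the equivalent call.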
Source: https://stackoverflow.com/questions/47078480/how-to-find-all-guide-ids-and-pages-with-img-tags-in-xml-export-with-lxml-xpath