How to find all guide IDs and pages with IMG tags in XML export with lxml/xpath?

不打扰是莪最后的温柔 submitted on 2019-12-11 17:55:53

Question


How can I parse the XML below to find, for each GUIDE, its ID and URL, and then, for each PAGE inside the GUIDE, the page ID and any images that appear inside BOXES / BOX / ASSETS / ASSET / DESCRIPTION? The images are in HTML format, so I need to grab the source (src) of each image.

  <guide>
    <id></id>
   <url></url>
  <group>
   <id></id> 
<type></type>
<name></name>
   </group>
   <pages>
    <page>
 <id></id>
 <name></name>
 <description></description>
 <boxes>
  <box>
   <id></id>
   <name></name>
   <type></type>
   <map_id></map_id>
   <column></column>
   <position></position>
   <hidden></hidden>
   <created></created>
   <updated></updated>
   <assets>
    <asset>
     <id></id>
     <name></name>
     <type></type>
     <description></description>
     <url/>
     <owner>
      <id></id>
      <email></email>
      <first_name></first_name>
      <last_name></last_name>
     </owner>
    </asset>
      </assets>
     </box>
    </boxes>
   </page>
   </pages>
    </guide>

This gives me the pages with their IDs and descriptions, but what I need are the descriptions inside the asset elements, together with the guide and page they belong to.

from lxml import etree
tree = etree.parse('temp.xml')
for page in tree.xpath('.//page'):
    print(page.xpath('id')[0].text, page.xpath('description')[0].text)

Answer 1:


The pattern of the code is probably similar to the following, but I can't check this because I don't have your full XML.

>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text
... 
('---', 'guide 1')
('------', 'page 1')
('---------', 'description')

I assumed that your XML would have multiple guide elements. This is what I parsed:

<guides>
    <guide>
        <id>guide 1</id>
        <url></url>
        <group>
        <id></id> 
        <type></type>
        <name></name>
        </group>
        <pages>
            <page>
                <id>page 1</id>
                <name></name>
                <description></description>
                <boxes>
                    <box>
                        <id></id>
                        <name></name>
                        <type></type>
                        <map_id></map_id>
                        <column></column>
                        <position></position>
                        <hidden></hidden>
                        <created></created>
                        <updated></updated>
                        <assets>
                            <asset>
                                <id></id>
                                <name></name>
                                <type></type>
                                <description>description</description>
                                <url/>
                                <owner>
                                    <id></id>
                                    <email></email>
                                    <first_name></first_name>
                                    <last_name></last_name>
                                </owner>
                            </asset>
                        </assets>
                    </box>
                </boxes>
            </page>
        </pages>
    </guide>
</guides>

I made life easier for myself by indenting the xml so that I could discern its structure.
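
The loop above stops at the description text, while the question also asks for the image sources inside it. A minimal sketch of that last step follows, assuming each asset description holds an escaped HTML string (the question does not show the actual content, so this is a guess about the export format). The sample data and the helper name image_sources are hypothetical; lxml.html.fragment_fromstring is used to parse each description so that .//img/@src can be applied to it.

```python
from lxml import etree, html

# Hypothetical sample data. Storing the HTML as escaped text inside
# <description> is an assumption about the export, not confirmed by the question.
SAMPLE = b"""<guides>
  <guide>
    <id>guide 1</id>
    <url>http://example.com/guide1</url>
    <pages>
      <page>
        <id>page 1</id>
        <boxes><box><assets>
          <asset>
            <description>&lt;p&gt;Logo: &lt;img src="logo.png"&gt;&lt;/p&gt;</description>
          </asset>
          <asset>
            <description></description>
          </asset>
        </assets></box></boxes>
      </page>
    </pages>
  </guide>
</guides>"""


def image_sources(root):
    """Yield (guide id, guide url, page id, img src) for every image found."""
    # '//guide' also matches when the document root is itself a <guide>.
    for guide in root.xpath('//guide'):
        guide_id = guide.findtext('id')
        guide_url = guide.findtext('url')
        for page in guide.xpath('pages/page'):
            page_id = page.findtext('id')
            for description in page.xpath('boxes/box/assets/asset/description'):
                if not (description.text and description.text.strip()):
                    continue  # an empty description cannot contain images
                # Wrap the fragment in a <div> so plain text or several
                # top-level elements still parse as one tree.
                fragment = html.fragment_fromstring(description.text,
                                                    create_parent='div')
                for src in fragment.xpath('.//img/@src'):
                    yield guide_id, guide_url, page_id, str(src)


rows = list(image_sources(etree.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

With the sample above this prints a single tuple, ('guide 1', 'http://example.com/guide1', 'page 1', 'logo.png'). For a real export, etree.parse('temp.xml').getroot() could be passed in place of the parsed sample. If the descriptions turn out to contain real child elements rather than escaped text, the HTML parsing step would not be needed and .//description//img/@src on the XML tree might be enough.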



Source: https://stackoverflow.com/questions/47078480/how-to-find-all-guide-ids-and-pages-with-img-tags-in-xml-export-with-lxml-xpath
