How to find direct children of element in lxml

Submitted by 对着背影说爱祢 on 2019-12-01 08:29:15

I'm not sure, but it seems that your problem is in the HTML itself: note that there are a couple of tag omission cases applicable to p nodes, so the closing tags of the paragraphs

<div class='thread'>
    <p>first
        <p>second</p>
    </p>
</div>

are simply ignored by the parser, and both nodes are identified as siblings rather than as parent and child, e.g.

<div class='thread'>
    <p>first
    <p>second
</div>

So the XPath //div[@class="thread"]/p will return both paragraphs.

You can simply replace the p tags with div tags and you'll see different behaviour:

<div class='thread'>
    <div>first
        <div>second</div>
    </div>
</div>

Here //div[@class="thread"]/div will return the first node only.
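To illustrate the difference between direct children and all descendants, here is a minimal sketch. It assumes well-formed markup and uses the standard library's xml.etree.ElementTree as a stand-in, since its find/findall path syntax covers this case; the variable names and the wrapping body element are only there for the example. With lxml the same paths should work through find/findall or xpath().

```python
import xml.etree.ElementTree as ET

# Well-formed version of the nested-div example, wrapped in a
# hypothetical <body> root so the thread div is not the root itself.
markup = (
    "<body>"
    "<div class='thread'>"
    "<div>first<div>second</div></div>"
    "</div>"
    "</body>"
)

root = ET.fromstring(markup)
thread = root.find("./div[@class='thread']")

# "./div" matches only direct children of the thread div.
direct_children = thread.findall("./div")

# ".//div" matches every div descendant, at any depth.
all_descendants = thread.findall(".//div")

print(len(direct_children))  # 1
print(len(all_descendants))  # 2
```

The single slash is what restricts the match to one level; the double slash is what pulls in the nested node as well.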

Please correct me if my assumption is incorrect.

Try this XPath expression:

//p[parent::div[@class='thread']]

Or in a complete Python expression:

THREAD.xpath("//p[parent::div[@class='thread']]")

The other (inverse) approach is this XPath expression:

//div[@class='thread']/child::p

which uses the direct child:: axis and only selects the direct child nodes.

Summary:
Which of the two expressions is faster depends on the XPath implementation. child:: is the default axis and is used if no other axis is given.


FYI: XPath counting starts at 1 and not 0.
So concerning your XML example, the following expression

count(//div[@class='thread'][1]/child::p)

evaluates to 2: the result of counting <p> <!-- 1 --> + <p><!-- 2 --></p>.
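A rough way to check both points (the count of direct p children and the 1-based position) is sketched below. It assumes the two paragraphs have already ended up as siblings, and it uses the standard library's xml.etree.ElementTree, which has no count() function, so len() on the result list stands in for it. The sample markup and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two sibling paragraphs inside the thread div.
doc = ET.fromstring(
    "<body><div class='thread'><p>first</p><p>second</p></div></body>"
)

# Direct p children of the thread div; len() plays the role of count().
direct = doc.findall("./div[@class='thread']/p")
print(len(direct))  # 2

# Positional predicates are 1-based: p[1] is the first paragraph.
first = doc.find("./div[@class='thread']/p[1]")
print(first.text)  # first
```

With lxml, the count(...) expression itself can be passed to xpath(), which returns the number as a float.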

You can try PARENT.getchildren():

>>> from lxml import etree
>>> root = etree.fromstring(xml)
>>> root.xpath("//div[@class='thread']")[0].getchildren()
[<Element p at 0x10b3110e0>, <Element p at 0x10b311ea8>]
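Note that getchildren() is marked as deprecated in lxml, and the standard library's ElementTree removed it in Python 3.9; iterating over the element or calling list() on it gives the same direct children. A small sketch using the standard library, with made-up sample markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup with two sibling paragraphs.
xml = "<div class='thread'><p>first</p><p>second</p></div>"
parent = ET.fromstring(xml)

# Iterating an element yields its direct children only, in document order.
children = list(parent)

print([child.tag for child in children])   # ['p', 'p']
print([child.text for child in children])  # ['first', 'second']
```

The same list(parent) call should work unchanged on an lxml element.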