How to find direct children of element in lxml

Question

I found an element with a specific class:

THREAD = TREE.find_class('thread')[0]

Now I want to get all <p> elements that are its direct children.

I tried:

THREAD.findall("p")

THREAD.xpath("//div[@class='thread']/p")

But both of these return all <p> elements inside this <div>, whether or not that <div> is their immediate parent.

How can I make it work?

Edit:

Sample html:

<div class='thread'>
   <p> <!-- 1 -->
      <!-- Can be some others <p> objects inside, which should not be counted -->
   </p> 
   <p><!-- 2 --></p>
</div>
<div class='thread'>
   <p>[...]</p>
   <p>[...]</p>
</div>

The script should find the two <p> elements that are children of THREAD. I should receive a list of two objects, marked "1" and "2" in the comments in the sample HTML.

Edit 2:

Yet another clarification, since people get confused:

THREAD is an object stored in a variable; it can be any HTML element. I want to find the <p> elements that are direct children of THREAD. Those <p> elements cannot be outside THREAD, and they cannot be inside any other element that is itself inside THREAD.

Answer 1:

I'm not sure, but it seems that the problem is in the HTML itself: several tag omission rules apply to p elements (in particular, a new <p> start tag implicitly closes a <p> that is still open), so the closing paragraph tags in

<div class='thread'>
    <p>first
        <p>second</p>
    </p>
</div>

are simply ignored by the parser, and both nodes are treated as siblings rather than as parent and child, e.g.

<div class='thread'>
    <p>first
    <p>second
</div>

So the XPath //div[@class="thread"]/p will return both paragraphs.

If you replace the p tags with div tags, you will see different behaviour:

<div class='thread'>
    <div>first
        <div>second</div>
    </div>
</div>

Here //div[@class="thread"]/div will return only the first node.

Please correct me if my assumption is wrong.
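A minimal sketch for checking this assumption with lxml's HTML parser. The two markup strings are the examples from this answer collapsed onto one line; the helper name parent_tag_and_class is made up for illustration.

```python
import lxml.html

# Nested <p> markup as written in this answer.
nested_p = lxml.html.fromstring(
    "<div class='thread'><p>first<p>second</p></p></div>"
)
# The same structure built from <div>, whose end tag is not optional.
nested_div = lxml.html.fromstring(
    "<div class='thread'><div>first<div>second</div></div></div>"
)


def parent_tag_and_class(root, tag, text):
    """Return (tag, class) of the parent of the first `tag` element with this text."""
    for el in root.iter(tag):
        if (el.text or "").strip() == text:
            parent = el.getparent()
            return parent.tag, parent.get("class")
    return None


second_p_parent = parent_tag_and_class(nested_p, "p", "second")
second_div_parent = parent_tag_and_class(nested_div, "div", "second")
print(second_p_parent)
print(second_div_parent)
```

If the assumption holds, the parent of the "second" paragraph is the div with class thread rather than the "first" paragraph, while the "second" div keeps the class-less "first" div as its parent.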

Answer 2:

Try this XPath expression:

//p[parent::div[@class='thread']]

Or in a complete Python expression:

THREAD.xpath("//p[parent::div[@class='thread']]")

The other (inverse) approach is this XPath expression:

div[@class='thread']/child::p

which spells out the child:: axis explicitly and selects only direct child nodes.

Summary:
Which of the two expressions is faster depends on the XPath implementation. child:: is the default axis and is used when no other axis is given.

FYI: XPath positions start at 1, not 0.
So for your sample markup, the following expression

count(//div[@class='thread'][1]/child::p)

evaluates to 2, the result of counting the two <p> children of the first div.
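One thing worth checking when calling xpath() on an element: an expression that starts with // is evaluated from the document root, not from that element, so it can match paragraphs under other divs too. A relative path such as ./p stays under the element. The sketch below illustrates this; the markup and the names tree and thread are invented for the example.

```python
import lxml.html

HTML = """
<html><body>
<div class="thread">
  <p>one</p>
  <div><p>nested</p></div>
  <p>two</p>
</div>
<div class="thread">
  <p>three</p>
</div>
</body></html>
"""

tree = lxml.html.fromstring(HTML)
thread = tree.find_class("thread")[0]

# Relative path: only <p> elements whose parent is `thread`.
direct = thread.xpath("./p")

# findall() with a bare tag name also looks at direct children only.
via_findall = thread.findall("p")

# Absolute path: "//" starts at the document root, so the second
# div.thread contributes its <p> as well.
absolute = thread.xpath("//div[@class='thread']/p")

print([p.text for p in direct])
print([p.text for p in via_findall])
print([p.text for p in absolute])
```

Here the relative path and findall() both give the paragraphs "one" and "two", while the absolute path also picks up "three" from the second div. None of them returns the "nested" paragraph.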

Answer 3:

You can try PARENT.getchildren(), which returns all direct children of the element:

>>> root = etree.fromstring(xml)
>>> root.xpath("//div[@class='thread']")[0].getchildren()
[<Element p at 0x10b3110e0>, <Element p at 0x10b311ea8>]
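Note that getchildren() returns every direct child regardless of tag, and the lxml documentation marks it as deprecated in favour of list(element) or plain iteration. A small sketch of a tag-filtered alternative using iterchildren(); the XML string here is made up for illustration.

```python
from lxml import etree

XML = "<root><div class='thread'><p>1</p><span>x</span><p>2</p></div></root>"

root = etree.fromstring(XML)
thread = root.xpath("//div[@class='thread']")[0]

# Every direct child, whatever its tag (same result as getchildren()).
all_children = list(thread)

# Only the direct children whose tag is <p>.
p_children = list(thread.iterchildren("p"))

print([child.tag for child in all_children])
print([p.text for p in p_children])
```

With this input, the first list contains p, span and p, and the second contains only the two paragraphs.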

Source: https://stackoverflow.com/questions/48548296/how-to-find-direct-children-of-element-in-lxml

Tags

python

xpath

lxml