Question
I found an element with a specific class:
THREAD = TREE.find_class('thread')[0]
Now I want to get all <p> elements that are its direct children.
I tried:
THREAD.findall("p")
THREAD.xpath("//div[@class='thread']/p")
But both of those return all <p> elements inside this <div>, no matter whether that <div> is their closest parent or not.
How can I make it work?
Edit:
Sample HTML:
<div class='thread'>
<p> <!-- 1 -->
<!-- Can be some others <p> objects inside, which should not be counted -->
</p>
<p><!-- 2 --></p>
</div>
<div class='thread'>
<p>[...]</p>
<p>[...]</p>
</div>
The script should find the two <p> elements that are children of THREAD. I should receive a list of two objects, marked as "1" and "2" in the comments in the sample HTML.
Edit 2:
Yet another clarification, since people get confused:
THREAD is an object stored in a variable; it can be any HTML element. I want to find the <p> elements that are direct children of THREAD. Those <p> elements cannot be outside THREAD or inside any other element that is itself inside THREAD.
Answer 1:
I'm not sure, but it seems that your problem is in the HTML itself: several tag omission rules apply to p elements, so in markup like
<div class='thread'>
<p>first
<p>second</p>
</p>
</div>
the second <p> implicitly closes the first one, the leftover closing tags are simply ignored by the parser, and both nodes are identified as siblings rather than as parent and child, e.g. as if the markup were:
<div class='thread'>
<p>first
<p>second
</div>
So the XPath //div[@class="thread"]/p will return both paragraphs. If you replace the p tags with div tags, you'll see different behaviour:
<div class='thread'>
<div>first
<div>second</div>
</div>
</div>
Here //div[@class="thread"]/div will return the first node only.
Please correct me if my assumption is incorrect.
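The claim above can be checked with a short sketch. It assumes the markup is parsed with lxml's HTML parser (lxml.html); the variable names are made up for illustration. It only inspects parent/child relationships, since how stray closing tags are handled may vary between parser versions.

```python
from lxml import html

# The two markup variants discussed above (strings are illustrative)
nested_p = "<div class='thread'><p>first<p>second</p></p></div>"
nested_div = "<div class='thread'><div>first<div>second</div></div></div>"

p_thread = html.fromstring(nested_p)      # the outer <div class='thread'>
div_thread = html.fromstring(nested_div)  # the outer <div class='thread'>

# Parent tag of every <p> the parser produced
p_parents = [p.getparent().tag for p in p_thread.iter("p")]
# Any <p> that ended up nested inside another <p>
p_in_p = p_thread.xpath(".//p/p")

# With <div> instead of <p>, nesting is preserved
direct_divs = div_thread.xpath("./div")
nested_divs = div_thread.xpath("./div/div")

print(p_parents, p_in_p, len(direct_divs), len(nested_divs))
```

If the assumption holds, no <p> has another <p> as its parent, while the div variant keeps one direct child with one nested child.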
Answer 2:
Try this XPath expression:
//p[parent::div[@class='thread']]
Or in a complete Python expression:
THREAD.xpath("//p[parent::div[@class='thread']]")
The other (inverse) approach is this XPath expression:
div[@class='thread']/child::p
which uses the child:: axis and selects only the direct child nodes.
Summary:
Which of the two expressions is faster depends on the XPath compiler. child:: is the default axis and is used if no other axis is given.
FYI: XPath counting starts at 1, not 0.
So, for your sample HTML, the following expression
count(//div[@class='thread'][1]/child::p)
returns 2: the result of counting <p> <!-- 1 --> and <p><!-- 2 --></p>.
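One caveat worth checking: in lxml, a path that starts with // is evaluated against the whole document even when it is called on THREAD, so it can also match paragraphs of other thread divs. A path relative to the element, such as ./p, stays inside THREAD. The sketch below is a minimal illustration under assumptions: the markup is an invented variant of the sample, with a blockquote added to hold a non-direct paragraph.

```python
from lxml import html

# Illustrative markup: two thread divs, the first with one deeper <p>
page = html.fromstring("""
<html><body>
<div class='thread'>
  <p>1</p>
  <p>2</p>
  <blockquote><p>not a direct child</p></blockquote>
</div>
<div class='thread'>
  <p>other thread</p>
  <p>other thread</p>
</div>
</body></html>
""")

THREAD = page.find_class('thread')[0]

# Absolute path: searches the whole document, not just THREAD
absolute = THREAD.xpath("//p[parent::div[@class='thread']]")
# Relative path: only <p> elements that are direct children of THREAD
relative = THREAD.xpath("./p")
# findall with a bare tag name also looks at direct children only
found = THREAD.findall("p")
# XPath count() returns a float
count_first = page.xpath("count(//div[@class='thread'][1]/child::p)")

print(len(absolute), len(relative), len(found), count_first)
```

The relative form does not depend on THREAD being a div or having a particular class, which matches the requirement that THREAD can be any element.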
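Answer 3:
You can try PARENT.getchildren(), which returns the direct children of an element:
>>> from lxml import etree
>>> root = etree.fromstring(xml)  # xml: the markup as a string (not shown here)
>>> root.xpath("//div[@class='thread']")[0].getchildren()
[<Element p at 0x10b3110e0>, <Element p at 0x10b311ea8>]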
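Note that getchildren() returns every direct child regardless of tag, and the lxml documentation marks it as deprecated in favour of list(element) or plain iteration. A minimal sketch of filtering direct children by tag follows; the xml string and the variable names are assumptions made up for the example.

```python
from lxml import etree

# Illustrative well-formed markup with a non-<p> child mixed in
xml = "<root><div class='thread'><p>1</p><span>x</span><p>2</p></div></root>"

root = etree.fromstring(xml)
thread = root.xpath("//div[@class='thread']")[0]

# Iterating an element yields its direct children only
all_children = list(thread)
# Keep only the direct children whose tag is "p"
p_children = [child for child in thread if child.tag == "p"]

print([child.tag for child in all_children])
print([child.text for child in p_children])
```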
Source: https://stackoverflow.com/questions/48548296/how-to-find-direct-children-of-element-in-lxml