Question
I am trying to extract the sections of an article (Introduction, History, Overview, ...). I am looking for an XPath expression that selects every section that begins with a heading and contains at least one paragraph. Sections that contain only a list should be discarded.
For example:
<h2>Intro</h2>
<p> It has paragraph and should be extracted </p>
.....
<h2>References </h2>
<ul>...It has just list and should be discarded </ul>
<h2>...</h2>
....
If this is not possible in XPath, an XQuery solution would also work. I tried the following XQuery:
for $x in doc("test.xq")//h2
return
<section>{$x/following-sibling::*[preceding-sibling::h2[1] is $x]}</section>
It selects the sections as I want, but I couldn't work out how to add the condition (not only a ul) to it.
Answer 1:
You mention in another question that this is in BaseX, which supports the XQuery 3.0 group by mechanism, so how about this:
for $x in doc("test.xq")//h2/following-sibling::*[not(self::h2)]
group by $hId := generate-id($x/preceding-sibling::h2[1])
return
  if ($x[not(self::ul)]) then
    <section>{($x/preceding-sibling::h2[1], $x)}</section>
  else ()
Here I first select all the non-h2 elements that we want to gather together (there may be a more efficient way to do this, depending on the structure of your XML). The group by clause then means that on each "iteration" the $x variable is bound to the sequence of non-h2 elements between one h2 and the next. The if condition checks whether this group contains at least one element that is not a ul.
Answer 2:
Unfortunately, in this case I don't see a way to express the condition in XPath.
You should scan the tree instead. When you find an h2, start collecting a fragment. If you meet a p before the next h2, mark the fragment to be saved; otherwise drop it and start collecting again from that h2.
This can be done either by walking the DOM structure or by searching the text for <h and <p.
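The DOM-walking variant of that scan could look roughly like the Python sketch below. It is only an illustration, not part of the original answer: it assumes the markup is well-formed XML, that each h2 and its section content are direct children of one parent element, and the names extract_sections, section_titles and SAMPLE are made up for the example.

```python
import xml.etree.ElementTree as ET


def extract_sections(parent):
    """Return one list of elements per h2 section that contains at least one p."""
    sections = []
    current = None         # elements of the section being collected, h2 first
    has_paragraph = False  # whether a p has been seen in the current section

    for child in parent:
        if child.tag == "h2":
            # A new heading closes the previous section: keep it only if it had a p.
            if current is not None and has_paragraph:
                sections.append(current)
            current = [child]
            has_paragraph = False
        elif current is not None:
            current.append(child)
            if child.tag == "p":
                has_paragraph = True

    # The last section is not followed by an h2, so check it after the loop.
    if current is not None and has_paragraph:
        sections.append(current)
    return sections


# Hypothetical sample input, modelled on the example in the question.
SAMPLE = """<body>
  <h2>Intro</h2>
  <p>It has a paragraph and should be extracted</p>
  <h2>References</h2>
  <ul><li>It has just a list and should be discarded</li></ul>
  <h2>History</h2>
  <ul><li>a list</li></ul>
  <p>and a paragraph</p>
</body>"""


def section_titles(xml_text):
    """Parse xml_text and return the heading text of every kept section."""
    root = ET.fromstring(xml_text)
    return [section[0].text.strip() for section in extract_sections(root)]
```

Note that this keeps a section only when it contains a p, which follows the wording of the question; the XQuery in Answer 1 keeps any section with at least one non-ul element, so the two can differ on sections that hold, say, only a table.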
Source: https://stackoverflow.com/questions/30710968/xpath-or-xquery-to-exclude-article-sections-which-only-contains-lists