XPath or XQuery to exclude article sections which only contain lists

Submitted by ぃ、小莉子 on 2019-12-22 17:47:31

Question


I am trying to extract the sections of an article (Introduction, History, Overview, ...). I am looking for an XPath expression that selects every section that begins with a heading and contains at least one paragraph. Sections that contain only a list should be discarded.

For example:

<h2>Intro</h2>
<p> It has paragraph and should be extracted </p>
.....
<h2>References </h2>
<ul>...It has just list and should be discarded </ul>
<h2>...</h2>
....

If this is not possible in XPath, XQuery would also work. I tried the following XQuery:

for $x in doc("test.xq")//h2
return
   <section>{$x/following-sibling::*[preceding-sibling::h2[1] is $x]}</section>

It selects the sections as I want, but I could not work out how to add the condition that a section must not consist only of ul elements.


Answer 1:


You mention in another question that this is in BaseX, which supports the XQuery 3.0 group by mechanism, so how about this:

for $x in doc("test.xq")//h2/following-sibling::*[not(self::h2)]
group by $hId := generate-id($x/preceding-sibling::h2[1])
return
  if ($x[not(self::ul)]) then
    <section>{($x/preceding-sibling::h2[1], $x)}</section>
  else ()

Here I first find all the non-h2 elements that we want to gather together (there may be a more efficient way to do this, depending on the structure of your XML). The group by then means that on each "iteration" the $x variable is the sequence of non-h2 elements between one h2 and the next. The if condition checks whether at least one element in this group is not a ul.




Answer 2:


Unfortunately, there is no condition that lets you build an XPath for this case.

You should scan the tree instead. When you find an h2, start collecting a fragment. If you meet a p before the next h2, mark the fragment to be saved; otherwise drop it and start collecting again from that h2.

This can be done either on the DOM structure or by searching the text for <h and <p.
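
A rough sketch of the DOM-based scan described above, in Python with the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML and that each h2 and its content are siblings under one parent. The function name extract_sections and the sample document (including the "History" section) are made up for illustration and are not from the question.

```python
import xml.etree.ElementTree as ET


def extract_sections(container):
    """Return the kept sections; each section is a list [h2, sibling, ...].

    Hypothetical helper: scans the direct children of `container` once.
    """
    sections = []
    current = None  # elements collected since the last h2, or None before any h2
    keep = False    # becomes True once the current fragment has seen a <p>

    for child in container:
        if child.tag == "h2":
            # A new heading closes the previous fragment.
            if current is not None and keep:
                sections.append(current)
            current = [child]
            keep = False
        elif current is not None:
            current.append(child)
            if child.tag == "p":
                keep = True

    # The last fragment has no following h2 to close it.
    if current is not None and keep:
        sections.append(current)
    return sections


# Made-up sample input, loosely following the example in the question.
SAMPLE = """<body>
<h2>Intro</h2>
<p>It has a paragraph and should be extracted</p>
<h2>References</h2>
<ul><li>It has just a list and should be discarded</li></ul>
<h2>History</h2>
<ul><li>a list</li></ul>
<p>but also a paragraph</p>
</body>"""

root = ET.fromstring(SAMPLE)
titles = [section[0].text for section in extract_sections(root)]
print(titles)  # ['Intro', 'History']
```

Note that this keeps a section only if it contains a p, as this answer describes. The group by query in Answer 1 is slightly more permissive: it keeps a section if it contains any element that is not a ul.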



Source: https://stackoverflow.com/questions/30710968/xpath-or-xquery-to-exclude-article-sections-which-only-contains-lists
