XPath to first occurrence of element with text length >= 200 characters

Submitted by 南笙酒味 on 2019-12-06 03:40:18

Question


How do I get the first element that has an inner text (plain text, discarding other children) of 200 or more characters in length?

I'm trying to create an HTML parser like Embed.ly, and I've set up a chain of fallbacks: I first check for og:description, then search for this occurrence, and only then fall back to the description meta tag.

This is because most sites that include a meta description at all use that tag to describe the site as a whole, not the contents of the current page.

Example:

<html>
    <body>
        <div>some characters
            <p>200 characters <span>some more stuff</span></p>
        </div>
    </body>
</html>

What selector could I use to get the "200 characters" portion of that HTML fragment? I don't want the "some more stuff" part either, and I don't care what element it is (except for <script> or <style>), as long as it's the first plain text that contains at least 200 characters.

What should the XPath query look like?


Answer 1:


Use:

(//*[not(self::script or self::style)]/text()[string-length() >= 200])[1]

Note: If the document is an XHTML document (which means all elements are in the XHTML namespace), the above expression should be written as:

(//*[not(self::x:script or self::x:style)]/text()[string-length() >= 200])[1]

where the prefix "x:" must be bound to the XHTML namespace, "http://www.w3.org/1999/xhtml" (many XPath APIs describe this as "registering" the namespace with that prefix).
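To make the semantics of that expression concrete, here is a minimal Python sketch that emulates it by walking the tree with the standard library's xml.etree.ElementTree, whose XPath subset does not support string-length(). The helper names (local_name, text_nodes, first_long_text) and the sample document are assumptions made for illustration; they are not part of the original answer. Unlike the namespaced expression above, the sketch compares local names only, so it skips script and style elements in any namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SKIPPED = {"script", "style"}

def local_name(tag):
    # ElementTree spells namespaced tags as "{namespace-uri}local".
    return tag.rsplit("}", 1)[-1]

def text_nodes(elem):
    """Yield the text-node children of every element in document order,
    leaving out text whose parent is a script or style element."""
    keep = local_name(elem.tag) not in SKIPPED
    if keep and elem.text:
        yield elem.text          # text before the first child
    for child in elem:
        yield from text_nodes(child)
        if keep and child.tail:
            yield child.tail     # text between/after children belongs to elem

def first_long_text(root, min_length=200):
    # Counterpart of (...)[1]: the first match in document order, or None.
    return next((t for t in text_nodes(root) if len(t) >= min_length), None)

# Hypothetical XHTML sample modelled on the question's fragment.
doc = ET.fromstring(
    '<html xmlns="' + XHTML_NS + '"><body>'
    "<script>" + "x" * 300 + "</script>"
    "<div>some characters<p>" + "a" * 200
    + "<span>some more stuff</span></p></div>"
    "</body></html>"
)
result = first_long_text(doc)
```

Here result is the 200-character text directly inside <p>: the longer <script> content is skipped, and the <span> text is a separate, shorter text node.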




Answer 2:


I meant something like this:

root.SelectNodes("html/body/.//*[(name() != 'script') and (name() != 'style')]/text()[string-length() >= 200]")

Seems to work pretty well.




Answer 3:


HTML is not XML. You should not use XML parsers to parse HTML, period. They are two entirely different things, and your parser will choke the first time it sees HTML that is not well-formed XML.

You should find an open-source HTML parser instead of rolling your own.
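As one possible illustration of that advice, the sketch below uses Python's built-in html.parser, which tolerates markup that is not well-formed XML. The class name FirstLongText, the function first_long_text_html, and the sample markup are assumptions made for this example; the answer itself does not name a specific parser. Treat it as a rough sketch: html.parser can deliver one run of text in several pieces in some edge cases, such as a stray "<" inside the text.

```python
from html.parser import HTMLParser

class FirstLongText(HTMLParser):
    """Remember the first text run of at least min_length characters
    that is not inside a <script> or <style> element."""

    def __init__(self, min_length=200):
        super().__init__(convert_charrefs=True)
        self.min_length = min_length
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if (self.result is None and not self.skip_depth
                and len(text) >= self.min_length):
            self.result = text

def first_long_text_html(html, min_length=200):
    parser = FirstLongText(min_length)
    parser.feed(html)
    parser.close()
    return parser.result

# Deliberately sloppy markup: unquoted attribute, unclosed <p> and <span>.
sample = (
    '<script>var s = "' + "x" * 300 + '";</script>'
    "<div class=a>some characters<p>" + "a" * 200
    + "<span>some more stuff</div>"
)
found = first_long_text_html(sample)
```

Here found is the 200-character run that follows <p>, even though the sample would be rejected by an XML parser.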



Source: https://stackoverflow.com/questions/9576505/xpath-to-first-occurrence-of-element-with-text-length-200-characters
