Extract text in HTML comment using XPath and regex

你离开我真会死。 Submitted on 2019-12-06 04:24:30

Question


I'm using an XML/HTML parser to process HTML files that contain hidden, commented-out text for translation, namely X and Y below.

<!-- Title: “ X ” Tags: “ Y ” -->

Which XPath expression would best match X and Y? The //comment() expression matches the whole comment node, but I need to match the two occurrences of text between the quotes.

I guess one would need a combination of XPath and regular expressions to do that, but I'm not sure how to tackle it.


Answer 1:


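I assume that the quotes in the comment are all the same regular quote character ("), not the typographically distinct opening and closing quotes that appear when this question is displayed.

In case this assumption is wrong, replace the standard quote in the expressions below with the respective quote: the opening quote in each substring-after() call and the closing quote in substring-before(). The second expression then needs only two nested substring-after() calls, because the closing quotes no longer act as separators.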


Use (if the comment in question is the first one in the document):

substring-before(substring-after(//comment(), '"'), '"')

This produces the string (without the quotes):

" X "

And for the second string in quotes use:

substring-before(
   substring-after(
        substring-after(
               substring-after(//comment(), '"'), 
               '"'), 
        '"'), 
   '"')
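
To see why the nesting picks out the second quoted string, the two XPath 1.0 string functions can be imitated outside an XPath engine. The following is a minimal Python sketch, not part of the original answer: the helper names substring_after and substring_before are hypothetical stand-ins for the XPath functions, the comment text is hard-coded rather than selected with //comment(), and a non-empty separator is assumed.

```python
def substring_after(s, sep):
    """Imitate XPath 1.0 substring-after(): text after the first sep, or '' if sep is absent."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


def substring_before(s, sep):
    """Imitate XPath 1.0 substring-before(): text before the first sep, or '' if sep is absent."""
    head, found, _ = s.partition(sep)
    return head if found else ""


# String value of the comment node from the example (assumed, hard-coded here).
comment = ' Title: " X " Tags: " Y " '

# First quoted string: skip past quote 1, then cut at quote 2.
first = substring_before(substring_after(comment, '"'), '"')

# Second quoted string: skip past quotes 1, 2 and 3, then cut at quote 4.
second = substring_before(
    substring_after(substring_after(substring_after(comment, '"'), '"'), '"'),
    '"',
)

print(repr(first))   # ' X '
print(repr(second))  # ' Y '
```

Each substring_after() call consumes everything up to and including one quote, so three calls leave the text that starts just inside the second pair of quotes.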

XSLT-based verification (because an XSLT stylesheet must be a well-formed XML document, the quotes in the expressions are replaced with the entity &quot; to avoid errors due to nested quotes):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     "<xsl:copy-of select="substring-before(substring-after(//comment(), '&quot;'), '&quot;')"/>"
=============
   "<xsl:copy-of select=
   "substring-before(substring-after(substring-after(substring-after(//comment(), '&quot;'), '&quot;'), '&quot;'), '&quot;')"/>"
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied against this XML document:

<html>
  <body>
    Hello.
<!-- Title: " X " Tags: " Y " -->
  </body>
</html>

the two XPath expressions are evaluated and the results of these two evaluations are copied to the output (surrounded by quotes to show the exact strings copied):

     " X "
=============
   " Y "

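The question also mentions regular expressions. XPath 1.0 itself has no regex functions, so one alternative is to collect the comment text with a parser and apply the regex in the host language. The sketch below is only an illustration under stated assumptions: it uses Python's standard-library html.parser instead of an XPath engine, the names CommentCollector, QUOTED and quoted_strings are invented for this example, and the second sample comment (with typographic quotes) is added here to show both quote styles.

```python
import re
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment, in document order."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


# Text between straight quotes, or between typographic opening/closing quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def quoted_strings(html):
    """Return every quoted string found inside HTML comments, whitespace-trimmed."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    results = []
    for comment in collector.comments:
        # findall yields (straight, curly) tuples; exactly one group is non-empty per match.
        for straight, curly in QUOTED.findall(comment):
            results.append((straight or curly).strip())
    return results


sample = """<html>
  <body>
    Hello.
<!-- Title: " X " Tags: " Y " -->
<!-- Title: “ X2 ” Tags: “ Y2 ” -->
  </body>
</html>"""

print(quoted_strings(sample))  # ['X', 'Y', 'X2', 'Y2']
```

Unlike the nested substring functions, this version does not depend on how many quoted strings a comment contains, and it trims the surrounding spaces; drop the strip() call to keep them.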

Source: https://stackoverflow.com/questions/12859155/extract-text-in-html-comment-using-xpath-and-regex
