Extract text in HTML comment using XPath and regex

Question

I'm using an XML/HTML parser to process HTML files that contain text hidden in comments, which I need to extract for translation, namely X and Y below.

<!-- Title: “ X ” Tags: “ Y ” -->

Which XPath expression would best match X and Y? The //comment() expression matches the whole comment node, but I need to match the two occurrences of text between the “ and ” quotes.

I guess one would need a combination of XPath and regular expressions to do that, but I'm not sure how to approach it.

Answer 1:

I assume that both quotes in the comment are the same regular quote character ("), not the typographically distinct opening and closing quotes that appear when this question is displayed.

If this assumption is wrong, simply replace the standard quote in the expressions below with the respective opening or closing quote.
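
When it is not known in advance which quote style a comment uses, one option is to step outside XPath for the matching part. The following is a minimal sketch under stated assumptions, not part of the original answer: it uses Python's standard-library html.parser to collect comment text (standing in for //comment(), since this sketch does not use an XPath engine) and a regular expression that accepts either straight or typographic quotes. The names QUOTED, CommentCollector and quoted_strings are illustrative only. Note that, unlike the XPath expressions in this answer, the pattern trims the whitespace around each quoted string.

```python
import re
from html.parser import HTMLParser

# Opening quote (straight or U+201C), lazily captured text, closing quote
# (straight or U+201D). Whitespace around the captured text is dropped.
QUOTED = re.compile(r'["\u201c]\s*(.*?)\s*["\u201d]')


class CommentCollector(HTMLParser):
    """Hypothetical helper: gathers the text of every HTML comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the comment body without the <!-- and --> delimiters
        self.comments.append(data)


def quoted_strings(html):
    """Return one list of quoted strings per comment found in html."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return [QUOTED.findall(comment) for comment in collector.comments]


sample = '<html><body>Hello.<!-- Title: " X " Tags: " Y " --></body></html>'
print(quoted_strings(sample))  # [['X', 'Y']]
```

The same function handles the typographic form of the comment, because both quote styles are listed in the character classes of the pattern.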

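Use the following XPath 1.0 expression (it works if the comment in question is the first one in the document, because a string function applied to a node-set uses only the first node in document order):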

substring-before(substring-after(//comment(), '"'), '"')

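This produces the following string (shown here in quotes only to make the surrounding spaces visible; the quotes themselves are not part of the result):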

" X "

And for the second string in quotes use:

substring-before(
   substring-after(
        substring-after(
               substring-after(//comment(), '"'), 
               '"'), 
        '"'), 
   '"')
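
To make the nesting easier to follow, here is a small sketch that mimics the two XPath 1.0 string functions in Python and applies them in the same order. The helper names substring_after and substring_before are hypothetical stand-ins written for this illustration, and the comment text is the one from the question with straight quotes assumed. Each substring_after call discards everything up to and including the next quote, so three calls are needed to reach the text after the third quote.

```python
def substring_after(s, sep):
    """Text after the first occurrence of sep, or '' if sep is absent."""
    if sep == "":
        return s  # XPath returns the whole string for an empty separator
    _, found, tail = s.partition(sep)
    return tail if found else ""


def substring_before(s, sep):
    """Text before the first occurrence of sep, or '' if sep is absent."""
    if sep == "":
        return ""
    head, found, _ = s.partition(sep)
    return head if found else ""


comment = ' Title: " X " Tags: " Y " '

# First quoted string: skip past quote 1, stop at quote 2.
first = substring_before(substring_after(comment, '"'), '"')

# Second quoted string: skip past quotes 1, 2 and 3, stop at quote 4.
step1 = substring_after(comment, '"')   # ' X " Tags: " Y " '
step2 = substring_after(step1, '"')     # ' Tags: " Y " '
step3 = substring_after(step2, '"')     # ' Y " '
second = substring_before(step3, '"')

print(repr(first), repr(second))  # ' X ' ' Y '
```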

XSLT-based verification (because an XSLT stylesheet must be a well-formed XML document, the quotes in the expressions are replaced with the entity &quot; to avoid errors due to nested quotes):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     "<xsl:copy-of select="substring-before(substring-after(//comment(), '&quot;'), '&quot;')"/>"
=============
   "<xsl:copy-of select=
   "substring-before(substring-after(substring-after(substring-after(//comment(), '&quot;'), '&quot;'), '&quot;'), '&quot;')"/>"
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied against this XML document:

<html>
  <body>
    Hello.
<!-- Title: " X " Tags: " Y " -->
  </body>
</html>

the two XPath expressions are evaluated and the results of these two evaluations are copied to the output (surrounded by quotes to show the exact strings copied):

     " X "
=============
   " Y "

Source: https://stackoverflow.com/questions/12859155/extract-text-in-html-comment-using-xpath-and-regex