Finding mixed XML content with regular expressions

Question

In my XSLT 2.0 stylesheet, I need to find certain text (e.g., "story 3.1", "story 8.19", "story 21.76") and do something with it (e.g., wrap it in a hyperlink). Finding these instances and processing them was simple. The problem is that sometimes mixed content needs to be wrapped in the hyperlink (e.g., "story 3.1a", where the "a" sits in its own element), and I haven't been able to figure out how to do that.

Here is some sample data and my template:

<p>Jack goes up the hill (story 3.1<i>a</i>) to fetch a pail of water.</p>

<xsl:template match="text()">
<xsl:variable name="content" as="xs:string" select="."/>
<xsl:analyze-string select="$content" regex="Story [0-9]*\.[0-9]*" flags="i">
  <xsl:matching-substring>
    <xsl:variable name="figureToTargetId">
      <xsl:analyze-string select="." regex="[0-9]*\.[0-9]*">
        <xsl:matching-substring>
          <xsl:value-of select="concat('s',.)"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <a href="#{$figureToTargetId}"><xsl:value-of select="."/></a>        
  </xsl:matching-substring>
  <xsl:non-matching-substring><xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>

In the above case I'd want "story 3.1a" to be wrapped in the hyperlink.

I know that, ideally, I'd have to match on something other than text() to get around this, but I'm not sure what.

One approach I've been exploring is looping through the text nodes with xsl:for-each and testing whether the next text node is exactly one alphabetic character long. If it is, I wrap it in the same hyperlink as the previous text node. (For various reasons, I know that any text node one alphabetic character long that follows a text node matching the above regex should be hyperlinked to the same target.) But I'm hoping there is a more elegant solution.

Answer 1:

This transformation:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>


 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Text nodes in p that contain a story reference (case-insensitive) -->
 <xsl:template match="p/text()[matches(., 'Story [0-9]+\.[0-9]+', 'i')]">
    <xsl:variable name="vCur" select="."/>
    <xsl:variable name="pContent" select="string(.)"/>
    <xsl:analyze-string select="$pContent" regex="Story [0-9]+\.[0-9]+" flags="i">
      <xsl:matching-substring>
        <!-- Build the target id from the numeric part, e.g. 's3.1' -->
        <xsl:variable name="figureToTargetId">
          <xsl:analyze-string select="." regex="[0-9]+\.[0-9]+">
            <xsl:matching-substring>
              <xsl:value-of select="concat('s',.)"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <a href="#{$figureToTargetId}">
         <xsl:value-of select="."/>
         <!-- If the text node ends with a reference, pull the element right after it into the link -->
         <xsl:if test="matches($vCur, 'Story [0-9]+\.[0-9]+$', 'i')">
          <xsl:sequence select="$vCur/following-sibling::node()[1][self::*]"/>
         </xsl:if>
        </a>
      </xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
 </xsl:template>
 <!-- Suppress that element at its original position; it was already copied into the link -->
 <xsl:template match=
  "p/*[preceding-sibling::node()[1]
         [self::text()
        and
          matches(., 'Story [0-9]+\.[0-9]+$', 'i')]
         ]"/>
</xsl:stylesheet>

when applied to this document (the provided sample, extended to cover both cases of interest):

<t>
    <p>Little Red Riding Hood (Story 3.1) </p>
    <p>Jack goes up the hill (Story 3.1<i>a</i>) to fetch a pail of water.</p>
</t>

produces the wanted result:

<t>
      <p>Little Red Riding Hood (<a href="#s3.1">Story 3.1</a>) </p>
      <p>Jack goes up the hill (<a href="#s3.1">Story 3.1<i>a</i>
      </a>) to fetch a pail of water.</p>
</t>

Explanation:

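We check whether the text node ends with the matched reference. If it does, we also copy the first following sibling element into the link. The empty template at the end suppresses that element at its original position, so it is not output twice.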
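
The suffix check can also be written per match with ends-with(), and the target id can be taken from a capture group with regex-group(1) instead of a nested xsl:analyze-string. The following is only a sketch of that variant, not the answerer's code. It assumes a non-schema-aware XSLT 2.0 processor, reuses the variable name $vCur from the stylesheet above, and matches "story" case-insensitively so that the lowercase sample from the question is covered as well.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>

  <!-- Identity template: copy everything not handled below. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes in p that contain a story reference. -->
  <xsl:template match="p/text()[matches(., 'story [0-9]+\.[0-9]+', 'i')]">
    <xsl:variable name="vCur" select="."/>
    <xsl:analyze-string select="string(.)"
                        regex="story ([0-9]+\.[0-9]+)" flags="i">
      <xsl:matching-substring>
        <!-- regex-group(1) is the numeric part, e.g. 3.1 -->
        <a href="#s{regex-group(1)}">
          <xsl:value-of select="."/>
          <!-- Suffix test: this match ends the text node, so the
               element immediately after it belongs inside the link. -->
          <xsl:if test="ends-with($vCur, .)">
            <xsl:copy-of
                select="$vCur/following-sibling::node()[1][self::*]"/>
          </xsl:if>
        </a>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Suppress the element at its original position, since the
       template above already copied it into the link. -->
  <xsl:template match="p/*[preceding-sibling::node()[1]
                             [self::text()]
                             [matches(., 'story [0-9]+\.[0-9]+$', 'i')]]"/>
</xsl:stylesheet>
```

One caveat with this sketch: if the identical reference text occurs twice in the same text node, once in the middle and once at the very end, ends-with() is true for both matches and the element would be copied into both links.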

Update:

In a comment, the OP added a requirement: also change <i> to <em>.

This requires only a slight update to the above solution:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>


 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Text nodes in p that contain a story reference (case-insensitive) -->
 <xsl:template match="p/text()[matches(., 'Story [0-9]+\.[0-9]+', 'i')]">
    <xsl:variable name="vCur" select="."/>
    <xsl:variable name="pContent" select="string(.)"/>
    <xsl:analyze-string select="$pContent" regex="Story [0-9]+\.[0-9]+" flags="i">
      <xsl:matching-substring>
        <!-- Build the target id from the numeric part, e.g. 's3.1' -->
        <xsl:variable name="figureToTargetId">
          <xsl:analyze-string select="." regex="[0-9]+\.[0-9]+">
            <xsl:matching-substring>
              <xsl:value-of select="concat('s',.)"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <a href="#{$figureToTargetId}">
         <xsl:value-of select="."/>
         <!-- If the text node ends with a reference, process the element right after it in mode "match" -->
         <xsl:if test="matches($vCur, 'Story [0-9]+\.[0-9]+$', 'i')">
          <xsl:apply-templates mode="match" select="$vCur/following-sibling::node()[1][self::*]"/>
         </xsl:if>
        </a>
      </xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
 </xsl:template>
 <!-- Suppress that element at its original position; it was already output inside the link -->
 <xsl:template match=
  "p/*[preceding-sibling::node()[1]
         [self::text()
        and
          matches(., 'Story [0-9]+\.[0-9]+$', 'i')]
         ]"/>
 <!-- In mode "match", an i element that follows a reference is output as em -->
 <xsl:template mode="match" match=
  "p/i[preceding-sibling::node()[1]
         [self::text()
        and
          matches(., 'Story [0-9]+\.[0-9]+$', 'i')]
         ]">
  <em><xsl:apply-templates/></em>
 </xsl:template>

</xsl:stylesheet>

Source: https://stackoverflow.com/questions/11923802/finding-mixed-xml-content-with-regular-expressions

Tags

xml

xslt

xslt-2.0