Finding mixed XML content with regular expressions

Submitted by 拟墨画扇 on 2019-12-24 07:19:29

Question


While running my XSLT 2.0 stylesheet, I need to find certain text (e.g., "story 3.1", "story 8.19", "story 21.76") and do something with it (e.g., wrap it in a hyperlink). Finding these instances and processing them was simple. The problem is that the content to be wrapped in the hyperlink is sometimes mixed (e.g., "story 3.1<i>a</i>"), and I haven't been able to figure out how to handle that.

Here is some sample data and my template:

<p>Jack goes up the hill (story 3.1<i>a</i>) to fetch a pail of water.</p>

<xsl:template match="text()">
  <xsl:variable name="content" as="xs:string" select="."/>
  <!-- Find story references such as "story 3.1", ignoring case -->
  <xsl:analyze-string select="$content" regex="Story [0-9]+\.[0-9]+" flags="i">
    <xsl:matching-substring>
      <!-- Build the link target from the number, e.g. "s3.1" -->
      <xsl:variable name="figureToTargetId">
        <xsl:analyze-string select="." regex="[0-9]+\.[0-9]+">
          <xsl:matching-substring>
            <xsl:value-of select="concat('s',.)"/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:variable>
      <a href="#{$figureToTargetId}"><xsl:value-of select="."/></a>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

In the above case I'd want "story 3.1<i>a</i>" to be wrapped in the hyperlink.

I know that, ideally, I'd have to match on something other than text() to get around this, but I'm not sure what.

One approach I've been exploring is looping through the text nodes with xsl:for-each and testing whether the next text node is exactly one alphabetic character long. If it is, I wrap it in the same hyperlink as the previous text node. (For various reasons, I know that any single-letter text node following a text node that matches the above regex should be hyperlinked to the same target.) But I'm hoping there is a more elegant solution.


Answer 1:


This transformation:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>


 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Text nodes in a p that contain a story reference (case-insensitive) -->
 <xsl:template match="p/text()[matches(., 'Story [0-9]+\.[0-9]+', 'i')]">
    <xsl:variable name="vCur" select="."/>
    <xsl:variable name="pContent" select="string(.)"/>
    <xsl:analyze-string select="$pContent" regex="Story [0-9]+\.[0-9]+" flags="i">
      <xsl:matching-substring>
        <xsl:variable name="figureToTargetId">
          <xsl:analyze-string select="." regex="[0-9]+\.[0-9]+">
            <xsl:matching-substring>
              <xsl:value-of select="concat('s',.)"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <a href="#{$figureToTargetId}">
         <xsl:value-of select="."/>
         <!-- position() = last() holds only when this match ends the text node;
              in that case pull the immediately following element into the link -->
         <xsl:if test="position() = last()">
          <xsl:sequence select="$vCur/following-sibling::node()[1][self::*]"/>
         </xsl:if>
        </a>
      </xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
 </xsl:template>

 <!-- Suppress that element at its original position so it is not output twice -->
 <xsl:template match=
  "p/*[preceding-sibling::node()[1]
         [self::text()
        and
          matches(., 'Story [0-9]+\.[0-9]+$', 'i')]
         ]"/>
</xsl:stylesheet>

when applied to this document (the provided one, extended to contain both interesting cases):

<t>
    <p>Little Red Riding Hood (Story 3.1) </p>
    <p>Jack goes up the hill (Story 3.1<i>a</i>) to fetch a pail of water.</p>
</t>

produces the wanted, correct result:

<t>
      <p>Little Red Riding Hood (<a href="#s3.1">Story 3.1</a>) </p>
      <p>Jack goes up the hill (<a href="#s3.1">Story 3.1<i>a</i>
      </a>) to fetch a pail of water.</p>
</t>

Explanation:

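We check whether the matched substring is a suffix of the current text node. If it is, we also copy the first following sibling element into the link. A separate empty template suppresses that element at its original position, so it is not output twice.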
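The suffix check can be tried in isolation. The following is a minimal sketch, not part of the original answer: a hypothetical demo stylesheet that only labels each story reference, relying on the fact that inside xsl:analyze-string the context position and size count matching and non-matching substrings together, so position() = last() in a matching substring means nothing follows the match in that text node.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <!-- Demo only: label each story reference found in a p text node -->
 <xsl:template match="p/text()">
  <xsl:analyze-string select="." regex="story [0-9]+\.[0-9]+" flags="i">
   <xsl:matching-substring>
    <xsl:value-of select="."/>
    <!-- position()/last() count matching and non-matching substrings together,
         so equality means nothing follows this match in the text node -->
    <xsl:value-of select="if (position() = last()) then ' [suffix]' else ' [not a suffix]'"/>
    <xsl:text>&#10;</xsl:text>
   </xsl:matching-substring>
  </xsl:analyze-string>
 </xsl:template>

 <!-- Drop all other text so only the labels are printed -->
 <xsl:template match="text()"/>
</xsl:stylesheet>
```

Applied to the two-paragraph sample document above, this should label the reference in the first paragraph as not a suffix (it is followed by ") ") and the one in the second paragraph as a suffix (the text node ends right before the i element).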

Update:

In a comment, the OP added a requirement: also change <i> to <em>.

This requires only a slight update to the above solution:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>


 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Text nodes in a p that contain a story reference (case-insensitive) -->
 <xsl:template match="p/text()[matches(., 'Story [0-9]+\.[0-9]+', 'i')]">
    <xsl:variable name="vCur" select="."/>
    <xsl:variable name="pContent" select="string(.)"/>
    <xsl:analyze-string select="$pContent" regex="Story [0-9]+\.[0-9]+" flags="i">
      <xsl:matching-substring>
        <xsl:variable name="figureToTargetId">
          <xsl:analyze-string select="." regex="[0-9]+\.[0-9]+">
            <xsl:matching-substring>
              <xsl:value-of select="concat('s',.)"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <a href="#{$figureToTargetId}">
         <xsl:value-of select="."/>
         <!-- position() = last() holds only when this match ends the text node;
              in that case process the immediately following element in the "match" mode -->
         <xsl:if test="position() = last()">
          <xsl:apply-templates mode="match" select="$vCur/following-sibling::node()[1][self::*]"/>
         </xsl:if>
        </a>
      </xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
 </xsl:template>

 <!-- Default mode: suppress that element at its original position so it is not output twice -->
 <xsl:template match=
  "p/*[preceding-sibling::node()[1]
         [self::text()
        and
          matches(., 'Story [0-9]+\.[0-9]+$', 'i')]
         ]"/>

 <!-- "match" mode: inside the link, an i element becomes em -->
 <xsl:template mode="match" match=
  "p/i[preceding-sibling::node()[1]
         [self::text()
        and
          matches(., 'Story [0-9]+\.[0-9]+$', 'i')]
         ]">
  <em><xsl:apply-templates/></em>
 </xsl:template>

</xsl:stylesheet>
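One thing to watch in the updated stylesheet: the "match" mode has a template only for i. Any other element that follows a story reference (say, a b) would fall through to the built-in template rules in that mode, which output just its text and drop the markup. The sketch below is an assumption about how one might close that gap, not part of the original answer. It uses a hypothetical driver template that pushes every child element of p through the "match" mode, plus a fallback that copies anything other than i unchanged.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>

 <!-- Demo driver (hypothetical): process each child element of p in the "match" mode -->
 <xsl:template match="p">
  <p><xsl:apply-templates select="*" mode="match"/></p>
 </xsl:template>

 <!-- i becomes em, as in the answer -->
 <xsl:template match="i" mode="match">
  <em><xsl:apply-templates/></em>
 </xsl:template>

 <!-- Fallback (assumption): any other element is copied unchanged;
      it has lower default priority than the i template above -->
 <xsl:template match="*" mode="match">
  <xsl:copy-of select="."/>
 </xsl:template>
</xsl:stylesheet>
```

In the full stylesheet, only the last template would need to be added alongside the existing "match"-mode template for i.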


Source: https://stackoverflow.com/questions/11923802/finding-mixed-xml-content-with-regular-expressions
