How to split text and preserve HTML tags (XSLT 2.0)

有些话、适合烂在心里 submitted on 2019-12-05 15:57:59
Jukka Matilainen

Here is one way to implement the second approach suggested by Michael Kay, using XSLT 2.0.

This stylesheet demonstrates a two-pass transformation where the first pass introduces <stop/> markers after each sentence and the second pass encloses all groups ending with a <stop/> in a paragraph.

<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- two-pass processing -->
  <xsl:template match="/">
    <xsl:variable name="intermediate">
      <xsl:apply-templates mode="phase-1"/>
    </xsl:variable>
    <xsl:apply-templates select="$intermediate" mode="phase-2"/>
  </xsl:template>

  <!-- identity transform -->
  <xsl:template match="@*|node()" mode="#all" priority="-1">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- phase 1 -->

  <!-- insert <stop/> "milestone markup" after each sentence -->
  <xsl:template match="text()" mode="phase-1">
    <xsl:analyze-string select="." regex="\.\s+">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(0)"/>
        <stop/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- phase 2 -->

  <!-- turn each <stop/>-terminated group into a paragraph -->
  <xsl:template match="*[stop]" mode="phase-2">
    <xsl:copy>
      <xsl:for-each-group select="node()" group-ending-with="stop">
        <p>
          <xsl:apply-templates select="current-group()" mode="#current"/>
        </p>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- remove the <stop/> markers -->
  <xsl:template match="stop" mode="phase-2"/>

</xsl:stylesheet>
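
To make the two passes concrete, here is a sketch of what the intermediate tree might look like. The question's input is not reproduced on this page, so the document below is an assumption, reconstructed from the output quoted further down and from the config and desc element names used in the other stylesheet:

```xml
<!-- assumed input, reconstructed for illustration only -->
<config>
  <desc>A <b>first</b> sentence here. The second sentence with some link <a href="myurl">The link</a>. The <u>third</u> one.</desc>
</config>
```

Under that assumption, phase 1 would leave the inline elements untouched and add a marker after each full stop that is followed by whitespace:

```xml
<!-- intermediate tree after phase 1 (same assumed input) -->
<config>
  <desc>A <b>first</b> sentence here. <stop/>The second sentence with some link <a href="myurl">The link</a>. <stop/>The <u>third</u> one.</desc>
</config>
```

The last sentence gets no marker because \.\s+ requires whitespace after the full stop. That is harmless here: xsl:for-each-group with group-ending-with still emits the trailing nodes as a final group, so phase 2 wraps them in a p as well.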

A good question, and not an easy one to solve. Especially, of course, if you're using XSLT 1.0 (you really need to tell us if that's the case).

I've seen two approaches to the problem. Both involve breaking it into smaller problems.

The first approach is to convert the markup into text (for example replace <b>first</b> by [b]first[/b]), then use text manipulation operations (xsl:analyze-string) to split it into sentences, and then reconstitute the markup within the sentences.

The second approach (which I personally prefer) is to convert the text delimiters into markup (convert "." to <stop/>) and then use positional grouping techniques (typically <xsl:for-each-group group-ending-with="stop"/>) to convert the sentences into paragraphs.

This is my humble solution, based on the second approach suggested in @Michael Kay's answer.

Unlike @Jukka's answer (which is very elegant indeed), I'm not using xsl:analyze-string, as the XPath 1.0 functions contains, substring-before and substring-after are enough to accomplish the split. I've also started the match pattern from the config element.

Here's the transform:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <!-- two pass processing -->
    <xsl:template match="config">
        <xsl:variable name="pass1">
            <xsl:apply-templates select="node()"/>
        </xsl:variable>
        <xsl:apply-templates mode="pass2" select="$pass1/*"/>
    </xsl:template>

    <!-- 1. Copy everything as is (identity) -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- 1. Replace "text. text" with "text<dot/>text" (only the first ". " in each text node is split) -->
    <xsl:template match="text()[contains(.,'. ')]">
        <xsl:value-of select="substring-before(.,'. ')"/>
        <dot/>
        <xsl:value-of select="substring-after(.,'. ')"/>
    </xsl:template>

    <!-- 2. Group by examining in population order ending with dot -->
    <xsl:template match="desc" mode="pass2">
        <xsl:for-each-group select="node()" 
            group-ending-with="dot">
            <p><xsl:apply-templates select="current-group()" mode="pass2"/></p>
        </xsl:for-each-group>
    </xsl:template>

    <!-- 2. Identity -->
    <xsl:template match="node()|@*" mode="pass2">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*" mode="pass2"/>
        </xsl:copy>
    </xsl:template>

    <!-- 2. Replace dot with mark -->
    <xsl:template match="dot" mode="pass2">
        <xsl:text>.</xsl:text>
    </xsl:template>

</xsl:stylesheet>

Applied to the input shown in your question, it produces:

<p>A <b>first</b> sentence here.</p>
<p>The second sentence with some link <a href="myurl">The link</a>.</p>
<p>The <u>third</u> one.</p>
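
One caveat with the contains/substring-after stylesheet: its pass-1 template splits each text node only at the first ". ", which is enough when every text node holds at most one sentence boundary. If a single text node can hold several sentences, a recursive variant along the following lines might help. This is a minimal sketch, not part of the original answer; the template name split-at-dots and the text parameter are hypothetical names chosen for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- identity, as in pass 1 of the stylesheet above -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- hypothetical recursive variant: emit a <dot/> for every ". "
         in the text node, not just the first one -->
    <xsl:template match="text()[contains(., '. ')]" name="split-at-dots">
        <!-- defaults to the matched text node; the recursive call
             overrides it with the remainder still to be split -->
        <xsl:param name="text" select="string(.)"/>
        <xsl:choose>
            <xsl:when test="contains($text, '. ')">
                <xsl:value-of select="substring-before($text, '. ')"/>
                <dot/>
                <xsl:call-template name="split-at-dots">
                    <xsl:with-param name="text"
                        select="substring-after($text, '. ')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <!-- no further ". " left: copy the rest unchanged -->
                <xsl:value-of select="$text"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

</xsl:stylesheet>
```

With this variant, a text node such as "One. Two. Three." would become One<dot/>Two<dot/>Three. with the final full stop left in the text, so the pass-2 grouping from the original stylesheet could be reused as is.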