Normalize space issue with html tags

北战南征 submitted on 2019-12-24 08:34:38

Question


Here's one for you XSLT gurus :-)

I have to deal with XML output from a Java program I cannot control.

In the documents this app outputs, the HTML tags remain as

<u><i><b><em>  

etc., instead of

&lt;u&gt;&lt;i&gt;&lt;b&gt;&lt;em&gt; and so on.

That's not a massive problem; I use XSLT to fix that. But using normalize-space() to remove excess whitespace also removes the spaces before and after these HTML tags.

Example

<Locator Precode="7">
<Text LanguageId="7">The next word is <b>bold</b> and is correctly spaced 
around the html tag,
but the sentence has extra whitespace and 
line breaks</Text>
</Locator>

This is the relevant part of the XSLT script we use to remove the extra whitespace:

<xsl:template match="text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>

In the resulting output, the XSLT has correctly removed the extra whitespace and the line breaks, but it has also removed the space before the tag, resulting in this output:

The next word isboldand is correctly spaced around the html tag, but the sentence has extra whitespace and line breaks.

The spacing before and after the word "bold" has been stripped as well.

Does anyone have any ideas how to prevent this from happening? I'm pretty much at my wits' end, so any help will be greatly appreciated!

:-)

Hi again,

Yes, of course, here's the full stylesheet. We have to deal with the HTML tags and the spacing in one pass.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no" encoding="UTF-8"/>
<xsl:strip-space elements="*" />  


<xsl:template match="@*|node()">
 <xsl:copy> 
  <xsl:apply-templates select="@*|node()"/>
 </xsl:copy>
</xsl:template>


<xsl:template match="Text//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
</xsl:template>


<xsl:template match="Instruction//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>

<xsl:template match="Title//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>


</xsl:stylesheet>

Answer 1:


One XSLT 1.0 solution is an XPath expression that collapses each run of whitespace characters into a single space. The idea is not my own; it is taken from an answer by Dimitre Novatchev.

The advantage over using the built-in normalize-space() function on its own is that a single leading or trailing space of each text node (in your case, the spaces before and after the b element) is kept.
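
To see why the spaces disappear in the first place, it helps to remember that the text() template runs once per text node, and normalize-space() trims each node separately. The following is only a rough Python sketch of that behaviour, not XSLT: normalize_space is a hypothetical helper that imitates the XPath function, and the three strings stand in for the text nodes around the b element in your sample.

```python
import re

# XPath 1.0 treats space, tab, carriage return and line feed as whitespace.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Imitate XPath normalize-space(): collapse runs, then trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")


# Text nodes of the sample <Text> element, split where the <b> element sits.
nodes = [
    "The next word is ",
    "bold",
    " and is correctly spaced \naround the html tag,\n"
    "but the sentence has extra whitespace and \nline breaks",
]

# Each node is normalized on its own, so the spaces at the node
# boundaries (next to the <b> element) are trimmed away.
result = "".join(normalize_space(node) for node in nodes)
print(result)
```

The joined string runs "is", "bold" and "and" together, which matches the symptom in the question.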

EDIT, in response to your edited question: below is that XPath expression incorporated into your stylesheet. Also:

  • Explicitly saying omit-xml-declaration="no" is redundant. It is the default for the xml output method.
  • Several of your templates have the same content. I merged them into a single template using the union operator |.

Stylesheet

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<xsl:strip-space elements="*" />  


<xsl:template match="@*|node()">
 <xsl:copy> 
  <xsl:apply-templates select="@*|node()"/>
 </xsl:copy>
</xsl:template>


<xsl:template match="Text//*|Instruction//*|Title//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>

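<xsl:template match="text()">
  <!-- Re-add one space at either end if the node started or ended with a space;
       normalize-space() alone would trim both ends. -->
  <xsl:value-of select=
  "concat(substring(' ', 1 + not(substring(., 1, 1) = ' ')),
          normalize-space(),
          substring(' ', 1 + not(substring(., string-length(.)) = ' '))
          )
  "/>
</xsl:template>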

</xsl:stylesheet>

XML Output

<?xml version="1.0" encoding="UTF-8"?>
<Locator Precode="7">
   <Text LanguageId="7">The next word is &lt;b&gt;bold&lt;/b&gt; and is correctly spaced around the html tag, but the sentence has extra whitespace and line breaks</Text>
</Locator>
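
For readers who want to trace the concat(...) expression in the text() template step by step, here is a hedged Python sketch of the same logic. It is an emulation, not the stylesheet: normalize_space and normalize_keep_edges are hypothetical helper names, and the tags are written unescaped for readability.

```python
import re

# XPath 1.0 treats space, tab, carriage return and line feed as whitespace.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Imitate XPath normalize-space(): collapse runs, then trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")


def normalize_keep_edges(text: str) -> str:
    """Mirror the concat(...) expression from the text() template."""
    # substring(' ', 1 + not(substring(., 1, 1) = ' ')):
    # yields ' ' only when the node starts with a literal space.
    lead = " " if text[:1] == " " else ""
    # Same test against the last character of the node.
    trail = " " if text[-1:] == " " else ""
    return lead + normalize_space(text) + trail


nodes = [
    "The next word is ",
    "bold",
    " and is correctly spaced \naround the html tag,\n"
    "but the sentence has extra whitespace and \nline breaks",
]

# Tags are passed through untouched; only text nodes are normalized.
result = "".join([
    normalize_keep_edges(nodes[0]), "<b>",
    normalize_keep_edges(nodes[1]), "</b>",
    normalize_keep_edges(nodes[2]),
])
print(result)
```

One thing the sketch makes visible: the expression tests only for a literal space character, so a text node that begins or ends with a newline or tab next to a tag would still be trimmed.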


Source: https://stackoverflow.com/questions/25530078/normalize-space-issue-with-html-tags
