How to check if xml textnode has Chinese characters with RegEx in a XSLT

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-10 16:09:00

Question


On the regex testing site http://gskinner.com/RegExr/ the following pattern matches the Chinese characters in the sample text:

Match: [^\x00-\xff]
Sample Text: test123 或元件数据不可用

But if I have this input XML:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <node>test123 或元件数据不可用</node>
</root>

and I try this XSLT 2.0 stylesheet with Saxon 9:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/root/node">
    <xsl:if test="matches(., '[^\x00-\xff]')">
      <xsl:text>Text has chinese characters!</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

Saxon 9 gives me the following error output:

    FORX0002: Error at character 3 in regular expression "[^\x00-\xff]": invalid escape sequence
  Failed to compile stylesheet. 1 error detected.

How can I check for Chinese characters in XSLT 2.0?


Answer 1:


The regex dialect supported by XPath is based on that defined in XSD: you can find full specifications in the W3C documents, or if you prefer something more readable, in my XSLT 2.0 Programmer's Reference. Don't assume that all regex dialects are the same. There's no \x escape in XPath regexen because it's designed for embedding in XML which already offers &#xHHHH;.
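
As a rough illustration of the point about character references, the failing pattern from the question could be rewritten as below. This is only a sketch: it assumes the intent of `[^\x00-\xff]` was "any character above U+00FF", so it uses the positive range U+0100 to U+10FFFF instead (U+0000 cannot be written as an XML character reference). The message text is made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/root/node">
    <!-- The XML parser resolves the two character references before the
         regex is compiled, so the regex engine sees a plain character range. -->
    <xsl:if test="matches(., '[&#x100;-&#x10FFFF;]')">
      <xsl:text>Text has characters above U+00FF!</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Note that this test, like the original regex, is not specific to Chinese: it also fires on Greek, Cyrillic, or any other character outside Latin-1.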

Rather than using a hex range you might find it more convenient to use a named Unicode block, for example \p{IsCJKUnifiedIdeographs}.

See also What's the complete range for Chinese characters in Unicode?




Answer 2:


With help from Michael Kay I can answer my own question. Thanks, Michael! The solution works, but in my opinion these long Unicode ranges do not look very pretty.

This XSLT prints a text message if the regular expression finds any Chinese character in the given XML:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/root/node">
    <xsl:if test="matches(.,'[&#x4E00;-&#x9FFF;&#x3400;-&#x4DBF;&#x20000;-&#x2A6DF;&#xF900;-&#xFAFF;&#x2F800;-&#x2FA1F;]')">
      <xsl:text>Text has chinese characters!</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

Solution with named Unicode blocks:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/root/node">
    <xsl:if test="matches(., '[\p{IsCJKUnifiedIdeographs}\p{IsCJKUnifiedIdeographsExtensionA}\p{IsCJKUnifiedIdeographsExtensionB}\p{IsCJKCompatibilityIdeographs}\p{IsCJKCompatibilityIdeographsSupplement}]')">
      <xsl:text>Text has chinese characters!</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
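
If the long pattern is needed in several places, one option is to write it once and hide it behind a stylesheet function. The following is a minimal sketch, not part of the original answers: the `my` prefix, its namespace URI `urn:example:cjk`, the variable `cjk-pattern`, and the function `my:has-cjk` are all hypothetical names chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:cjk"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical global variable: the block list is written once here. -->
  <xsl:variable name="cjk-pattern" as="xs:string"
      select="'[\p{IsCJKUnifiedIdeographs}\p{IsCJKUnifiedIdeographsExtensionA}\p{IsCJKUnifiedIdeographsExtensionB}\p{IsCJKCompatibilityIdeographs}\p{IsCJKCompatibilityIdeographsSupplement}]'"/>

  <!-- Hypothetical helper: true if the text contains a character
       from any of the blocks listed in $cjk-pattern. -->
  <xsl:function name="my:has-cjk" as="xs:boolean">
    <xsl:param name="text" as="xs:string?"/>
    <xsl:sequence select="matches($text, $cjk-pattern)"/>
  </xsl:function>

  <xsl:template match="/root/node">
    <xsl:if test="my:has-cjk(.)">
      <xsl:text>Text has chinese characters!</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The templates then stay short, and a change to the set of blocks only touches the variable.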


Source: https://stackoverflow.com/questions/6611839/how-to-check-if-xml-textnode-has-chinese-characters-with-regex-in-a-xslt
