Combine XSLT and regex to find strings

旧时模样 submitted on 2021-01-07 03:10:48

Question


I'm trying to find a way to identify specific strings, punctuation, and similar patterns in XML files, where a match should count inside some elements but not inside others. In other words, I sometimes want to ignore <command>, <screen>, or other elements.

Sample source XML:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY % common_entities SYSTEM "../../../common.ent">
%common_entities;
]>
<section>
  <title>Summary</title>
  <para>Sample file.</para>
  <itemizedlist>
    <listitem>
      <para>No issues at all.</para>
    </listitem>
    <listitem>
      <para>Contains a command, <command>cd ../</command>, which contains valid orphan punctuation.</para>
    </listitem>
    <listitem>
      <para>Contains , random punctuation . in strange places, that should be identified.</para>
    </listitem>
  </itemizedlist>
<screen><prompt>[user@demo ~]$ </prompt><userinput>openstack , volume snapshot delete 53d27-2c10</userinput></screen>
  <para>
    The above screen element contains an orphan comma that should be ignored.
  </para>
</section>

XSL from @MichaelKay (I added the header info):

<?xml version="1.0"?>
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>

<!-- Match errors -->
<xsl:template match="entry/text()[matches(., '\s[.,:;?!]')]"
              mode="look-for-bad-punctuation" priority="5">
  <bad-punctuation-found/>
</xsl:template>

<!-- Match unchecked elements -->
<xsl:template match="screen/text() | command/text()"
             mode="look-for-bad-punctuation" priority="6">
  <xsl:copy-of select="."/>
</xsl:template>

<!-- Match elements with no error -->
<xsl:template match="text()"
             mode="look-for-bad-punctuation" priority="4">
  <xsl:copy-of select="."/>
</xsl:template>

</xsl:stylesheet>

Expected output:

Bad punctuation found: Contains ,

Bad punctuation found: random punctuation . etc.

If it can refer to line numbers that would be great.

What I'm getting at the moment is just the full text of the source file, minus all the DocBook elements, e.g: This sentence contains a command, cd ../, which contains valid orphan punctuation.

I'm using saxon-he-10.1.


Answer 1:


Your stylesheet contains the necessary rules, but it's missing the code that asks for the rules to be applied. Just add

<xsl:template match="/">
  <xsl:apply-templates select="//text()" mode="look-for-bad-punctuation"/>
</xsl:template>

You'll also need to do some fine-tuning of which elements are handled specially, e.g. screen/command/prompt/userinput.
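
As a rough sketch of how the pieces might fit together for the sample file above: the stylesheet below adds the entry-point template, switches the error rule from entry to para (an assumption, since the sample contains no entry elements), skips any text that has a screen or command ancestor, and writes plain-text messages instead of copying text through. Note that it reports the whole offending text node rather than only the fragment around the punctuation.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Entry point: visit every text node in the document -->
  <xsl:template match="/">
    <xsl:apply-templates select="//text()" mode="look-for-bad-punctuation"/>
  </xsl:template>

  <!-- Unchecked elements: ignore text at any depth inside screen or command,
       which also covers prompt and userinput nested in screen -->
  <xsl:template match="text()[ancestor::screen or ancestor::command]"
                mode="look-for-bad-punctuation" priority="6"/>

  <!-- Errors: whitespace directly followed by punctuation in para text
       (para is an assumption based on the sample file) -->
  <xsl:template match="para/text()[matches(., '\s[.,:;?!]')]"
                mode="look-for-bad-punctuation" priority="5">
    <xsl:text>Bad punctuation found: </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Everything else: produce no output -->
  <xsl:template match="text()"
                mode="look-for-bad-punctuation" priority="4"/>

</xsl:stylesheet>
```

Using the ancestor axis rather than screen/text() means the comma inside <userinput> is skipped without listing every child element separately. Against the sample file this should flag only the text of the third list item.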

For line numbers, Saxon-PE and higher offer the extension function saxon:line-number(). Line numbering also needs to be enabled with -l on the command line.
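
A minimal sketch of what the line-number variant could look like, assuming Saxon-PE or Saxon-EE (not the HE edition mentioned in the question) and line numbering switched on with -l:on. It reports the line of the parent element rather than of the text node itself, on the assumption that this is close enough for locating the problem; the para match and the message wording are illustrative choices, not part of the original rules.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Run with line numbering enabled, e.g. adding -l:on to the usual
       -s: and -xsl: options of the Saxon command line -->

  <!-- Entry point: visit every text node in the document -->
  <xsl:template match="/">
    <xsl:apply-templates select="//text()" mode="look-for-bad-punctuation"/>
  </xsl:template>

  <!-- Report the line of the enclosing element plus the offending text -->
  <xsl:template match="para/text()[matches(., '\s[.,:;?!]')]"
                mode="look-for-bad-punctuation" priority="5">
    <xsl:text>Line </xsl:text>
    <xsl:value-of select="saxon:line-number(..)"/>
    <xsl:text>: bad punctuation found: </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Everything else: produce no output -->
  <xsl:template match="text()"
                mode="look-for-bad-punctuation" priority="4"/>

</xsl:stylesheet>
```

If line numbering is not enabled, the function is not expected to return a useful value, so it is worth checking the -l option first when the numbers look wrong.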



Source: https://stackoverflow.com/questions/63369012/combine-xslt-and-regex-to-find-strings
