Regex to replace a single quote with two single quotes if it is inside <xsl: or <XSL:

Anonymous (unverified), submitted on 2019-12-03 01:20:02

Question:

I need a regular expression that replaces ' with '' when it occurs inside an <xsl: tag; everywhere else ' should remain as it is.
Code Snippet:

    public static void main(String[] args) {
        String replaceSingleQuoteInsideXsltCondition = "(<\\s*?xsl\\s*?:.*?=.*?)(')(.*?)(')(.*?>)";
        String dummyXSLT = "<p>Thank you for sending us <xsl:for-each select=\"catalog/cd[artist='Bob Dylan']\"> " +
                "paper's to prove your <span class=\"highlight\"><xsl:if test=\"D01 ='Y'\">Income</xsl:if></span> <span class=\"highlight\"><xsl:if test=\"D02 ='Y'\">&#160;and&#160;" +
                "</xsl:if></span><span class=\"highlight\"><xsl:if test=\"D03 ='Y'\">Citizenship and/or Identity</xsl:if></span>. " +
                "We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>" +
                "contact number for inquiry = '478965152' and email id = 'pqr@xyz'" +
                "<xsl:template match=\"num[ . = 3 or . = 5]\"/></xsl:stylesheet><xsl:if test=\"contains($search, 'Web Developer') and (contains($expSearch, 'Computer') or contains($expSearch, 'Information') or contains($expSearch, 'Web' ))\">" +
                "<xsl:if test=\"((node/ABC!='') and (normalize-space(node/DEF)='') and (normalize-space(node/GHI)=''))\"> just a dummy sample.</xsl:if>";
        System.out.println(dummyXSLT.replaceAll(replaceSingleQuoteInsideXsltCondition,  "$1''$3''$5"));
    }

Actual result of the above code:

<p>Thank you for sending us <xsl:for-each select="catalog/cd[artist=''Bob Dylan'']"> paper's to prove your <span class="highlight"><xsl:if test="D01 =''Y''">Income</xsl:if></span> <span class="highlight"><xsl:if test="D02 =''Y''">&#160;and&#160;</xsl:if></span><span class="highlight"><xsl:if test="D03 =''Y''">Citizenship and/or Identity</xsl:if></span>. We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>contact number for inquiry = '478965152' and email id = 'pqr@xyz'<xsl:template match="num[ . = 3 or . = 5]"/></xsl:stylesheet><xsl:if test="contains($search, ''Web Developer'') and (contains($expSearch, 'Computer') or contains($expSearch, 'Information') or contains($expSearch, 'Web' ))"><xsl:if test="((node/ABC!='''') and (normalize-space(node/DEF)='') and (normalize-space(node/GHI)=''))"> just a dummy sample.</xsl:if>

Expected Result:

<p>Thank you for sending us <xsl:for-each select="catalog/cd[artist=''Bob Dylan'']"> paper's to prove your <span class="highlight"><xsl:if test="D01 =''Y''">Income</xsl:if></span> <span class="highlight"><xsl:if test="D02 =''Y''">&#160;and&#160;</xsl:if></span><span class="highlight"><xsl:if test="D03 =''Y''">Citizenship and/or Identity</xsl:if></span>. We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>contact number for inquiry = '478965152' and email id = 'pqr@xyz'<xsl:template match="num[ . = 3 or . = 5]"/></xsl:stylesheet><xsl:if test="contains($search, ''Web Developer'') and (contains($expSearch, ''Computer'') or contains($expSearch, ''Information'') or contains($expSearch, ''Web'' ))"><xsl:if test="((node/ABC!='''') and (normalize-space(node/DEF)='''') and (normalize-space(node/GHI)=''''))"> just a dummy sample.</xsl:if>

Answer 1:

I assume that it is OK to use two different regex replacements, one of them in a loop.
(The "g" modifier alone does not help.)

Here is the concept for a Java implementation of your use case:

  • first replace all '' by '''',
    once but globally
  • replace (<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+) by \1''\3''\5, not globally but in a loop until it no longer replaces anything
  • if that works, the next step is to make it accept both xsl and XSL and allow the desired optional whitespace; written as a Java string literal, with a non-capturing (?:...) group so that the group numbers in the replacement stay the same:
    (<\\s*(?:xsl|XSL)([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)

I am no javaman (respectful pun intended), so I cannot offer a demonstrator in Java.
Here is a demonstrator in sed (you do not need it, it just shows what I tested).
It implements the above concept and produces the desired output for the given sample input.

bash-3.1$ sed -En "1{s/''/''''/g;:a;s/(<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)/\1''\3''\5/;ta;p};" input.txt > output.txt

The main trick is to look for something that does NOT occur in an already successfully replaced part, and then to keep replacing for as long as a replacement succeeds.
The secondary trick is to first replace everything that needs to be replaced but already looks replaced ('' -> '''').

Note:
While Java and sed potentially have different regex flavors, I don't see anything that obviously conflicts when comparing your regex with mine. My basic version does not even contain any \s, \d, \w or similar.
In Java, replacement strings reference groups with $, so use your $1''$3''$5 instead of my \1''\3''\5.
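
For readers who want to try the concept in Java, here is a minimal sketch of the two steps (one global pre-replacement, then a loop around a single non-global replacement). The class name XslQuoteDoubler and the method name doubleQuotesInXsl are made up for illustration, and the pattern is the xsl/XSL variant with a non-capturing group so that $1, $3 and $5 keep their meaning. It has only been reasoned through on small inputs like the ones in main, so treat it as a starting point rather than a tested solution.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the original answer.
final class XslQuoteDoubler {

    // One not-yet-doubled '...' pair inside an <xsl ...> or <XSL ...> tag.
    // Already doubled quotes ('') are skipped by the ([^>']|'') parts.
    private static final Pattern QUOTED_IN_XSL = Pattern.compile(
            "(<\\s*(?:xsl|XSL)([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)");

    static String doubleQuotesInXsl(String input) {
        // Step 1: once, globally: '' -> '''' (so empty strings do not look "already replaced").
        String current = input.replace("''", "''''");

        // Step 2: replace one pair at a time until nothing changes any more.
        while (true) {
            String next = QUOTED_IN_XSL.matcher(current).replaceFirst("$1''$3''$5");
            if (next.equals(current)) {
                return current;
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(doubleQuotesInXsl(
                "<xsl:if test=\"contains($s, 'A') and contains($e, 'B')\">it's</xsl:if>"));
        System.out.println(doubleQuotesInXsl("<xsl:if test=\"a=''\">x</xsl:if>"));
    }
}
```

Note that step 1 doubles every '' in the whole input, including any that occur outside xsl tags. That is harmless for the sample input, which has no '' outside the tags, but it is an assumption to check against real data.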



Answer 2:

This is impossible with a regex if you allow arbitrary nesting of elements within the <xsl:...> ... </xsl:...> tags. See the Stack Overflow question "RegEx match open tags except XHTML self-contained tags".

You could design a regex for this particular case, but not for every possible case.



Answer 3:

If you are only parsing the tags themselves, this works.
If you are trying to interpret HTML closure (matching opening tags to their closing tags), it can't be done with Java regex.

The basic idea is that you can't parse only the xsl tags. All tags must be parsed
to advance the match position and to get past tags that may hide HTML.

In the regex below, capture group 2 contains the xsl tags you want to find.

All tags will be matched. You can ignore the others and just check whether
capture group 2 has length. That is the one you want to manipulate.

What we do is a Replace All with a Callback.

Inside the callback:

  • If capture group 2 did not match (i.e. has no length),
    just return the contents of capture group 0 (the whole match).
    This replaces the match with itself. These are the other tags.

  • If capture group 2 did match, copy group 2 to a string
    and run another regex replace on that string (its contents).
    That would be a global Find (?<!')'(?!') Replace ''.
    Return that string, wrapped in the < and > that lie outside group 2,
    as the replacement in the callback.

That's all there is to it.

Hold on to yourself now.
This is the regex.

(Feel free to make this case insensitive if you want)

"<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(xsl:[\\w:-]*\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"

Expanded

 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                  #|  head
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed
                )                             # (1 end)
                (?:
                     \s+
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>]
                          )?
                     )+
                )?
                \s* >
           )
           [\S\s]*? </ \1 \s*
           (?= > )
      )
   |  (?: /? [\w:]+ \s* /? )
   |  (                             # (2 start), The xsl: we want to find
           xsl: [\w:-]*
           \s+
           (?:
                " [\S\s]*? "
             |  ' [\S\s]*? '
             |  [^>]?
           )+
           \s* /?
      )                             # (2 end)
   |  (?:
           [\w:]+
           \s+
           (?:
                " [\S\s]*? "
             |  ' [\S\s]*? '
             |  [^>]?
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >

Final note: to see how effective and quick this regex is,
take a large HTML source file and run a global find and replace that replaces every match with the empty string.
You will then see all the content, totally stripped of HTML.
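
The "Replace All with a Callback" step can be sketched in Java with the classic find / appendReplacement loop (on Java 9 and later, Matcher.replaceAll with a function argument expresses the same idea). To keep the example short enough to check by hand, it does NOT use the full regex from this answer. It uses a much simpler, hypothetical stand-in in which group 1 (not group 2) captures the inside of an xsl: tag and a second branch consumes every other tag. The class and method names are likewise made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; TAG is a simplified stand-in, not the full regex from the answer.
final class XslTagQuoteCallback {

    // Attribute area: quoted strings (which may contain '>') or any other character except '>'.
    private static final String ATTRS = "(?:\"[^\"]*\"|'[^']*'|[^>\"'])*";

    // Branch 1 (group 1): inside of an xsl: tag that has attributes.
    // Branch 2: any other tag, matched only so the scan moves past it.
    private static final Pattern TAG = Pattern.compile(
            "<(?:(xsl:[\\w:-]*\\s+" + ATTRS + ")|[/!?\\w]" + ATTRS + ")>",
            Pattern.CASE_INSENSITIVE);

    // A single quote that has no other single quote directly before or after it.
    private static final Pattern LONE_QUOTE = Pattern.compile("(?<!')'(?!')");

    static String doubleQuotesInXslTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String xsl = m.group(1);
            String replacement = (xsl == null)
                    ? m.group()                                            // other tag: keep as is
                    : "<" + LONE_QUOTE.matcher(xsl).replaceAll("''") + ">"; // xsl tag: double quotes
            // quoteReplacement keeps '$' and '\' in the tag from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleQuotesInXslTags(
                "<p>it's <xsl:if test=\"contains($search, 'Web')\">ok</xsl:if></p>"));
    }
}
```

One thing to be aware of: because of the lookarounds, (?<!')'(?!') leaves an existing '' (an empty string literal) untouched, whereas the expected result in the question turns '' into ''''. If that matters, a plain replace of every ' by '' inside the captured tag is an alternative worth trying.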


