RegEx: Don't match a certain character if it's inside quotes

Submitted by こ雲淡風輕ζ on 2019-12-04 05:55:40
Vasili Syrakis

Regular Expression:

<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

Online demo:

http://regex101.com/r/yX5xS8

Full Explanation:

I know this regex might be a headache to look at, so here is my explanation:

<                      # Opening bracket of the HTML tag
    [^>]*?             # Lazily match any characters except >
    (?:                # Open outer non-capturing group
        (?:            # Open inner non-capturing group
            ('|")      # Capture the opening quote as group 1
            [^'"]*?    # Lazily match any characters except quotes
            \1         # Backreference: the same quote that opened the string
        )              # Close inner non-capturing group
        [^>]*?         # Lazily match any characters except >
    )*                 # Close outer group; repeat zero or more times
>                      # Closing bracket of the HTML tag
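
As a quick check, the pattern can be run in JavaScript. This is only a minimal sketch: the sample string and the variable names are made up for illustration and are not part of the answer.

```javascript
// Pattern from the answer above, with the global flag so every tag is found.
const tagExp = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/g;

// Hypothetical input: the second tag has a ">" inside a quoted attribute.
const sample = 'some text <tag link="foo"> and <tag link="fo>o"> other text';

// With the g flag, match() returns the full text of every match.
const tags = sample.match(tagExp);
console.log(tags); // [ '<tag link="foo">', '<tag link="fo>o">' ]
```

The ">" inside "fo>o" does not end the tag, because the quoted string is consumed as a unit before the final > is tried.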

This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

<                    # start of HTML tag
    [^'">]*          #   any characters except ', " or >
    (                #   outer group
        (            #     inner group
            "[^"]*"  #       "..."
        |            #      or
            '[^']*'  #       '...'
        )            #
        [^'">]*      #   any characters except ', " or >
    )*               #   zero or more of outer group
>                    # end of HTML tag

This version is slightly better than Vasili's in that single quotes are allowed inside "…" and double quotes are allowed inside '…', and in that an (incorrect) tag like <a href='> will not be matched.

It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?: in all places. (Just using ( makes the regex shorter and a little more readable.)
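
The differences between the two patterns can be checked with a short JavaScript sketch. The input strings and variable names here are hypothetical, chosen only to exercise the cases described above; the second pattern is written with (?: in place of ( as suggested.

```javascript
// First answer's pattern (lazy quantifiers, backreference for the closing quote).
const lazyExp = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;
// Second answer's pattern, with every ( replaced by (?: so nothing is captured.
const strictExp = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/;

// Hypothetical inputs: an unterminated quote, and a single quote inside "...".
const unterminated = "<a href='>";
const mixedQuotes = `<a title="it's > here">`;

console.log(lazyExp.test(unterminated));       // true
console.log(strictExp.test(unterminated));     // false
console.log(mixedQuotes.match(strictExp)[0]);  // <a title="it's > here">
console.log(mixedQuotes.match(lazyExp)[0]);    // <a title="it's >
```

The last line shows why the mixed-quote case matters: the first pattern stops at the > inside the attribute value when the value contains the other kind of quote.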

(<.+?>[^<]+>)|(<.+?>)

You can write two regexes and then combine them with '|'. In this case:

(<.+?>[^<]+>)   # matches the tag in:  some text <tag link="fo>o"> other text
(<.+?>)         # matches the tag in:  some text <tag link="foo"> other text

If the first alternative matches, the second one is not tried, so make sure you put the special case first.
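
To see which alternative fires, the capture groups can be inspected in JavaScript. This is a rough sketch with made-up variable names; the first two inputs are the ones from the comments above, and the third is an extra hypothetical input added to show a limitation.

```javascript
const combined = /(<.+?>[^<]+>)|(<.+?>)/;

// Special case: a ">" inside the attribute value, so group 1 matches.
const special = 'some text <tag link="fo>o"> other text'.match(combined);
console.log(special[1]); // <tag link="fo>o">
console.log(special[2]); // undefined

// Ordinary case: group 1 fails, so the engine falls back to group 2.
const ordinary = 'some text <tag link="foo"> other text'.match(combined);
console.log(ordinary[1]); // undefined
console.log(ordinary[2]); // <tag link="foo">

// Caveat: a bare ">" in the text after a tag is also swallowed by group 1.
const overMatch = '<b> 1 > 0'.match(combined);
console.log(overMatch[0]); // <b> 1 >
```

Because the first alternative does not look at quotes at all, it is best treated as a heuristic for inputs where a stray > can only appear inside attribute values.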

If you want this to work with escaped double quotes, try:

/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g

For example:

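const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
// Returns the index of the next ">" that is outside double quotes, or -1 if none.
// Assumes `xml` holds the XML source string; the search starts at gtExp.lastIndex.
const nextGtMatch = () => {
    const match = gtExp.exec(xml);
    return match ? match.index : -1;
};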

And if you're parsing through a large XML document, set .lastIndex before each search so that exec resumes from your current position:

gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
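
Putting the pieces together, a self-contained version might look like the sketch below. The helper name findTagEnd and the sample strings are assumptions made for illustration; only the regex itself comes from the answer.

```javascript
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

// Hypothetical helper: index of the first ">" at or after `from` that is
// followed by an even number of unescaped double quotes, or -1 if none.
function findTagEnd(xml, from) {
    gtExp.lastIndex = from; // a global regex starts exec() at lastIndex
    const match = gtExp.exec(xml);
    return match ? match.index : -1;
}

// Made-up input: the ">" at index 14 sits inside the quoted attribute value.
const xml = '<item note="a > b" id="1">text</item>';
console.log(findTagEnd(xml, 0));  // 25, the end of the opening tag
console.log(findTagEnd(xml, 26)); // 36, the end of the closing tag

// Backslash-escaped quotes inside the value do not end the quoted string.
const escaped = '<item note="say \\"hi\\" > ok" id="1">';
console.log(findTagEnd(escaped, 0)); // 35
```

Note that the lookahead scans to the end of the input for every candidate >, so on very large documents this approach may be slow compared with a single forward pass.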