RegEx: Don't match a certain character if it's inside quotes

我的梦境 提交于 2019-12-09 16:53:43

问题


Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.

Say I have this string:

some text <tag link="fo>o"> other text

I want to match the whole tag but if I use <[^>]+> it only matches <tag link="fo>.

How can I make sure that > inside of quotes can be ignored.

I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.


回答1:


Regular Expression:

<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

Online demo:

http://regex101.com/r/yX5xS8

Full Explanation:

I know this regex might be a headache to look at, so here is my explanation:

<                      # Open HTML tags
    [^>]*?             # Lazy Negated character class for closing HTML tag
    (?:                # Open Outside Non-Capture group
        (?:            # Open Inside Non-Capture group
            ('|")      # Capture group for quotes, backreference group 1
            [^'"]*?    # Lazy Negated character class for quotes
            \1         # Backreference 1
        )              # Close Inside Non-Capture group
        [^>]*?         # Lazy Negated character class for closing HTML tag
    )*                 # Close Outside Non-Capture group
>                      # Close HTML tags



回答2:


This is a slight improvement on Vasili Syrakis answer. It handles "…" and '…' completely separately, and does not use the *? qualifier.

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

<                    # start of HTML tag
    [^'">]*          #   any non-single, non-double quote or greater than
    (                #   outer group
        (            #     inner group
            "[^"]*"  #       "..."
        |            #      or
            '[^']*'  #       '...'
        )            #
        [^'">]*      #   any non-single, non-double quote or greater than
    )*               #   zero or more of outer group
>                    # end of HTML tag

This version is slightly better than Vasilis's in that single quotes are allowed inside "…", and double quotes are allowed inside '…', and that a (incorrect) tag like <a href='> will not be matched.

It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?:, in all places. (Just using ( makes the regex shorter, and a little bit more readable).




回答3:


(<.+?>[^<]+>)|(<.+?>)

you can make two regexs than put them togather by using '|', in this case :

(<.+?>[^<]+>)   #will match  some text <tag link="fo>o"> other text
(<.+?>)         #will match  some text <tag link="foo"> other text

if the first case match, it will not use second regex, so make sure you put special case in the firstplace.




回答4:


If you want this to work with escaped double quotes, try:

/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g

For example:

const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
const nextGtMatch = () => ((exec) => {
    return exec ? exec.index : -1;
})(gtExp.exec(xml));

And if you're parsing through a bunch of XML, you'll want to set .lastIndex.

gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes


来源:https://stackoverflow.com/questions/22164374/regex-dont-match-a-certain-character-if-its-inside-quotes

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!