Using regex to match string between two strings while excluding strings

隐瞒了意图╮ 2020-12-06 19:41

Following on from a previous question in which I asked:

How can I use a regular expression to match text that is between two strings, where those two

6 answers
  •  佛祖请我去吃肉
    2020-12-06 20:27

    Tola, resurrecting this question because it had a fairly simple regex solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

    The idea is to build an alternation (a series of branches separated by |) where the left-hand branches match what we don't want, simply to get it out of the way. The last branch then matches what we do want and captures it to Group 1. If Group 1 is set, you retrieve it and you have a match.

    So what do we not want?

    First, we want to eliminate the whole outer block if "unwanted" appears between outer-start and inner-start. You can do it with:

    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end
    

    This will be to the left of the first |. It matches a whole outer block.
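
    As a quick check of this first branch on its own, here is a minimal Python sketch. The sample strings are made up for illustration, and the marker words are the placeholders used in this answer, not real syntax.

```python
import re

# Branch 1: an outer block where "unwanted" shows up before "inner-start".
# re.DOTALL plays the role of the (?s) flag so that . also matches newlines.
branch1 = re.compile(
    r"outer-start(?:(?!inner-start).)*?unwanted.*?outer-end",
    re.DOTALL,
)

# Hypothetical sample blocks, invented for this sketch.
bad = "outer-start unwanted inner-start text inner-end outer-end"
good = "outer-start inner-start text inner-end outer-end"

print(branch1.fullmatch(bad) is not None)  # True: the whole block is consumed
print(branch1.search(good) is None)        # True: the clean block is left alone
```

    The tempered dot (?:(?!inner-start).) refuses to step onto inner-start, so "unwanted" only counts when it comes before the inner block.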

    Second, we want to eliminate the whole outer block if "unwanted" appears between inner-end and outer-end. You can do it with:

    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
    

    This will be the middle branch of the alternation. It looks a bit complicated because each tempered dot, (?:(?!outer-end).), stops the lazy *? from jumping over the end of a block into a different block.
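
    To see why the guard against outer-end matters, the sketch below compares this branch with a naive version that uses plain .*? everywhere. The naive pattern and the sample text are assumptions added for illustration only.

```python
import re

# Branch 2 as written above: never step over an "outer-end" before "unwanted".
branch2 = re.compile(
    r"outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end",
    re.DOTALL,
)

# Hypothetical naive variant without the tempered dots, for comparison only.
naive = re.compile(
    r"outer-start.*?inner-end.*?unwanted.*?outer-end",
    re.DOTALL,
)

# Two blocks: the first is clean, the second has "unwanted" after inner-end.
text = (
    "outer-start inner-start A inner-end outer-end "
    "outer-start inner-start B inner-end unwanted outer-end"
)

m = branch2.search(text)
n = naive.search(text)

print(m.group(0))          # only the second block
print(n.group(0) == text)  # True: the naive version swallows both blocks
```

    The naive pattern starts in the clean block and reaches "unwanted" in the next one, which would wrongly discard the block we want to keep.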

    Third, we match and capture what we want. This is:

    inner-start\s*(text-that-i-want)\s*inner-end
    

    So the whole regex, in free-spacing mode, is:

    (?xs)
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
    | # OR (also don't want that)
    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
    | # OR capture what we want
    inner-start\s*(text-that-i-want)\s*inner-end
    

    In this demo, look at the Group 1 captures on the right: they contain what we want, and only for the correct block.
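
    For readers who want to try the Group 1 approach without the demo, here is a minimal Python sketch. It assumes (.*?) as a stand-in for text-that-i-want and uses invented sample blocks. The flags are passed as re.VERBOSE | re.DOTALL instead of the inline (?xs).

```python
import re

pattern = re.compile(
    r"""
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end   # don't want this
    |                                                       # OR (also don't want that)
    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
    |                                                       # OR capture what we want
    inner-start\s*(.*?)\s*inner-end   # (.*?) is an assumed stand-in for text-that-i-want
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical input: two blocks to reject, one block to keep.
text = (
    "outer-start unwanted inner-start skip-1 inner-end outer-end\n"
    "outer-start inner-start skip-2 inner-end unwanted outer-end\n"
    "outer-start inner-start keep-me inner-end outer-end\n"
)

# Matches from the first two branches leave Group 1 unset, so filter them out.
wanted = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(wanted)  # ['keep-me']
```

    The check is "is not None" because an empty capture is still a valid match from the third branch.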

    In Perl and PCRE (used for instance in PHP), you don't even have to look at Group 1: you can force the regex to skip the two blocks we don't want. The regex becomes:

    (?xs)
    (?: # non-capture group: the things we don't want
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
    | # OR (also don't want that)
    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
    )
    (*SKIP)(*F) # we don't want this, so fail and skip
    | # OR match what we want (no capture group needed)
    inner-start\s*\Ktext-that-i-want(?=\s*inner-end)
    

    See demo: it directly matches what you want.
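
    As far as I know, Python's built-in re module supports neither (*SKIP)(*F) nor \K, so there the Group 1 version is the one to use. For replacements, a rough equivalent is a callback that returns the unwanted blocks unchanged. The sketch below assumes (.*?) as a stand-in for text-that-i-want, and the helper name replace_wanted and the sample text are invented for illustration.

```python
import re

pattern = re.compile(
    r"""
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end
    |
    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
    |
    inner-start\s*(.*?)\s*inner-end   # assumed stand-in for text-that-i-want
    """,
    re.VERBOSE | re.DOTALL,
)


def replace_wanted(m):
    """Hypothetical helper: swap only the Group 1 text, keep everything else."""
    if m.group(1) is None:
        # One of the "don't want" branches matched: put the block back as-is.
        return m.group(0)
    start, end = m.span(1)
    offset = m.start()
    whole = m.group(0)
    return whole[: start - offset] + "REPLACED" + whole[end - offset:]


text = (
    "outer-start unwanted inner-start skip-1 inner-end outer-end\n"
    "outer-start inner-start keep-me inner-end outer-end\n"
)

result = pattern.sub(replace_wanted, text)
print(result)
```

    Only the second block changes: its inner text becomes REPLACED, while the block containing "unwanted" is consumed by the first branch and written back untouched.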

    The technique is explained in full detail in the question and article below.

    Reference

    • How to match (or replace) a pattern except in situations s1, s2, s3...
    • Article about matching a pattern unless...
