How do look-ahead and look-behind support the concept of zero-width assertions in Ruby regex?

一曲冷凌霜 submitted on 2019-12-30 04:39:05

Question


I just went through the concept of zero-width assertions in the documentation, and some quick questions came to mind:

  • Why the name "zero-width assertions"?
  • How do look-ahead and look-behind support the zero-width assertion concept?
  • What do the four symbols ?<=s, ?<!s, ?=s, and ?!s instruct inside the pattern? Can you help me understand what is actually going on?

I also tried some small snippets to understand the logic, but I am not very confident about their output:

irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"
irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"
irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"

Can anyone help me understand this?

EDIT

Here I have tried two snippets, one using a zero-width assertion, as below:

irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

and the other without a zero-width assertion, as below:

irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"

Both produce the same output. How does each regexp proceed internally to produce it? Could you help me visualize this?

Thanks


Answer 1:


Regular expressions match from left to right, and move a sort of "cursor" along the string as they go. If your regex contains a regular character like a, this means: "if there's a letter a in front of the cursor, move the cursor ahead one character, and keep going. Otherwise, something's wrong; back up and try something else." So you might say that a has a "width" of one character.

A "zero-width assertion" is just that: it asserts something about the string (i.e., doesn't match if some condition doesn't hold), but it doesn't move the cursor forwards, because its "width" is zero.

You're probably already familiar with some simpler zero-width assertions, like ^ and $. In Ruby these match the start and end of a line (\A and \z are the equivalents for the whole string). If the cursor isn't at the start or end when it sees those symbols, the regex engine will fail, back up, and try something else. But they don't actually move the cursor forwards, because they don't match characters; they only check where the cursor is.
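Here's a small sketch of what "only checks where the cursor is" looks like in practice. The sample strings are made up for illustration; the point is that the match an anchor produces is empty.

```ruby
# An anchor matches a position, so the matched text is empty.
m = /^/.match("abc")
puts m[0].inspect      # the matched text
puts m.begin(0)        # offset where the match starts
puts m.end(0)          # offset where it ends (same place)

# Substituting at a position inserts text without removing anything.
prefixed = "abc".sub(/^/, ">")
puts prefixed

# ^ is line-based in Ruby; \A only matches at the very start of the string.
line_starts   = "a\nb".scan(/^./)
string_starts = "a\nb".scan(/\A./)
puts line_starts.inspect
puts string_starts.inspect
```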

Lookahead and lookbehind work the same way. When the regex engine tries to match them, it checks around the cursor to see if the right pattern is ahead of or behind it, but in case of a match, it doesn't move the cursor.

Consider:

/(?=foo)foo/.match 'foo'

This will match! The regex engine goes like this:

  1. Start at the beginning of the string: |foo.
  2. The first part of the regex is (?=foo). This means: only match if foo appears after the cursor. Does it? Well, yes, so we can proceed. But the cursor doesn't move, because this is zero-width. We still have |foo.
  3. Next is f. Is there an f in front of the cursor? Yes, so proceed, and move the cursor past the f: f|oo.
  4. Next is o. Is there an o in front of the cursor? Yes, so proceed, and move the cursor past the o: fo|o.
  5. Same thing again, bringing us to foo|.
  6. We reached the end of the regex, and nothing failed, so the pattern matches.
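The walk-through above can be checked in a few lines. This is just a sketch: the lookahead on its own yields an empty match, while the full pattern yields only the characters the cursor actually moved over.

```ruby
# Full pattern: the lookahead checks, then f, o, o are consumed.
m = /(?=foo)foo/.match('foo')
puts m[0]

# The lookahead alone succeeds without consuming anything.
la = /(?=foo)/.match('foo')
puts la[0].inspect
puts la.begin(0) == la.end(0)
```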

On your four assertions in particular:

  • (?=...) is "lookahead"; it asserts that ... does appear after the cursor.

    1.9.3p125 :002 > 'jump june'.gsub(/ju(?=m)/, 'slu')
     => "slump june" 
    

    The "ju" in "jump" matches because an "m" comes next. But the "ju" in "june" doesn't have an "m" next, so it's left alone.

    Since it doesn't move the cursor, you have to be careful when putting anything after it. (?=a)b will never match anything, because it checks that the next character is a, then also checks that the same character is b, which is impossible.

  • (?<=...) is "lookbehind"; it asserts that ... does appear before the cursor.

    1.9.3p125 :002 > 'four flour'.gsub(/(?<=f)our/, 'ive')
     => "five flour" 
    

    The "our" in "four" matches because there's an "f" immediately before it, but the "our" in "flour" has an "l" immediately before it so it doesn't match.

    Like above, you have to be careful with what you put before it. a(?<=b) will never match, because it checks that the next character is a, moves the cursor, then checks that the previous character was b.

  • (?!...) is "negative lookahead"; it asserts that ... does not appear after the cursor.

    1.9.3p125 :003 > 'child children'.gsub(/child(?!ren)/, 'kid')
     => "kid children"
    

    "child" matches, because what comes next is a space, not "ren". "children" doesn't.

    This is probably the one I get the most use out of; finely controlling what can't come next comes in handy.

  • (?<!...) is "negative lookbehind"; it asserts that ... does not appear before the cursor.

    1.9.3p125 :004 > 'foot root'.gsub(/(?<!r)oot/, 'eet')
     => "feet root" 
    

    The "oot" in "foot" is fine, since there's no "r" before it. The "oot" in "root" clearly has an "r".

    As an additional restriction, most regex engines require that ... has a fixed length inside a lookbehind (positive or negative). So you generally can't use ?, +, *, or {n,m} there.
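The two "will never match" cases mentioned in the list can be tried directly. A sketch with a made-up sample string; the last two patterns are my guess at what someone writing the impossible versions probably meant.

```ruby
# (?=a)b asks the same character to be both a and b: impossible.
never_ahead  = "ab ba bb aa" =~ /(?=a)b/
# a(?<=b) consumes an a, then requires that same character to be b.
never_behind = "ab ba bb aa" =~ /a(?<=b)/
puts never_ahead.inspect
puts never_behind.inspect

# Moving the assertion to the other side of the literal makes it satisfiable.
b_before_a = "ab ba".sub(/b(?=a)/, 'B')    # a b that is followed by a
a_after_b  = "ab ba".sub(/(?<=b)a/, 'A')   # an a that is preceded by b
puts b_before_a
puts a_after_b
```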

You can also nest these and otherwise do all kinds of crazy things. I use them mainly for one-offs I know I'll never have to maintain, so I don't have any great examples of real-world applications handy; honestly, they're weird enough that you should try to do what you want some other way first. :)


Afterthought: The syntax comes from Perl regular expressions, which used (? followed by various symbols for a lot of extended syntax because a ? directly after ( was otherwise invalid. So <= doesn't mean anything by itself; (?<= is one entire token, meaning "this is the start of a lookbehind". It's like how += and ++ are separate operators, even though they both start with +.

They're easy to remember, though: = indicates looking forwards (or, really, "here"), < indicates looking backwards, and ! has its traditional meaning of "not".


Regarding your later examples:

irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"

Yes, these produce the same output. This is that tricky bit with using lookahead:

  1. The regex engine has tried some things, but they haven't worked, and now it's at fores|ight.
  2. It checks (?!s). Is the character after the cursor s? No, it's i! So that part matches and the matching continues, but the cursor doesn't move, and we still have fores|ight.
  3. It checks ight. Does ight come after the cursor? Well, yes, it does, so move the cursor: foresight|.
  4. We're done!

The cursor moved over the substring ight, so that's the full match, and that's what gets replaced.
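You can see where the cursor started and what it moved over by inspecting the match. A quick sketch:

```ruby
m = "foresight".match(/(?!s)ight/)
puts m.pre_match   # text before the match
puts m[0]          # text the cursor moved over
puts m.begin(0)    # offset where the match starts

# The plain pattern covers exactly the same span.
plain = "foresight".match(/ight/)
same_span = plain[0] == m[0] && plain.begin(0) == m.begin(0)
puts same_span
```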

Doing (?!a)b is useless, since you're saying: the next character must not be a, and it must be b. But that's the same as just matching b!

This can be useful sometimes, but you need a more complex pattern: for example, (?!3)\d will match any digit that isn't a 3.
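For instance, here is a sketch of (?!3)\d on a made-up string of digits. In this simple case a character class does the same job, which is part of why the lookahead form only pays off with more complex patterns.

```ruby
digits = "1234353"

# Every digit that is not a 3.
not_three = digits.scan(/(?!3)\d/)
puts not_three.inspect

# Replace them, leaving the 3s alone.
masked = digits.gsub(/(?!3)\d/, 'x')
puts masked

# Equivalent character class for comparison.
same_as_class = digits.scan(/[0-24-9]/) == not_three
puts same_as_class
```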

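If you want to constrain what comes before ight, use a lookbehind instead:

1.9.3p125 :001 > "foresight".sub(/(?<!s)ight/, 'ee')
 => "foresight" 

This asserts that s doesn't come before ight. In "foresight" it does, so nothing is replaced.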




Answer 2:


Zero-width assertions are difficult to understand until you realize that regex matches positions as well as characters.

When you see the string "foo" you naturally read three characters. But there are also four positions, marked here by pipes: "|f|o|o|". A lookahead or lookbehind (collectively, lookarounds) matches a position where the characters after or before it match the expression.

The difference between a zero-width expression and other expressions is that the zero-width expression only matches (or "consumes") the position. So, for example:

/(app)apple/

will fail to match "apple" because it's trying to match "app" twice. But

/(?=app)apple/

will succeed because the lookahead is only matching the position where "app" follows. It doesn't actually match the "app" characters, allowing the next expression to consume them.
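A small sketch of the difference, using "appapple" as an extra made-up input to show what the consuming version actually needs:

```ruby
consuming = /(app)apple/     # the group consumes "app", then "apple" must follow
asserting = /(?=app)apple/   # the lookahead only checks the position

no_match   = consuming.match("apple")
needs_both = consuming.match("appapple")
checked    = asserting.match("apple")

puts no_match.inspect
puts needs_both[0]
puts checked[0]
```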

LOOKAROUND DESCRIPTIONS

Positive Lookahead: (?=s)

Imagine you are a drill sergeant and you are performing an inspection. You begin at the front of the line with the intention of walking past each private and ensuring they meet expectations. But, before doing so, you look ahead one by one to make sure they have lined up in the proper order. The privates' names are "A", "B", "C", "D" and "E". /(?=ABCDE)...../.match('ABCDE'). Yep, they are all present and accounted for.

Negative Lookahead: (?!s)

You perform the inspection down the line and are finally standing just past private E, the last in line. Now you are going to look ahead to make sure that "F" from the other company has not, yet again, accidentally slipped into the wrong formation. /.....(?!F)/.match('ABCDE'). Nope, he hasn't slipped in this time, so all is well.

Positive Lookbehind: (?<=s)

After completing the inspection, you are at the end of the formation. You turn and scan back to make sure no one has snuck away. /.....(?<=ABCDE)/.match('ABCDE'). Yep, everyone is present and accounted for.

Negative Lookbehind: (?<!s)

Finally, you take one last look to make sure that privates A and B have not, once again, switched places (because they like KP). /.....(?<!BACDE)/.match('ABCDE'). Nope, they haven't, so all is well.
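The four inspections can be run together. This is only a sketch: the hash keys are labels I made up, and the 'BACDE' formation at the end is an extra input to show the negative lookbehind actually catching the swap.

```ruby
formation = 'ABCDE'

checks = {
  positive_lookahead:  /(?=ABCDE)...../,
  negative_lookahead:  /.....(?!F)/,
  positive_lookbehind: /.....(?<=ABCDE)/,
  negative_lookbehind: /.....(?<!BACDE)/
}

# Every inspection passes for the correct formation.
results = checks.transform_values { |re| re.match?(formation) }
results.each { |name, ok| puts "#{name}: #{ok}" }

# If A and B have switched places, the last inspection fails.
swapped_passes = checks[:negative_lookbehind].match?('BACDE')
puts swapped_passes
```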




Answer 3:


A zero-width assertion is an expression that consumes zero characters while matching. For instance, in this example,

"foresight".sub(/sight/, 'ee')

what is matched is

foresight
    ^^^^^

and thus the result would be

foreee

However, in this example,

"foresight".sub(/(?<=s)ight/, 'ee')

what is matched is

foresight
     ^^^^

and therefore the result would be

foresee
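The caret diagrams above can be reproduced from the match offsets. In this sketch, underline is a hypothetical helper written just for the illustration, not anything built into Ruby.

```ruby
# Hypothetical helper: print the string with carets under the matched span.
def underline(str, re)
  m = str.match(re)
  return str unless m
  "#{str}\n#{' ' * m.begin(0)}#{'^' * m[0].length}"
end

consumed = underline("foresight", /sight/)        # the s is part of the match
asserted = underline("foresight", /(?<=s)ight/)   # the s is only checked

puts consumed
puts asserted
```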

Another example of a zero-width assertion is the word-boundary character, \b. For example, to match a complete word, you might try surrounding the word with spaces, e.g.

"flight light plight".sub(/\slight\s/, 'dark')

to get

flightdarkplight

But see how matching the spaces removes them during substitution? Using a word boundary gets around this problem:

"flight light plight".sub(/\blight\b/, 'dark')

The \b matches the beginning or end of a word, but does not actually match a character: it's zero-width.
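Running the two versions side by side shows the difference. A quick sketch using the same sentence:

```ruby
sentence = "flight light plight"

# \s consumes the surrounding spaces, so they are replaced too.
with_spaces = sentence.sub(/\slight\s/, 'dark')

# \b only checks the positions, so the spaces survive.
with_boundary = sentence.sub(/\blight\b/, 'dark')

puts with_spaces
puts with_boundary
```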

Maybe the most succinct answer to your question is this: lookahead and lookbehind assertions are one kind of zero-width assertion. Every lookaround is zero-width, but not every zero-width assertion is a lookaround (\b, ^, and $ are zero-width too).


Here are explanations of your examples:

irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

Above, you're saying, "Match where the next character is not an s, and then an i." This is always true for an i, since an i is never an s, so the substitution succeeds.

irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"

Above, you're saying, "Match where the next character is an s, and then an i." This is never true, since an i is never an s, so the substitution fails.

irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"

This one was already explained above. (This is the correct one.)

irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"

Above, should be clear by now. In this case, "firefight" would substitute to "firefee", but not "foresight" to "foresee".
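To compare all four side by side, here is a sketch that runs each pattern against both "foresight" and "firefight". The hash and its string labels are just scaffolding for the illustration.

```ruby
patterns = {
  "(?!s)ight"  => /(?!s)ight/,
  "(?=s)ight"  => /(?=s)ight/,
  "(?<=s)ight" => /(?<=s)ight/,
  "(?<!s)ight" => /(?<!s)ight/
}

# For each pattern, substitute in both words and collect the results.
table = patterns.transform_values do |re|
  %w[foresight firefight].map { |word| word.sub(re, 'ee') }
end

table.each { |src, out| puts "#{src.ljust(11)} #{out.join('  ')}" }
```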



Source: https://stackoverflow.com/questions/14387631/how-the-look-ahead-and-look-behind-concept-supports-such-zero-width-assertions-c
