Understanding regex pattern used to find string between strings in html

Posted by 那年仲夏 on 2019-12-12 00:27:28

Question


I have the following html file:

<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">

In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:

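import re

# Read the whole HTML file into one string
with open('source.html') as f:
    source_file = f.read()

# Lazily capture everything between 'videos/' and the first '/"'
result = re.compile('videos/(.*?)/"').search(source_file)
print(result.group(1))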

I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?


Answer 1:


The ? in this context is a modifier on the repetition operators (+, *, and ?). In engines that support it, it makes the repetition lazy (also called non-greedy or reluctant). By default repetition is greedy, which means that it matches as much as possible. So you have three types of repetition in most modern Perl-compatible engines:

.*  # Match any character zero or more times, as many as possible (greedy)
.*? # Match any character zero or more times, as few as possible (reluctant)
.*+ # Match any character zero or more times and never give any back (possessive)

More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).

Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is

match one or more a's followed by an a.

This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.

If we use the regex /(a+?)a/ this is

reluctantly match one or more a's followed by an a
or
match one or more a's until we reach another a

That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
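
The aaaa example can be checked directly with Python's re module. This is only a minimal sketch; the variable names are illustrative and not part of the original answer.

```python
import re

text = "aaaa"

# Greedy: a+ grabs all four a's, then gives one back so the final a can match.
greedy = re.search(r"(a+)a", text)

# Lazy: a+? takes the single a it must, then the final a matches right away.
lazy = re.search(r"(a+?)a", text)

print(greedy.group(0), greedy.group(1))  # aaaa aaa
print(lazy.group(0), lazy.group(1))      # aa a
```

group(0) is the whole match and group(1) is the first submatch, so the output lines up with the two cases described above.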

This comes up a lot when using regex to match within HTML tags, quotes and the like, usually in quick and dirty operations. That is to say, using regex to extract from very large and complex HTML strings, or from quoted strings with escape sequences, can cause a lot of problems, but it's perfectly fine for specific use cases. So in your case we have:

/Dev/videos/1610110089242029/

The expression needs to match videos/ followed by zero or more characters followed by /". If there were only one videos URL on the line, a greedy match would work just as well.

However we have

/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"

Without reluctance, the capturing group will contain:

1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029

It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.
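
To see the difference on the question's own markup, here is a small sketch that runs both patterns against the anchor tag from the question. The variable names are assumptions made for illustration.

```python
import re

# The anchor tag from the question, split across lines for readability.
html = ('<a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" '
        'aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" '
        'rel="theater">')

# Greedy: .* runs to the end of the string, then backs off to the LAST /".
greedy = re.search(r'videos/(.*)/"', html).group(1)

# Lazy: .*? grows one character at a time and stops at the FIRST /".
lazy = re.search(r'videos/(.*?)/"', html).group(1)

print(greedy)  # spans from the href ID through the ajaxify ID
print(lazy)    # 1610110089242029
```

For this particular job a narrower pattern such as videos/(\d+)/ would also work, since the ID consists only of digits, but that is a different technique from the lazy quantifier discussed here.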




Answer 2:


It can be explained in a simple way:

  • .: match anything (any character, except a newline by default),
  • *: any number of times (at least zero times),
  • ?: as few times as possible (hence non-greedy).

videos/(.*?)/"

as a regular expression matches (for example)

videos/1610110089242029/"

and the first capturing group returns 1610110089242029, because every digit counts as “any character” and the group may contain zero or more of them.

The ? causes something like this:

videos/1610110089242029/" something else … "videos/2387423470237509/"

to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.
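
The two-URL case can be reproduced with re.findall, which returns the captured group for every match. The input string below is a hypothetical stand-in modelled on the example above, not real page content.

```python
import re

# Hypothetical input with two distinct video IDs.
text = 'videos/1610110089242029/" something else "videos/2387423470237509/"'

# Lazy: each match ends at the nearest /", so both IDs come out separately.
lazy_ids = re.findall(r'videos/(.*?)/"', text)

# Greedy: a single match that swallows everything up to the last /".
greedy_ids = re.findall(r'videos/(.*)/"', text)

print(lazy_ids)    # ['1610110089242029', '2387423470237509']
print(greedy_ids)  # one long string covering both URLs
```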




Answer 3:


The . means any character (by default, any character except a newline). The * means any number of times, including zero. The ? does indeed make the match non-greedy: .*? tries to capture as few characters as possible. When the regex encounters a /, the . could match it, but because the quantifier is non-greedy the engine first checks whether the rest of the pattern, /", matches at that position, and if it does, the . doesn't have to. Without the ?, the .* would eat up everything to the end of the line (or the rest of the file, if . is allowed to match newlines) and then back off only as far as the last /", so the group would capture far more than you want.
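
How far a greedy .* actually reaches in Python's re module depends on whether . may match a newline. The sketch below uses a hypothetical two-line input with made-up IDs to show both cases.

```python
import re

# Hypothetical two-line input; the IDs 111 and 222 are placeholders.
text = 'href="/Dev/videos/111/" rel="a"\nhref="/Dev/videos/222/" rel="b"'

# Default: . does not match a newline, so greedy .* stays on the first line.
same_line = re.search(r'videos/(.*)/"', text).group(1)

# re.DOTALL lets . match newlines, so greedy .* reaches the last /" in the text.
whole_text = re.search(r'videos/(.*)/"', text, re.DOTALL).group(1)

print(same_line)         # 111
print(repr(whole_text))  # runs across the newline up to the second ID
```

In both cases the greedy match terminates; it simply ends at the last possible /" rather than the first.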



Source: https://stackoverflow.com/questions/32491947/understanding-regex-pattern-used-to-find-string-between-strings-in-html
