Question
In Python, how do you capture a group within a non-capturing group? Put another way, how do you repeat a non-capturing sub-pattern that contains a capturing group?
An example of this would be capturing all of the package names in an import statement. E.g. the string:
import pandas, os, sys
should return 'pandas', 'os', and 'sys'. The following pattern captures the first package and matches up to the start of the second package:
import\s+([a-zA-Z0=9]*),*\s*
From here, I would like to repeat the sub-pattern that captures the group and matches the following characters, i.e. ([a-zA-Z0=9]*),*\s*. When I surround this sub-pattern with a non-capturing group and repeat it:
import\s+(?:([a-zA-Z0=9]*),*\s*)*
It no longer captures the group inside.
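The behaviour can be reproduced with a minimal sketch using the standard re module. Two assumptions differ from the pattern above: the 0=9 typo is written as 0-9, and + replaces * inside the group so that no iteration can match the empty string. Under those assumptions the whole statement is matched, but group 1 holds only the last package name.

```python
import re

# Assumptions (not in the original pattern): "0=9" is corrected to "0-9",
# and "+" replaces "*" so an iteration can never match the empty string.
pattern = re.compile(r'import\s+(?:([a-zA-Z0-9]+),*\s*)*')

m = pattern.match('import pandas, os, sys')

print(m.group(0))   # the full statement is matched
print(m.group(1))   # only the final iteration of group 1 is kept
print(m.groups())
```

The earlier iterations ('pandas' and 'os') are matched but are not retrievable from the match object.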
Answer 1:
Your question is phrased strictly about regex, but if you're willing to use a recursive descent parser (e.g., pyparsing), many things that require expertise in regex become very simple.
For example, what you're asking for becomes:
>>> from pyparsing import *
>>> p = Suppress(Literal('import')) + commaSeparatedList
>>> p.parseString('import pandas, os, sys').asList()
['pandas', 'os', 'sys']
>>> p.parseString('import pandas, os').asList()
['pandas', 'os']
It might be a matter of personal taste, but to me,
Suppress(Literal('import')) + commaSeparatedList
is also more intuitive than a regex.
Answer 2:
A repeated capturing group only captures its last iteration. This is why you need to restructure your regex to work with re.findall:
\s*
(?:
(?:^from\s+
( # Base (from (base) import ...)
(?:[a-zA-Z_][a-zA-Z_0-9]* # Variable name
(?:\.[a-zA-Z_][a-zA-Z_0-9]*)* # Attribute (.attr)
)
)\s+import\s+
)
|
(?:^import\s|,)\s*
)
( # Name of imported module (import (this))
(?:[a-zA-Z_][a-zA-Z_0-9]* # Variable name
(?:\.[a-zA-Z_][a-zA-Z_0-9]*)* # Attribute (.attr)
)
)
(?:
\s+as\s+
( # Variable module is imported into (import foo as bar)
(?:[a-zA-Z_][a-zA-Z_0-9]* # Variable name
(?:\.[a-zA-Z_][a-zA-Z_0-9]*)* # Attribute (.attr)
)
)
)?
\s*
(?=,|$) # Ensure there is another thing being imported or it is the end of string
Try it on regex101.com
In each tuple returned by re.findall, index 0 is the base, index 1 is the name of the imported module (what you're after), and index 2 is the name the module is bound to, as in from (index 0) import (index 1) as (index 2). These correspond to capture groups 1, 2, and 3 of the pattern.
import re
regex = r"\s*(?:(?:^from\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))\s+import\s+)|(?:^import\s|,)\s*)((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))(?:\s+as\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*)))?\s*(?=,|$)"
print(re.findall(regex, "import pandas, os, sys"))
[('', 'pandas', ''), ('', 'os', ''), ('', 'sys', '')]
You can remove the other two capturing groups if you don't care for them.
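The commented multi-line layout of the pattern only works if it is compiled with re.VERBOSE, which makes Python ignore whitespace and # comments inside the pattern. The sketch below is one way to do that. NAME and IMPORT_RE are names introduced here for illustration, and the second test string is an assumed example that is not part of the original answer.

```python
import re

# Helper introduced for brevity (not in the original answer): a dotted name
# such as "os" or "os.path".
NAME = r'[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*'

# re.VERBOSE lets the pattern span several lines and carry comments.
IMPORT_RE = re.compile(rf'''
    \s*
    (?:
        (?:^from\s+ ({NAME}) \s+import\s+)   # base: from (base) import ...
      |
        (?:^import\s|,)\s*                   # plain import, or a later item
    )
    ({NAME})                                 # name of the imported module
    (?: \s+as\s+ ({NAME}) )?                 # optional alias: ... as (alias)
    \s*
    (?=,|$)                                  # another item follows, or end of string
''', re.VERBOSE)

plain = IMPORT_RE.findall('import pandas, os, sys')
mixed = IMPORT_RE.findall('from os import path as p, sep')

print(plain)   # [('', 'pandas', ''), ('', 'os', ''), ('', 'sys', '')]
print(mixed)   # [('os', 'path', 'p'), ('', 'sep', '')]

# Keep only the module names (index 1 of each tuple).
print([t[1] for t in plain])   # ['pandas', 'os', 'sys']
```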
Answer 3:
You can use your import\s+(?:([a-zA-Z0-9=]+),*\s*)* regex (I just fixed the 0-9 range to match any digit and moved = to the end of the character class) and access the Group 1 capture stack using the PyPI regex module:
>>> import regex
>>> s = 'import pandas, os, sys'
>>> rx = regex.compile(r'^import\s+(?:([a-zA-Z0-9=]+),*\s*)*$')
>>> print([x.captures(1) for x in rx.finditer(s)])
[['pandas', 'os', 'sys']]
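If installing the third-party regex module is not an option, a similar result can be had from the standard re module with a two-step approach. This is a different technique from the capture stack used above, shown only as a sketch: the first pattern validates the statement and captures the whole comma-separated list, and the second extracts each name from it. IMPORT_RE, NAME_RE and imported_names are hypothetical names, and allowing _ and . in names is an assumption.

```python
import re

# Step 1: match the whole statement and capture the comma-separated tail.
# Assumption: names may contain letters, digits, "_" and "." (e.g. os.path).
IMPORT_RE = re.compile(
    r'^import\s+([a-zA-Z0-9_.]+(?:\s*,\s*[a-zA-Z0-9_.]+)*)\s*$'
)
# Step 2: pull each individual name out of the captured tail.
NAME_RE = re.compile(r'[a-zA-Z0-9_.]+')


def imported_names(statement):
    """Return the names in a plain 'import a, b, c' statement, else []."""
    m = IMPORT_RE.match(statement)
    if m is None:
        return []
    return NAME_RE.findall(m.group(1))


print(imported_names('import pandas, os, sys'))   # ['pandas', 'os', 'sys']
print(imported_names('from os import path'))      # []
```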
Source: https://stackoverflow.com/questions/39416911/regex-capturing-group-within-non-capturing-group