Question
I am searching for the sequence:
nunca[ADV+NEG+CIRC] más[ADV+comp+CIRC] compraré[V+H_PREDICAT_ACTION]
and
nunca más compraré
My script:
import re

corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"
part1 = re.findall(r"(\w+)\[ADV\+NEG.*?\]", corpus)
part2 = re.findall(r"(\w+)\[ADV+comp+PADV.*?\]", corpus)
part3 = re.findall(r"(\w+)\[V\+H_PREDICAT.*?\]", corpus)
print(part1 + part2 + part3)
Result:
[]
Answer 1:
If the searched substrings can appear in arbitrary order, use the following re.findall() approach:
import re

corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"
result = ' '.join(i[0] for i in re.findall(r'(\w+)\[[^][]*(AD|V)\+[^][]*\]', corpus, re.M | re.UNICODE))
print(result)
The output:
nunca más compraré
Regex pattern explanation:
(\w+) - matches a word (an alphanumeric sequence), e.g. nunca; it is placed into the first capturing group
\[ - matches an opening square bracket [ literally
[^][]* - matches zero or more characters other than the square brackets ] and [
(AD|V) - alternation group, matches either AD or V
\+ - matches a plus sign + literally
\] - matches a closing square bracket ] literally
For example, \[[^][]*(AD|V)\+[^][]*\] will match [ADV+NEG+CIRC]
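The pieces explained above can also be laid out one per line with re.VERBOSE, which may make the pattern easier to check. This is only a minimal sketch, assuming Python 3 (where \w already matches accented letters); the name TAGGED_WORD and the shortened corpus are illustrative and not part of the original answer.

```python
import re

# Illustrative name; same pattern as in the answer, written with re.VERBOSE.
TAGGED_WORD = re.compile(r"""
    (\w+)      # group 1: the word itself
    \[         # literal opening bracket
    [^][]*     # any characters except square brackets (possibly none)
    (AD|V)     # group 2: either AD or V
    \+         # literal plus sign
    [^][]*     # rest of the tag, again without brackets
    \]         # literal closing bracket
""", re.VERBOSE)

# Shortened corpus, for illustration only.
corpus = ("esta[Unknown] nunca[ADV+NEG+CIRC] "
          "más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] "
          "compraré[V+H_PREDICAT_ACTION]")

# findall returns one (word, AD-or-V) tuple per match; keep only the word.
words = [word for word, _ in TAGGED_WORD.findall(corpus)]
print(' '.join(words))
```

Words tagged [Unknown] are skipped because their tag contains neither AD+ nor V+.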
----------
If the order of the sequences is strict, use the re.sub() function instead of re.findall() to remove all bracketed sequences:
import re

corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"
# No flags are passed: the fourth positional argument of re.sub() is count,
# not flags, and this pattern does not need any flags.
result = re.sub(r'\[[^][]+\]', '', corpus)
print(result)
The output:
Me temo que buscare otras opciones esta nunca más compraré
To extract the last 3 words:
print(' '.join(result.split()[-3:])) # nunca más compraré
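If the goal is to find the three tags only when they occur consecutively and in that order, a single pattern can express this directly instead of relying on the position of the words. The following is a sketch under that assumption and is not part of the original answer: the helper name find_sequence is hypothetical, and the tag prefixes are taken from the question with every + escaped, since an unescaped + acts as a quantifier rather than matching a plus sign.

```python
import re

def find_sequence(corpus):
    """Return the words tagged ADV+NEG, ADV+comp, V+H_PREDICAT when they are consecutive."""
    pattern = re.compile(
        r"(\w+)\[ADV\+NEG[^][]*\]\s+"      # e.g. nunca[ADV+NEG+CIRC]
        r"(\w+)\[ADV\+comp[^][]*\]\s+"     # e.g. más[ADV+comp+...]
        r"(\w+)\[V\+H_PREDICAT[^][]*\]"    # e.g. compraré[V+H_PREDICAT_ACTION]
    )
    # findall yields one 3-tuple of words per matched sequence.
    return [' '.join(groups) for groups in pattern.findall(corpus)]

corpus = ("Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] "
          "otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] "
          "más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]")

print(find_sequence(corpus))  # ['nunca más compraré']
```

Unlike taking the last three words, this returns an empty list when the tagged sequence is absent.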
Source: https://stackoverflow.com/questions/46171355/sequence-words-with-regex