Sequence words with regex

Submitted by 时间秒杀一切 on 2019-12-02 06:50:29

Question


I am searching for the sequence:

nunca[ADV+NEG+CIRC] más[ADV+comp+CIRC] compraré[V+H_PREDICAT_ACTION]

and

nunca más compraré

My script:

import re

corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"

part1 = re.findall(r"(\w+)\[ADV\+NEG.*?\]", corpus)
part2 = re.findall(r"(\w+)\[ADV+comp+PADV.*?\]", corpus)
part3 = re.findall(r"(\w+)\[V\+H_PREDICAT.*?\]", corpus)
print(part1 + part2 + part3)

Result:

[]


Answer 1:


If the searched substrings may appear in arbitrary order, use the following re.findall() approach:

import re

corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"

result = ' '.join(i[0] for i in re.findall(r'(\w+)\[[^][]*(AD|V)\+[^][]*\]', corpus, re.M | re.UNICODE))
print(result)

The output:

nunca más compraré
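The `i[0]` in the expression above is needed because the pattern has two capturing groups, so `re.findall()` returns tuples rather than plain strings. As a small sketch (assuming Python 3 and a shortened sample string that is not part of the original corpus), a non-capturing group `(?:...)` makes `re.findall()` return the words directly:

```python
import re

# Shortened sample, used here for illustration only
sample = "esta[Unknown] nunca[ADV+NEG+CIRC] compraré[V+H_PREDICAT_ACTION]"

# Two capturing groups: findall yields (word, key) tuples
pairs = re.findall(r'(\w+)\[[^][]*(AD|V)\+[^][]*\]', sample)
print(pairs)   # [('nunca', 'V'), ('compraré', 'V')]

# Non-capturing (?:...) leaves a single group, so findall yields plain strings
words = re.findall(r'(\w+)\[[^][]*(?:AD|V)\+[^][]*\]', sample)
print(words)   # ['nunca', 'compraré']
```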

Regex pattern explanation:

  • (\w+) - match a word (alphanumeric sequence), e.g. nunca; it is captured into the first group (...)

  • \[ - match the opening square bracket [ literally

  • [^][]* - match zero or more characters other than the square brackets ] and [

  • (AD|V) - alternation group, match either the AD or the V key

  • \+ - match a plus sign + literally (unescaped, + would act as a quantifier)

  • \] - match the closing square bracket ] literally

For example, \[[^][]*(AD|V)\+[^][]*\] will match [ADV+NEG+CIRC]
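For comparison, the second pattern in the question's script leaves the `+` signs in `ADV+comp+PADV` unescaped, so each `+` is read as a quantifier and the pattern cannot match the literal tag text. Below is a minimal sketch of a per-tag variant, assuming Python 3 and a shortened corpus. The helper name `words_with_tag_prefix` is hypothetical, and `re.escape()` is used to quote the tag prefix:

```python
import re

corpus = ("esta[Unknown] nunca[ADV+NEG+CIRC] "
          "más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] "
          "compraré[V+H_PREDICAT_ACTION]")

def words_with_tag_prefix(prefix, text):
    """Return the words whose bracketed tag starts with the literal prefix."""
    # re.escape quotes '+' so it matches literally instead of acting as a quantifier
    pattern = r"(\w+)\[" + re.escape(prefix) + r"[^][]*\]"
    return re.findall(pattern, text)

# The unescaped pattern from the question finds nothing
broken = re.findall(r"(\w+)\[ADV+comp+PADV.*?\]", corpus)

part1 = words_with_tag_prefix("ADV+NEG", corpus)
part2 = words_with_tag_prefix("ADV+comp+PADV", corpus)
part3 = words_with_tag_prefix("V+H_PREDICAT", corpus)
print(part1 + part2 + part3)   # ['nunca', 'más', 'compraré']
```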

----------

If the order of the sequences is strict, use re.sub() instead of re.findall() to remove all bracketed tags:

corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"

# flags must be passed by keyword: the fourth positional argument of re.sub() is count
result = re.sub(r'\[[^][]+\]', '', corpus, flags=re.M | re.UNICODE)
print(result)

The output:

Me temo que buscare otras opciones esta nunca más compraré

To extract the last 3 words:

print(' '.join(result.split()[-3:]))    # nunca más compraré
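If the tagged sequence is not guaranteed to be the last three words, one possible alternative is a single pattern that requires the three tagged words to appear consecutively and in order. This is only a sketch under stated assumptions: Python 3, words separated by whitespace, and tag prefixes taken from the question. The helper `tagged` and the name `sequence` are hypothetical:

```python
import re

corpus = ("Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] "
          "otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] "
          "más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]")

def tagged(prefix):
    """Sub-pattern: a captured word followed by a tag starting with the literal prefix."""
    return r"(\w+)\[" + re.escape(prefix) + r"[^][]*\]"

# The three tagged words must be adjacent, separated only by whitespace
sequence = re.compile(
    r"\s+".join(tagged(p) for p in ("ADV+NEG", "ADV+comp", "V+H_PREDICAT"))
)

match = sequence.search(corpus)
if match:
    print(" ".join(match.groups()))   # nunca más compraré
```

Unlike slicing with `[-3:]`, this does not depend on where the sequence sits in the corpus, at the cost of hard-coding the tag prefixes.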


Source: https://stackoverflow.com/questions/46171355/sequence-words-with-regex
