Sequence words with regex

情歌与酒 2021-01-25 17:11

I am searching for the sequence:

nunca[ADV+NEG+CIRC] más[ADV+comp+CIRC] compraré[V+H_PREDICAT_ACTION]

and

nunca más co

1 answer
  • 2021-01-25 17:54

    If the searched substrings can appear in arbitrary order, use the re.findall() approach:

    import re
    
    corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
    otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
    más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"
    
    result = ' '.join(i[0] for i in re.findall(r'(\w+)\[[^][]*(AD|V)\+[^][]*\]', corpus, re.M | re.UNICODE))
    print(result)
    

    The output:

    nunca más compraré
    

    Regex pattern explanation:

    • (\w+) - matches a word (an alphanumeric sequence, e.g. nunca) and stores it in the first capturing group (...)

    • \[ - matches the opening square bracket [ literally

    • [^][]* - matches zero or more characters other than the square brackets ] and [

    • (AD|V) - alternation group; matches either AD or V

    • \+ - matches the plus sign + literally

    • \] - matches the closing square bracket ] literally

    For example, \[[^][]*(AD|V)\+[^][]*\] matches [ADV+NEG+CIRC].
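
    The i[0] in the snippet above is needed because the pattern has two capturing groups, so re.findall() returns one tuple per match rather than a plain string. A minimal sketch of that behaviour; the shortened sample string and the names sample, pairs and words are introduced here for illustration only:

```python
import re

# Hypothetical shortened corpus, same word[TAG] format as in the answer
sample = "esta[Unknown] nunca[ADV+NEG+CIRC] compraré[V+H_PREDICAT_ACTION]"

# Two capturing groups, so findall returns (word, key) tuples
pairs = re.findall(r'(\w+)\[[^][]*(AD|V)\+[^][]*\]', sample)
print(pairs)

# Keep only the first group of each tuple (the word itself)
words = [word for word, _key in pairs]
print(' '.join(words))
```

    Words tagged [Unknown] are skipped because their tag contains neither AD+ nor V+.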

    ----------

    If the order of the sequences is strict, use re.sub() instead of re.findall() to remove all bracketed tags:

    import re
    
    corpus = "Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] \
    otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] \
    más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]"
    
    # flags must be passed by keyword: the 4th positional argument of re.sub() is count
    result = re.sub(r'\[[^][]+\]', '', corpus, flags=re.M | re.UNICODE)
    print(result)
    

    The output:

    Me temo que buscare otras opciones esta nunca más compraré
    

    To extract the last 3 words:

    print(' '.join(result.split()[-3:]))    # nunca más compraré
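
    If the three words must be adjacent and carry specific tags, another option is to match the tagged sequence directly instead of stripping the tags first. This is only a sketch under the assumption that the wanted tags start with ADV+, ADV+ and V+ respectively; the helper tagged() and the names pattern, match and sequence are hypothetical and not part of the original answer:

```python
import re

corpus = ("Me[Unknown] temo[Unknown] que[Unknown] buscare[Unknown] "
          "otras[Unknown] opciones[Unknown] esta[Unknown] nunca[ADV+NEG+CIRC] "
          "más[ADV+comp+PADV+H_CIRCONSTANT_QUANTITE] compraré[V+H_PREDICAT_ACTION]")

def tagged(prefix):
    # Hypothetical helper: a captured word followed by a [tag] starting with prefix+
    return r'(\w+)\[' + re.escape(prefix) + r'\+[^][]*\]'

# ADV word, ADV word, V word, separated by whitespace, in exactly that order
pattern = re.compile(tagged('ADV') + r'\s+' + tagged('ADV') + r'\s+' + tagged('V'))

match = pattern.search(corpus)
sequence = ' '.join(match.groups()) if match else ''
print(sequence)
```

    Unlike taking the last three words, this does not depend on where the sequence sits in the corpus.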
    