Python regular expression using the OR operator

社会主义新天地 submitted on 2019-12-11 11:13:12

Question


I am trying to parse a large sample of text files with regular expressions (RE). From each file I want to extract the part of the text that contains 'vu' and ends with a newline '\n'.

Patterns differ from one file to another, so I tried to search for a combination of REs using the OR operator. However, I have not found a way to make re.findall() look for several REs at once.

Here is an example of how I tried to tackle this issue, but apparently I still cannot get re.findall() to evaluate both of my regular expressions together with the OR operator:

import re

def series2string(myserie) :
    myserie2 = ' or '.join(serie for serie in myserie)
    return myserie2

def expression(pattern, mystring) : 
    x = re.findall(pattern, mystring)
    if len(x)>0:
        return 1
    else:
        return 0

#text example
text = "\n\n    (troisième chambre)\n    i - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

#expressions to look out
pattern1 = '^\s*vu.*\n'
pattern2 = '^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'

pattern = [pattern1, pattern2]
pattern = series2string(pattern)

expression(pattern, text)

Note: I worked around this problem by looking for each pattern in a for loop, but my code would run faster if I could call re.findall() just once.


Answer 1:


Python regular expressions use the | operator for alternation.

def series2string(myserie) :
    myserie2 = '|'.join(serie for serie in myserie)
    myserie2 = '(' + myserie2 + ')'
    return myserie2

More information: https://docs.python.org/3/library/re.html
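
As a minimal sketch of the alternation idea above: the helper name combine_patterns and the sample string are assumptions made for illustration, not part of the original code. Each alternative is wrapped in a non-capturing group (?:...) so that re.findall() returns whole matches rather than the contents of a capturing group.

```python
import re

def combine_patterns(patterns):
    # Hypothetical helper: wrap each alternative in a non-capturing group
    # so "|" separates whole patterns and findall() returns full matches.
    return "|".join("(?:" + p + ")" for p in patterns)

# The two patterns from the question, written as raw strings.
patterns = [r"^\s*vu.*\n", r"^\s*\(\w*\s*\w*\)\s*.*?vu.*\n"]
combined = combine_patterns(patterns)

# Illustrative sample text (an assumption, shorter than the real files).
sample = "  vu la requête\n"
matches = re.findall(combined, sample)
print(matches)
```

With a plain capturing group around the whole alternation, as in series2string above, findall() returns the text of that group, which here is the same string; the non-capturing form simply avoids surprises if more groups are added later.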


The individual patterns look quite messy, so I can't tell what is a mistake and what is intentional. I am guessing you are looking for the word "vu" in a few different contexts.

  1. Always use Python raw strings for regular expressions, prefixed with r (r'pattern here'). This lets you use \ in a pattern without Python trying to interpret it as a string escape, so the backslash is passed unchanged to the regex engine.
  2. Use \s to match white-space (spaces and line-breaks).
  3. Since you already have several alternative patterns, don't make ( and ) optional. It can result in catastrophic backtracking, which can make matching large strings really slow.
    \(?  ->  \(
    \)?  ->  \)
  4. {1} doesn't do anything. It just repeats the previous sub-pattern once, which is the same as not specifying anything.
  5. \br is invalid in a non-raw string. Python interprets it as \b (the ASCII backspace character) followed by the letter r, not as a word boundary.
  6. You have a quote character (') at the beginning of your text-string. Either you intend ^ to match the start of any line, or the ' is a copy/paste error.
  7. There were some errors in how the patterns were combined:

    pattern = [pattern1, pattern2, pattern3, pattern4]
    pattern = series2string(pattern)
    
    expression(re.compile(pattern), text)
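
The raw-string points in the list above (items 1 and 5) can be checked directly. This is a small sketch; the word-boundary pattern and the test strings are assumptions chosen for illustration.

```python
import re

# In a normal string literal, "\b" is the backspace character (code point 8).
assert "\b" == "\x08"

# In a raw string the backslash survives, so the regex engine receives the
# two characters \ and b, which it reads as a word boundary.
assert r"\b" == "\\b"
assert len(r"\b") == 2

# Illustrative pattern: "vu" as a whole word.
word_vu = re.compile(r"\bvu\b")
found = bool(word_vu.search("i - vu la requête"))
not_found = bool(word_vu.search("revue"))
print(found, not_found)
```

Escapes such as '\s' happen to work without the r prefix because Python does not recognise them as string escapes, but recent Python versions emit a warning for them, which is one more reason to use raw strings throughout.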
    



Answer 2:


Thank you for your tips. My regular expressions were a little clumsy in my first post (I changed them hoping the question would be more understandable). I managed to capture the OR operator '|' thanks to 're.compile' and the code works fine!

import re

def series2string(myserie) :
    myserie2 = '|'.join(serie for serie in myserie)
    return myserie2

def expression(pattern, mystring) : 
    x = re.findall(pattern, mystring)
    if len(x)>0:
        return 1
    else:
        return 0

#text example
text = "\n\n    (troisième chambre)\n    i - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

#expressions to look for
pattern1 = r'^\s*vu.*\n'
pattern2 = r'^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'

pattern = [pattern1, pattern2]
pattern = series2string(pattern)

expression(re.compile(pattern), text)
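
A possible variation on the code above, offered as a sketch rather than as the poster's method: re.findall() also accepts a plain pattern string, so re.compile() is optional, and for a yes/no answer re.search() can stop at the first match. The helper name has_match, the shortened text, and the use of re.MULTILINE are assumptions; the flag only matters if the 'vu' line may begin somewhere other than the start of the string.

```python
import re

PATTERNS = [r"^\s*vu.*\n", r"^\s*\(\w*\s*\w*\)\s*.*?vu.*\n"]

# "|" has the lowest precedence, so joining the patterns directly is enough.
COMBINED = re.compile("|".join(PATTERNS))
# Assumption: with MULTILINE, "^" also matches at the start of every line.
COMBINED_ML = re.compile("|".join(PATTERNS), re.MULTILINE)

def has_match(compiled, text):
    # Hypothetical helper: search() returns on the first match instead of
    # collecting every match the way findall() does.
    return 1 if compiled.search(text) else 0

# Shortened version of the example text from the question.
text = "\n\n    (troisième chambre)\n    i - vu la requête ;\n"
result = has_match(COMBINED, text)

# A 'vu' line that does not start the string needs MULTILINE to be found.
later_line = "header\n  vu la requête\n"
without_flag = has_match(COMBINED, later_line)
with_flag = has_match(COMBINED_ML, later_line)
print(result, without_flag, with_flag)
```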


Source: https://stackoverflow.com/questions/32690450/python-regular-expression-using-the-or-operator
