Python: consecutive lines between matches similar to awk

青春惊慌失措 2021-01-15 06:30

Given:

  • A multiline string named string (already read from a file named file)
  • Two patterns, pattern1 and pattern2
3 Answers
  •  醉话见心
    2021-01-15 07:04

    In awk, the /start/,/end/ range pattern prints every line from the one where /start/ matches through the one where /end/ matches, inclusive. It is a useful construct, and similar range operators exist in sed, Perl, Ruby and others.

    To build a range operator in Python, write a class that remembers between calls whether a range is currently open: it switches on when the start pattern matches and off after the end pattern matches. We can use a regex (as awk does), or this can be trivially modified to use anything that returns a True or False status for a line of data.

    Given your example file, you can do:

    import re
    
    class FlipFlop:
        '''Imitate the behavior of the /start/,/end/ flip-flop in awk.'''
        def __init__(self, start_pattern, end_pattern):
            self.patterns = start_pattern, end_pattern
            self.state = False   # True while inside a range
        def __call__(self, st):
            ms = [e.search(st) for e in self.patterns]
            if all(ms):
                # Start and end both match this line: the range ends here.
                self.state = False
                return True
            rtr = self.state     # were we inside a range before this line?
            # self.state picks the pattern to test: 0 (start) outside a range, 1 (end) inside.
            if ms[self.state]:
                self.state = not self.state
            return self.state or rtr
    
    with open('/tmp/file') as f:
        ff = FlipFlop(re.compile('b bb'), re.compile('d dd'))
        print(''.join(line for line in f if ff(line)), end='')
    

    Prints:

    bbb bb b
    ccc cc c
    ffffd dd d
    

    That retains a line-by-line file read with the flexibility of the /start/,/end/ range seen in other languages. Of course, you can take the same approach with a multiline string (assumed to be named s):

    ''.join(line+"\n" if ff(line) else "" for line in s.splitlines())
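The remark above that the class can be modified to use "anything returning a True or False status" could look roughly like the sketch below. The class name PredicateFlipFlop, the two lambdas, and the sample text are all made up for illustration; they are not from the question.

```python
class PredicateFlipFlop:
    """Range operator driven by two callables instead of two regexes."""

    def __init__(self, is_start, is_end):
        self.is_start = is_start
        self.is_end = is_end
        self.active = False  # True while inside a range

    def __call__(self, line):
        if not self.active:
            if not self.is_start(line):
                return False
            # Range opens here; like awk, test the end condition on the same line.
            self.active = not self.is_end(line)
            return True
        # Inside a range: this line is included, and it may close the range.
        if self.is_end(line):
            self.active = False
        return True


# Hypothetical sample data, not the file from the question.
s = "aaa\nBEGIN one\nmiddle\nEND one\nzzz\nBEGIN END\nlast\n"
ff = PredicateFlipFlop(lambda l: l.startswith("BEGIN"), lambda l: "END" in l)
selected = [line for line in s.splitlines() if ff(line)]
print(selected)
```

Because the object keeps state between calls, create a fresh instance for each file or string you scan.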
    

    Idiomatically, in awk, you can get the same result as a flipflop using a flag:

    $ awk '/b bb/{flag=1} flag{print $0} /d dd/{flag=0}' file
    

    You can replicate that in Python as well (with more words):

    flag = False
    with open('file') as f:
        for line in f:
            if re.search(r'b bb', line):
                flag = True       # start pattern seen: begin printing
            if flag:
                print(line.rstrip())
            if re.search(r'd dd', line):
                flag = False      # end pattern seen: stop after this line
    

    The same loop also works on an in-memory string: iterate over s.splitlines() instead of the file.
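As a minimal sketch of the flag approach applied to a string, the loop can be wrapped in a generator. The function name lines_between and the sample text are assumptions for illustration only.

```python
import re

def lines_between(text, start_pattern, end_pattern):
    """Yield each line from a start match through the next end match, inclusive."""
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    flag = False
    for line in text.splitlines():
        if start_re.search(line):
            flag = True      # start pattern seen: begin yielding
        if flag:
            yield line
        if end_re.search(line):
            flag = False     # end pattern seen: stop after this line

# Hypothetical sample text, not the file from the question.
s = "aaa aa a\nbbb bb b\nccc cc c\nddd dd d\neee ee e\n"
result = list(lines_between(s, r"b bb", r"d dd"))
print("\n".join(result))
```

A generator keeps the flag local to each call, so repeated scans cannot leak state into one another.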

    Or, you can use a multi-line regex:

    with open('/tmp/file') as f:
        print(''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', f.read(), re.M)))
    


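    But that requires reading the entire file into memory. Note also that [\s\S]* is greedy, so the match runs from the first start line through the last end line rather than stopping at the first end line. Since you state the string has already been read into memory, the regex is probably the easiest option in this case: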

    ''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M))
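If the text may contain several separate ranges, one possible adjustment is a lazy quantifier ([\s\S]*?), which stops at the first end line after each start line. The sketch below compares the two on made-up sample text. Unlike awk, neither regex matches a range that is opened but never closed.

```python
import re

# Hypothetical sample text with two separate ranges.
s = "x\nb bb 1\nmid\nd dd 1\nskip\nb bb 2\nd dd 2\ny\n"

# Greedy: one match spanning the first start line through the last end line.
greedy = re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M)

# Lazy: one match per range, each ending at the nearest end line.
lazy = re.findall(r'^.*b bb[\s\S]*?d dd.*$', s, re.M)

print(len(greedy), len(lazy))
# Join with a newline: the matches do not include the newline that follows them.
print("\n".join(lazy))
```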
    
