How to grep lines between two patterns in a big file with python

问题

I have a very big file, like this:

[PATTERN1]
line1
line2
line3 
...
...
[END PATTERN]
[PATTERN2]
line1 
line2
...
...
[END PATTERN]

I need to extract in another file, lines between a variable starter pattern [PATTERN1] and another define pattern [END PATTERN], only for some specific starter pattern.
For example:

[PATTERN2]
line1 
line2
...
...
[END PATTERN]

I already do the same thing, with a smaller file, using this code:

FILE=open('myfile').readlines()

newfile=[]
for n in name_list:
    A = FILE[[s for s,name in enumerate(FILE) if n in name][0]:]
    B = A[:[e+1 for e,end in enumerate(A) if 'END PATTERN' in end][0]]
    newfile.append(B)

Where 'name_list' is a list with the specific starter patterns that I need.

It works!! but I suppose there is a better way to do this working with big files, without using the .readlines() command.
Anyone can help me?

thanks a lot!

回答1:

Use something like

import re

START_PATTERN = '^START-PATTERN$'
END_PATTERN = '^END-PATTERN$'

with open('myfile') as file:
    match = False
    newfile = None

    for line in file:
        if re.match(START_PATTERN, line):
            match = True
            newfile = open('my_new_file.txt', 'w')
            continue
        elif re.match(END_PATTERN, line):
            match = False
            newfile.close()
            continue
        elif match:
            newfile.write(line)
            newfile.write('\n')

This will iterate the file without reading it all into memory. It also writes directly to your new file, rather than appending to a list in memory. If your source is large enough that too may become an issue.

Obviously there are numerous modifications you may need to make to this code; perhaps a regex pattern is not required to match a start/end line, in which case replace it with something like if 'xyz' in line.

回答2:

Consider:

# hi
# there
# begin
# need
# this
# stuff
# end
# skip
# this

with open(__file__) as fp:
    for line in iter(fp.readline, '# begin\n'):
        pass
    for line in iter(fp.readline, '# end\n'):
        print line

prints "need this stuff"

More flexible (e.g. to allow re pattern matching) is to use itertools drop- and takewhile:

with open(__file__) as fp:
    result = list(itertools.takewhile(lambda x: 'end' not in x, 
        itertools.dropwhile(lambda x: 'begin' not in x, fp)))

回答3:

I think this does the same thing your code does:

FILE=open('myfile').readlines()

newfile=[]

pattern = None
for line in FILE:
    if line[0] == "[" and line[-1] == "]":
        pattern = line[1:-1]
        if pattern == "END PATTERN":
            pattern = None
        continue
    elif pattern is not None and pattern in name_list:
        newfile.append(line)

This way you go through all the lines only once, and fill your list as you go.

回答4:

I am kind of a new python programmer so I only barely understand your solution, but it seems like there is a lot of unnecessary iteration going on. First you read in the file, then you iterate through the file once for each item in name_list. Also, I don't know if you plan to iterate over newfile later to actually write it to a file.

Here is how I would do it, though I realize it isn't the most pythonic looking solution. You'll only iterate over the file once though. (As a disclaimer, I didn't test this out.)

patterns = {'startPattern1':"endPattern1", 'startPattern2':"endPattern2", 'startPattern3':"endPattern3"}

fileIn = open(filenameIn, 'r')
fileOut = open(filenameOut, 'w')
targetEndPattern = None

for line in fileIn:
   if targetEndPattern is not None:
       if line == targetEndPattern:
           targetEndPattern = None
       else:
           fileOut.write(line + "\n")
   elif line in patterns:
       targetEndPattern = patterns[line]

EDIT: If you are expecting the patterns in a certain order, then this solution would have to be revised. I wrote this under the assumption that the order of the patterns doesn't matter but each start pattern matches a specific end pattern.

回答5:

i would go with a generator-based solution

#!/usr/bin/env python    
start_patterns = ('PATTERN1', 'PATTERN2')
end_patterns = ('END PATTERN')

def section_with_bounds(gen):
  section_in_play = False
  for line in gen:
    if line.startswith(start_patterns):
      section_in_play = True
    if section_in_play:
      yield line
    if line.startswith(end_patterns):
      section_in_play = False

with open("text.t2") as f:
  gen = section_with_bounds(f)
  for line in gen:
    print line

来源：https://stackoverflow.com/questions/11156259/how-to-grep-lines-between-two-patterns-in-a-big-file-with-python

标签

python

grep

lines