How to read a file and extract data between multiline patterns?

Posted by 寵の児 on 2019-12-08 03:22:14

Question


I have a file from which I need to extract one piece of data, delimited by (possibly) multiline fixed patterns:

some data ... [my opening pattern
is here
and can be multiline] the data 
I want to extract [my ending
pattern which can be
multiline as well] ... more data

These patterns are fixed in the sense that the content is always the same, except that newlines may appear between words.

The solution would be simple if I had the assurance that my patterns would be predictably formatted, but I do not.

Is there a way to match such "patterns" against a stream?

There is a question which is almost a duplicate, and its answers point towards buffering the input. The difference in my case is that I know the exact strings in the pattern, except that the words may also be separated by a newline (so there is no need for \w* kinds of matches).


Answer 1:


Are you looking for this?

>>> import re
>>> data = """
... some data ... [my opening pattern
... is here
... and can be multiline] the data
... I want to extract [my ending
... pattern which can be
... multiline as well] ... more data
... """
>>> re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', data)
['the data \nI want to extract']
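
Because the question states that the exact words of both delimiters are known and only the whitespace between them varies, the pattern can also be built from the literal phrases instead of matching any bracketed text. The following is a minimal sketch under that assumption; `phrase_to_regex` and `extract_between` are hypothetical helper names, not part of the original answer.

```python
import re

def phrase_to_regex(phrase):
    """Escape each word and allow any whitespace run (including newlines) between words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())

def extract_between(text, opening, closing):
    """Return the text between the first opening/closing pair, or None if absent."""
    pattern = re.compile(
        phrase_to_regex(opening) + r"(.*?)" + phrase_to_regex(closing),
        re.DOTALL,  # let '.' cross line breaks inside the captured data
    )
    match = pattern.search(text)
    return match.group(1) if match else None

data = """
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
"""

# The delimiters are written on one line; line breaks in the input are tolerated.
opening = "[my opening pattern is here and can be multiline]"
closing = "[my ending pattern which can be multiline as well]"

result = extract_between(data, opening, closing)
print(repr(result.strip()))
```

The captured group keeps the whitespace that surrounds the data, so the caller strips it. Unlike the bracket-based regex above, this version does not match other bracketed text that happens to appear in the file.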

UPDATE: To read a large file in chunks, I suggest the following approach:

## The following was modified based on ChrisA's code in
## http://www.gossamer-threads.com/lists/python/python/1242366.
## Titled " How to read from a file to an arbitrary delimiter efficiently?"
import re

class ChunkIter:
    def __init__(self, f, delim):
        """ f: file object
        delim: regex pattern"""        
        self.f = f
        self.delim = re.compile(delim)
        self.buffer = ''
        self.part = '' # the string to return

    def read_to_delim(self):
        """Return characters up to the last delim, or None if at EOF"""

        while "delimiter not found":
            b = self.f.read(256)
            if not b: # if EOF
                self.part = None
                break
            # Continue reading to buffer
            self.buffer += b
            # Try regex split the buffer string    
            parts = self.delim.split(self.buffer)
            # If pattern is found
            if parts[:-1]:
                # Retrieve the string up to the last delim
                self.part = ''.join(parts[:-1])
                # Reset buffer string
                self.buffer = parts[-1]
                break   

        return self.part

if __name__ == '__main__':
    # Raw strings keep the regex backslashes intact; print() is the Python 3 form.
    with open('input.txt', 'r') as f:
        chunk = ChunkIter(f, r'(\[[^]]*\]\s+(?:[^[]+)\s+\[[^]]+\])')
        while chunk.read_to_delim():
            print(re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', chunk.part))

    print('job done.')
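
The same buffering idea can be combined with delimiters built from the known phrases, written as a generator. This is only a sketch: `phrase_to_regex`, `iter_between`, and the `chunk_size` parameter are hypothetical names chosen for illustration, and the buffer is only trimmed after a complete match, so it keeps growing if the closing delimiter never appears.

```python
import io
import re

def phrase_to_regex(phrase):
    """Escape each word and allow any whitespace run (including newlines) between words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())

def iter_between(f, opening, closing, chunk_size=256):
    """Yield each piece of text found between opening and closing while reading f in chunks."""
    pattern = re.compile(
        phrase_to_regex(opening) + r"(.*?)" + phrase_to_regex(closing),
        re.DOTALL,
    )
    buffer = ""
    while True:
        block = f.read(chunk_size)
        if not block:  # EOF: whatever is left holds no complete match
            break
        buffer += block
        consumed = 0
        for match in pattern.finditer(buffer):
            yield match.group(1)
            consumed = match.end()
        # Drop everything up to the end of the last complete match.
        buffer = buffer[consumed:]

opening = "[my opening pattern is here and can be multiline]"
closing = "[my ending pattern which can be multiline as well]"

sample = (
    "some data ... [my opening pattern\nis here\nand can be multiline] the data\n"
    "I want to extract [my ending\npattern which can be\nmultiline as well] ... more data\n"
    "[my opening pattern is here and can be multiline] second [my ending pattern\n"
    "which can be multiline as well]\n"
)

# io.StringIO stands in for a file opened in text mode.
for piece in iter_between(io.StringIO(sample), opening, closing, chunk_size=16):
    print(repr(piece.strip()))
```

A match is only reported once the whole closing phrase is in the buffer, so a delimiter split across two reads is handled by the next iteration rather than missed.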


Source: https://stackoverflow.com/questions/35888841/how-to-read-a-file-and-extract-data-between-multiline-patterns
