Python: regex match across file chunk boundaries

两盒软妹~` submitted on 2019-12-05 08:09:12

Assuming this is your exact problem, you could probably just adapt your regex and read the file line by line; this never loads the full file into memory, and since each match is contained in a single line, no match can span a chunk boundary:

import re

matches = []
identifier_pattern = re.compile(r'Identifier: (.*?)$')
with open('huge_file') as f:
    for line in f:
        # the file object yields one line at a time, keeping memory usage flat
        matches += identifier_pattern.findall(line)

print("matches", matches)

You can also control how the chunks are formed and keep them close to 1024 * 1024 * 1024 bytes. Because each chunk is built from whole lines, a match is never split across a chunk boundary, so you avoid missing parts:

import re


identifier_pattern = re.compile(r'Identifier: (.*?)\n')
counter = 1024 * 1024 * 1024
data_chunk = ''
with open('huge_file', 'r') as f:
    for line in f:
        data_chunk += line
        if len(data_chunk) > counter:
            # findall returns a list of the captured groups
            print(identifier_pattern.findall(data_chunk))
            data_chunk = ''
    # Analyse the last chunk of data
    print(identifier_pattern.findall(data_chunk))

Alternatively, you can go over the same file twice with different starting read positions (the first pass from position 0, the second from the maximum length of a matched string collected during the first pass) and store the results as dictionaries keyed by the start position of the matched string in the file. That position is the same in both passes, so merging the results is not a problem; to be more accurate, merge by both the start position and the length of the matched string. A sketch of this idea follows.
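
Here is a minimal sketch of that two-pass idea; the names scan and chunk_size, the pattern, and the binary-mode reads are illustrative assumptions, not part of the original suggestion:

import re

pattern = re.compile(rb'Identifier: \w+')
chunk_size = 1024 * 1024

def scan(filename, offset):
    # Collect matches keyed by their absolute start position in the file.
    results = {}
    with open(filename, 'rb') as f:
        f.seek(offset)
        pos = offset
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for m in pattern.finditer(chunk):
                results[pos + m.start()] = m.group()
            pos += len(chunk)
    return results

first = scan('huge_file', 0)
# Shift the second pass by the longest match found so far: every chunk
# boundary moves, so a match split in pass one lies whole inside a chunk
# in pass two, and identical keys in both dicts hold identical matches.
max_match_len = max((len(v) for v in first.values()), default=1)
second = scan('huge_file', max_match_len)
merged = {**first, **second}

Since the two passes' chunk boundaries are exactly max_match_len apart, no match (which is at most max_match_len long) can straddle a boundary in both passes.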

Good luck!

If the file is line-based, the file object is a lazy generator of lines: it loads the file into memory line by line (reading in buffered chunks under the hood). Based on that, you can use:

import re

matches = []
with open('huge_file') as f:
    for line in f:
        # use a raw string so \s is not treated as an invalid escape
        matches += re.findall(r"Identifier:\s(.*?)$", line)

I have a solution very similar to Jack's answer:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        m.extend(identifier_pattern.findall(line))

You could use another part of the regex API to get the same result:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        pattern_found = identifier_pattern.search(line)
        if pattern_found:
            # group(1) is the captured value, matching what findall returns
            value_found = pattern_found.group(1)
            m.append(value_found)

We could simplify this using a generator expression and a list comprehension:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

with open('huge_file', 'r') as f:
    patterns_found = (identifier_pattern.search(line) for line in f)
    m = [pattern_found.group(1)
         for pattern_found in patterns_found if pattern_found]

If the length of the matched result string is known, the easiest way, I think, is to cache the last few bytes of the previous chunk around the boundary.

Suppose the result's length is 3: keep the last 2 chars of the previous chunk, prepend them to the new chunk, and match on that.

In runnable form, assuming (as above) that every match has the same known length:

import re

pattern = re.compile(r'Identifier: \w{4}')  # here every match is exactly 16 chars
match_result_len = 16
boundary = ''

with open('huge_file') as f:
    while True:
        chunk = f.read(1024 * 1024)
        if not chunk:
            break
        # prepend the tail of the previous chunk so a split match is still seen
        print(pattern.findall(boundary + chunk))
        boundary = chunk[-(match_result_len - 1):]