Huge plain-text data file
I read a huge file in chunks using Python, then apply a regex to each chunk. Based on an identifier tag, I want to extract the corresponding value. Because of the chunking, data is lost at the chunk boundaries.
Requirements:
- The file must be read in chunks.
- The chunk sizes must be smaller than or equal to 1 GiB.
Python code example
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')
with open('huge_file', 'r') as f:
    data_chunk = f.read(1024 * 1024 * 1024)
    m = re.findall(identifier_pattern, data_chunk)
Chunk data examples
Good: the number of tags equals the number of values
Identifier: value
Identifier: value
Identifier: value
Identifier: value
Depending on where a chunk ends, you get boundary issues like the one listed below. The third identifier returns an incomplete value, "v" instead of "value"; the next chunk starts with "alue". This causes missing data after parsing.
Bad: identifier value incomplete
Identifier: value
Identifier: value
Identifier: v
How do you solve chunk boundary issues like this?
Assuming this is your exact problem, you could probably just adapt your regex and read the file line by line (which won't load the full file into memory):
import re

matches = []
identifier_pattern = re.compile(r'Identifier: (.*?)$')
with open('huge_file') as f:
    for line in f:
        matches += re.findall(identifier_pattern, line)

print("matches", matches)
You can also control how the chunks are formed yourself and keep them close to 1024 * 1024 * 1024 characters; because each chunk then ends on a line boundary, you avoid the missing parts:
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')
counter = 1024 * 1024 * 1024
data_chunk = ''
with open('huge_file', 'r') as f:
    for line in f:
        data_chunk = '{}{}'.format(data_chunk, line)
        if len(data_chunk) > counter:
            m = re.findall(identifier_pattern, data_chunk)
            print(m)
            data_chunk = ''

# Analyse the last chunk of data
m = re.findall(identifier_pattern, data_chunk)
print(m)
Alternatively, you can go over the same file twice with different read starting points (the first pass from offset 0, the second pass from an offset equal to the maximum length of a matched string collected during the first pass). Store the results as dictionaries keyed by the start position of the matched string in the file; that position is the same in both passes, so merging the results is not a problem. To be more accurate, you could merge by both the start position and the length of the matched string.
Good luck!
If the file is line-based, the file object is a lazy generator of lines: it loads the file into memory line by line (in buffered chunks). Based on that, you can use:
import re

matches = []
for line in open('huge_file'):
    matches += re.findall(r"Identifier:\s(.*?)$", line)
I have a solution very similar to Jack's answer:
#!/usr/bin/env python3
import re

identifier_pattern = re.compile(r'Identifier: (.*)$')
m = []
with open('huge_file', 'r') as f:
    for line in f:
        m.extend(identifier_pattern.findall(line))
You could use another part of the regex API to get the same result:
#!/usr/bin/env python3
import re

identifier_pattern = re.compile(r'Identifier: (.*)$')
m = []
with open('huge_file', 'r') as f:
    for line in f:
        pattern_found = identifier_pattern.search(line)
        if pattern_found:
            # group(1) is the captured value; group(0) would include the "Identifier: " prefix
            value_found = pattern_found.group(1)
            m.append(value_found)
This can be simplified using a generator expression and a list comprehension:
#!/usr/bin/env python3
import re

identifier_pattern = re.compile(r'Identifier: (.*)$')
with open('huge_file', 'r') as f:
    patterns_found = (identifier_pattern.search(line) for line in f)
    m = [pattern_found.group(1)
         for pattern_found in patterns_found if pattern_found]
If the length of the matched result string is known, I think the easiest way is to cache the bytes of the previous chunk around the boundary.
Suppose the result's length is 3: keep the last 2 characters of the previous chunk and prepend them to the new chunk before matching.
Pseudo-code:
regex pattern
string boundary
int match_result_len

for chunk in chunks:
    match(boundary + chunk, pattern)
    boundary = chunk[-(match_result_len - 1):]
Source: https://stackoverflow.com/questions/44212183/python-regex-match-across-file-chunk-boundaries