So lately I've been making a Python script for extracting data from large text files (> 1 GB). The problem basically sums up to selecting lines of text from the file, a…
If you know exactly how the string is encoded in binary (ASCII, UTF-8), you can mmap the entire file into memory at once; it behaves exactly like the large bytearray/bytes (or str in Python 2) that file.read() would give you, so such an mmap object can be searched with a str regular expression (Python 2) or a bytes regular expression (Python 3).
The mmap approach is the fastest solution on many operating systems, because the read-only mapping means the OS can map pages in lazily as they are needed; no swap space is required, because the data is backed by the file itself. The OS can also map the data directly from the buffer cache with zero copying - a clear win over plain reading.
Example:
import mmap
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')

with open("datafile.txt", "rb") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # PROT_READ only on *nix as the file is not writable
    for match in pattern.finditer(mm):
        # process match
        print("The answer is {}".format(match.group(1).decode('utf8')))
    mm.close()
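As the comment says, the prot argument only exists on *nix; if you need the same thing on Windows, a portable variation is to request a read-only mapping with the standard access keyword instead, leaving the rest of the example unchanged:

# portable read-only mapping: works on Windows and *nix alike
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)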
Now, if the datafile.txt contained the text:
the ultimate answer is 42
somewhere along the 1 gigabyte of data, this program would be among the fastest Python solutions to spit out:
The answer is 42
Notice that pattern.finditer also accepts start and end parameters (pos and endpos in the re documentation) that can be used to limit the range where the match is attempted, as sketched below.
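For instance, reusing the mm and pattern objects from the example above, a small sketch that only scans the first mebibyte of the mapping (the 1 MiB limit is an arbitrary value for illustration):

# scan only the first MiB of the mapping; the 2nd and 3rd arguments
# are the pos/endpos offsets accepted by compiled-pattern methods
for match in pattern.finditer(mm, 0, 1024 * 1024):
    print("The answer is {}".format(match.group(1).decode('utf8')))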
As noted by ivan_pozdeev, this requires 1 gigabyte of free virtual address space for mapping a gigabyte file (but not necessarily 1 gigabyte of RAM), which might be difficult in a 32-bit process but can almost certainly be assumed to be a non-issue on 64-bit operating systems and CPUs. In a 32-bit process the approach still works, but you need to map big files in smaller chunks - so here the bitness of the operating system and processor truly matters.
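A minimal sketch of that chunked approach, assuming a *nix system as in the example above (the chunk size, overlap and file name are arbitrary illustration values, not anything mandated by mmap):

import mmap
import os
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')

CHUNK = 64 * 1024 * 1024   # map 64 MiB at a time; must be a multiple of
                           # mmap.ALLOCATIONGRANULARITY, which 64 MiB is
OVERLAP = 1024             # extra tail so a match straddling a chunk boundary
                           # is still seen; must cover the longest possible match

with open("datafile.txt", "rb") as f:
    size = os.fstat(f.fileno()).st_size
    offset = 0
    while offset < size:
        length = min(CHUNK + OVERLAP, size - offset)
        mm = mmap.mmap(f.fileno(), length, offset=offset, prot=mmap.PROT_READ)
        for match in pattern.finditer(mm):
            # matches that begin in the overlap tail belong to the next chunk,
            # so skip them here to avoid reporting them twice
            if match.start() >= CHUNK:
                continue
            print("The answer is {}".format(match.group(1).decode('utf8')))
        mm.close()
        offset += CHUNK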