So lately I've been making a Python script for extracting data from large text files (> 1 GB). The problem basically sums up to selecting lines of text from the file, a…
If you know exactly how the string is encoded in binary (ASCII, UTF-8), you can mmap the entire file into memory at once; it behaves exactly like the large bytearray/bytes (or str in Python 2) that file.read() would give you, so such an mmap object can be searched with a str regular expression (Python 2) or a bytes regular expression (Python 3).
The mmap approach is the fastest solution on many operating systems, because the read-only mapping means the OS can map pages in lazily as they are needed; no swap space is required, because the data is backed by the file itself. The OS can also map the data directly from the buffer cache with zero copying - a clear win over plain reading.
Example:
import mmap
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')

with open("datafile.txt", "rb") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # PROT_READ only on *nix as the file is not writable
    for match in pattern.finditer(mm):
        # process match
        print("The answer is {}".format(match.group(1).decode('utf8')))
    mm.close()
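As the comment says, the prot argument only exists on *nix; if you need the same thing on Windows, a portable variation is to request a read-only mapping with the standard access keyword instead, leaving the rest of the example unchanged:

# portable read-only mapping: works on Windows and *nix alike
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)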
Now, if the datafile.txt contained the text:
the ultimate answer is 42
somewhere along the 1 gigabyte of data, this program would be among the fastest Python solutions to spit out:
The answer is 42
Notice that pattern.finditer also accepts start and end parameters (pos and endpos in the re documentation) that can be used to limit the range where the match is attempted, as sketched below.
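For instance, reusing the mm and pattern objects from the example above, a small sketch that only scans the first mebibyte of the mapping (the 1 MiB limit is an arbitrary value for illustration):

# scan only the first MiB of the mapping; the 2nd and 3rd arguments
# are the pos/endpos offsets accepted by compiled-pattern methods
for match in pattern.finditer(mm, 0, 1024 * 1024):
    print("The answer is {}".format(match.group(1).decode('utf8')))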
As noted by ivan_pozdeev, this requires 1 gigabyte of free virtual address space for mapping a gigabyte file (but not necessarily 1 gigabyte of RAM), which might be difficult in a 32-bit process but can almost certainly be assumed to be a non-issue on 64-bit operating systems and CPUs. In a 32-bit process the approach still works, but you need to map big files in smaller chunks - so here the bitness of the operating system and processor truly matters.
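A minimal sketch of that chunked approach, assuming a *nix system as in the example above (the chunk size, overlap and file name are arbitrary illustration values, not anything mandated by mmap):

import mmap
import os
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')

CHUNK = 64 * 1024 * 1024   # map 64 MiB at a time; must be a multiple of
                           # mmap.ALLOCATIONGRANULARITY, which 64 MiB is
OVERLAP = 1024             # extra tail so a match straddling a chunk boundary
                           # is still seen; must cover the longest possible match

with open("datafile.txt", "rb") as f:
    size = os.fstat(f.fileno()).st_size
    offset = 0
    while offset < size:
        length = min(CHUNK + OVERLAP, size - offset)
        mm = mmap.mmap(f.fileno(), length, offset=offset, prot=mmap.PROT_READ)
        for match in pattern.finditer(mm):
            # matches that begin in the overlap tail belong to the next chunk,
            # so skip them here to avoid reporting them twice
            if match.start() >= CHUNK:
                continue
            print("The answer is {}".format(match.group(1).decode('utf8')))
        mm.close()
        offset += CHUNK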