Python string processing optimization

再見小時候 2020-12-05 22:23

So lately I've been making a Python script for extracting data from large text files (> 1 GB). The problem basically boils down to selecting lines of text from the file, a

2 Answers
  • 2020-12-05 22:51

    If you know exactly how the text is encoded in binary (ASCII, UTF-8), you can mmap the entire file into memory at once; it behaves exactly like a large bytearray/bytes (or str in Python 2) obtained by file.read(), and such an mmap object can be searched with a str regular expression (Python 2) or a bytes regular expression (Python 3).

    mmap is the fastest solution on many operating systems, because with a read-only mapping the OS can freely page the data in as it is needed; no swap space is required, because the data is backed by the file. The OS can also map the data directly from the buffer cache with zero copying - a win-win-win over plain reading.

    Example:

    import mmap
    import re
    
    pattern = re.compile(b'the ultimate answer is ([0-9]+)')
    with open("datafile.txt", "rb") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    
        # PROT_READ only on *nix as the file is not writable
        for match in pattern.finditer(mm):
            # process match
            print("The answer is {}".format(match.group(1).decode('utf8')))
    
        mm.close()
    

    Now, if the datafile.txt contained the text:

    the ultimate answer is 42
    

    somewhere along the 1 gigabyte of data, this program would be among the fastest python solutions to spit out:

    The answer is 42
    

    Notice that pattern.finditer also accepts start and end parameters that can be used to limit the range where the match is attempted.
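
    For example, a minimal sketch of that (the 10 MB limit is an illustrative assumption), re-using the pattern from the example above but only searching the first part of the mapping:

    import mmap
    import re
    
    pattern = re.compile(b'the ultimate answer is ([0-9]+)')
    with open("datafile.txt", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    
        # only search the first 10 MB of the mapping by passing
        # start and end offsets to finditer
        for match in pattern.finditer(mm, 0, 10 * 1024 * 1024):
            print("The answer is {}".format(match.group(1).decode('utf8')))
    
        mm.close()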


    As noted by ivan_pozdeev, this requires 1 gigabyte of free virtual address space for mapping a gigabyte file (but not necessarily 1 gigabyte of RAM), which might be difficult in a 32-bit process but can almost certainly be assumed to be a non-issue on 64-bit operating systems and CPUs. In a 32-bit process the approach still works, but you need to map big files in smaller chunks - so there the bitness of the operating system and processor truly matters.
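
    A sketch of that chunked approach for 32-bit processes (the chunk size and overlap are illustrative assumptions, and a match that falls entirely inside the overlap will be reported twice), using the offset/length parameters of mmap and the cross-platform access=mmap.ACCESS_READ:

    import mmap
    import os
    import re
    
    pattern = re.compile(b'the ultimate answer is ([0-9]+)')
    
    CHUNK = 256 * 1024 * 1024   # 256 MB windows; a multiple of mmap.ALLOCATIONGRANULARITY on common platforms
    OVERLAP = 1024              # so matches spanning a window boundary are not lost
    
    size = os.path.getsize("datafile.txt")
    with open("datafile.txt", "rb") as f:
        offset = 0
        while offset < size:
            length = min(CHUNK + OVERLAP, size - offset)
            # map only one window of the file at a time
            mm = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset)
            for match in pattern.finditer(mm):
                print("The answer is {}".format(match.group(1).decode('utf8')))
            mm.close()
            offset += CHUNK     # keeps the offset aligned to the allocation granularity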

  • 2020-12-05 23:07

    Think about calling an external process (grep and the like) to speed up processing and reduce the data volume you have to handle within Python.
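
    A minimal sketch of that (the file name and pattern are illustrative): let grep do the coarse filtering and only post-process the surviving lines in Python:

    import subprocess
    
    proc = subprocess.Popen(
        ["grep", "-E", "the ultimate answer is [0-9]+", "datafile.txt"],
        stdout=subprocess.PIPE,
    )
    for line in proc.stdout:
        # only the matching lines of the large file ever reach Python
        print(line.decode("utf8").rstrip())
    proc.wait()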

    Another route would be to filter or pre-filter your data with a compiled regex, since then your inner loop runs in the optimized code of the standard library.
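
    A sketch of that (file name and pattern are again illustrative): the matching work happens inside the re module's optimized code, and the Python-level loop only sees lines that survive the filter:

    import re
    
    pattern = re.compile(r"the ultimate answer is ([0-9]+)")
    with open("datafile.txt", "r", encoding="utf8") as f:
        for line in filter(pattern.search, f):
            print(line.rstrip())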

    You could also try Cython or similar for the hot inner loops; see e.g. https://books.google.de/books?id=QSsOBQAAQBAJ&dq=high+perf+python&hl=en for details on that.
