Python string processing optimization


Question


So lately I've been writing a Python script for extracting data from large text files (> 1 GB). The problem basically boils down to selecting lines of text from the file and searching them for strings from an array (this array can have as many as 1000 strings in it). The catch is that I have to find a specific occurrence of a string, and the string may appear an unlimited number of times in the file. Some decoding and encoding is also required, which slows the script down further. The code looks something like this:

# the ~1000 search strings, one per line (strip the trailing newline from each)
strings = [line.strip() for line in open('file.txt')]

with open("er.txt", "r") as f:
    for chunk in f:          # iterate over the big file line by line
        for s in strings:
            # do search, trimming, stripping ...
            pass

My question here is: is there a way to optimize this? I tried multiprocessing, but it helps little (or at least the way I implemented it), because the chunk operations aren't independent and the strings list may be altered during one of them. Any optimization would help (string search algorithms, file reading, etc.). I did as much as I could regarding loop breaking, but it still runs pretty slowly.


Answer 1:


If you know exactly how the string is encoded in binary (ASCII, UTF-8), you can mmap the entire file into memory at once; it behaves exactly like a large bytearray/bytes (or str in Python 2) obtained by file.read(), and such an mmap object can be searched with a str regular expression (Python 2) or a bytes regular expression (Python 3).

The mmap is the fastest solution on many operating systems, because the read-only mapping means that the OS can freely map in the pages lazily as they're needed; no swap space is required, because the data is backed by the file. The OS can also map the data directly from the buffer cache with zero copying - thus a win-win-win over plain reading.

Example:

import mmap
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')
with open("datafile.txt", "rb") as f:
    # memory-map the file, size 0 means whole file;
    # prot=PROT_READ exists only on *nix and is used because the file is not writable
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    for match in pattern.finditer(mm):
        # process match
        print("The answer is {}".format(match.group(1).decode('ascii')))

    mm.close()

Now, if datafile.txt contained the text:

the ultimate answer is 42

somewhere within the 1 gigabyte of data, this program would be among the fastest Python solutions to spit out:

The answer is 42
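
The example above uses the POSIX-only prot=mmap.PROT_READ flag. If the script also has to run on Windows, a minimal variation (a sketch reusing the same pattern and file as above) can use the cross-platform access keyword instead:

import mmap
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')
with open("datafile.txt", "rb") as f:
    # access=ACCESS_READ is accepted on both Windows and *nix,
    # unlike the POSIX-only prot=PROT_READ
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for match in pattern.finditer(mm):
            print("The answer is {}".format(match.group(1).decode('ascii')))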

Notice that pattern.finditer also accepts start and end offsets (the pos and endpos parameters) that can be used to limit the range where the match is attempted.
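
For instance, a quick sketch (reusing mm and pattern from the example above) that only scans the first megabyte of the mapping:

# scan only the first 1 MiB of the mapping; the second and third
# arguments are the pos/endpos byte offsets
for match in pattern.finditer(mm, 0, 1024 * 1024):
    print(match.group(1).decode('ascii'))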


As noted by ivan_pozdeev, this requires 1 gigabyte of free virtual address space for mapping a gigabyte file (but not necessarily 1 gigabyte of RAM), which might be difficult in a 32-bit process but can almost certainly be assumed to be a non-issue on 64-bit operating systems and CPUs. In a 32-bit process the approach still works, but you need to map big files in smaller chunks - so now the bitness of the operating system and processor truly matters.
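
A rough sketch of that chunked approach (the window size and overlap below are arbitrary choices; the overlap just has to be at least as long as the longest possible match, so a hit cannot be lost or split at a window boundary):

import mmap
import os
import re

pattern = re.compile(b'the ultimate answer is ([0-9]+)')

# window offsets passed to mmap must be multiples of ALLOCATIONGRANULARITY
CHUNK = 1024 * mmap.ALLOCATIONGRANULARITY   # e.g. 4 MiB on Linux, 64 MiB on Windows
OVERLAP = 4096                              # must cover the longest possible match

with open("datafile.txt", "rb") as f:
    size = os.fstat(f.fileno()).st_size
    offset = 0
    while offset < size:
        length = min(CHUNK + OVERLAP, size - offset)
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                       offset=offset) as mm:
            for match in pattern.finditer(mm):
                # matches that start inside the overlap are reported by the
                # next window instead, so they are not counted twice
                if match.start() >= CHUNK and offset + CHUNK < size:
                    continue
                print(match.group(1).decode('ascii'))
        offset += CHUNK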




Answer 2:


Think about calling an external process (grep and the like) to speed up processing and reduce the data volume you have to handle within Python.
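
A sketch of that idea (file names taken from the question, and assuming GNU grep is on the PATH and the search strings are plain substrings): write the strings to a pattern file and let grep -F pre-filter the big file, so Python only post-processes the lines that contain at least one of them:

import subprocess

# the ~1000 search strings, one per line in file.txt
strings = [line.strip() for line in open('file.txt')]

# grep -F treats every pattern as a fixed string, -f reads patterns from a file
with open('patterns.txt', 'w') as pf:
    pf.write('\n'.join(strings) + '\n')

proc = subprocess.Popen(['grep', '-F', '-f', 'patterns.txt', 'er.txt'],
                        stdout=subprocess.PIPE, universal_newlines=True)
for line in proc.stdout:
    # only lines containing at least one search string arrive here;
    # do the trimming/stripping/decoding on this much smaller subset
    pass
proc.wait()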

Another route would be to filter or pre-filter your data with a compiled regex, since then your inner loop uses the optimized code of the standard library.
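
For example, a sketch (again using the file names from the question) that folds all search strings into a single compiled alternation, so each line is scanned once by the regex engine instead of being compared against 1000 strings in a Python loop:

import re

# the ~1000 search strings, one per line in file.txt
strings = [line.strip() for line in open('file.txt')]

# one alternation of escaped literals; the matching then runs in C
pattern = re.compile('|'.join(re.escape(s) for s in strings))

with open('er.txt', 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            # only lines with at least one hit reach the slow Python code;
            # match.group(0) tells you which string was found
            pass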

You could also try Cython or something similar for the hot inner loops; see e.g. https://books.google.de/books?id=QSsOBQAAQBAJ&dq=high+perf+python&hl=en for details on that.



Source: https://stackoverflow.com/questions/28643919/python-string-processing-optimization
