Python string processing optimization

再見小時候 2020-12-05 22:23

So lately I've been making a Python script for extracting data from large text files (> 1 GB). The problem basically boils down to selecting lines of text from the file, a

2 Answers
  • 2020-12-05 22:51

    If you know exactly how the text is encoded in binary (ASCII, UTF-8), you can mmap the entire file into memory at once; it behaves exactly like a large bytearray/bytes (or str in Python 2) obtained by file.read(), and such an mmap object can be searched with a str regular expression (Python 2) or a bytes regular expression (Python 3).

    mmap is the fastest solution on many operating systems, because with a read-only mapping the OS can freely page the data in as it is needed; no swap space is required, because the data is backed by the file. The OS can also map the data directly from the buffer cache with zero copying - a win-win-win over plain reading.

    Example:

    import mmap
    import re
    
    pattern = re.compile(b'the ultimate answer is ([0-9]+)')
    with open("datafile.txt", "rb") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    
        # PROT_READ only on *nix as the file is not writable
        for match in pattern.finditer(mm):
            # process match
            print("The answer is {}".format(match.group(1).decode('utf8')))
    
        mm.close()
    

    Now, if the datafile.txt contained the text:

    the ultimate answer is 42
    

    somewhere along the 1 gigabyte of data, this program would be among the fastest python solutions to spit out:

    The answer is 42
    

    Notice that pattern.finditer also accepts start and end parameters that can be used to limit the range where the match is attempted.
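
    For example, a minimal sketch of that (the 10 MB limit is an illustrative assumption), re-using the pattern from the example above but only searching the first part of the mapping:

    import mmap
    import re
    
    pattern = re.compile(b'the ultimate answer is ([0-9]+)')
    with open("datafile.txt", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    
        # only search the first 10 MB of the mapping by passing
        # start and end offsets to finditer
        for match in pattern.finditer(mm, 0, 10 * 1024 * 1024):
            print("The answer is {}".format(match.group(1).decode('utf8')))
    
        mm.close()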


    As noted by ivan_pozdeev, this requires 1 gigabyte of free virtual address space for mapping a gigabyte file (but not necessarily 1 gigabyte of RAM), which might be difficult in a 32-bit process but can almost certainly be assumed to be a non-issue on 64-bit operating systems and CPUs. In a 32-bit process the approach still works, but you need to map big files in smaller chunks - so there the bitness of the operating system and processor truly matters.
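
    A sketch of that chunked approach for 32-bit processes (the chunk size and overlap are illustrative assumptions, and a match that falls entirely inside the overlap will be reported twice), using the offset/length parameters of mmap and the cross-platform access=mmap.ACCESS_READ:

    import mmap
    import os
    import re
    
    pattern = re.compile(b'the ultimate answer is ([0-9]+)')
    
    CHUNK = 256 * 1024 * 1024   # 256 MB windows; a multiple of mmap.ALLOCATIONGRANULARITY on common platforms
    OVERLAP = 1024              # so matches spanning a window boundary are not lost
    
    size = os.path.getsize("datafile.txt")
    with open("datafile.txt", "rb") as f:
        offset = 0
        while offset < size:
            length = min(CHUNK + OVERLAP, size - offset)
            # map only one window of the file at a time
            mm = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset)
            for match in pattern.finditer(mm):
                print("The answer is {}".format(match.group(1).decode('utf8')))
            mm.close()
            offset += CHUNK     # keeps the offset aligned to the allocation granularity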

  • 2020-12-05 23:07

    Think about calling an external process (grep and the like) to speed up processing and reduce the data volume you have to handle within Python.
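
    A minimal sketch of that (the file name and pattern are illustrative): let grep do the coarse filtering and only post-process the surviving lines in Python:

    import subprocess
    
    proc = subprocess.Popen(
        ["grep", "-E", "the ultimate answer is [0-9]+", "datafile.txt"],
        stdout=subprocess.PIPE,
    )
    for line in proc.stdout:
        # only the matching lines of the large file ever reach Python
        print(line.decode("utf8").rstrip())
    proc.wait()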

    Another route would be to filter or pre-filter your data with a compiled regex, since then your inner loop runs in the optimized code of the standard library.
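
    A sketch of that (file name and pattern are again illustrative): the matching work happens inside the re module's optimized code, and the Python-level loop only sees lines that survive the filter:

    import re
    
    pattern = re.compile(r"the ultimate answer is ([0-9]+)")
    with open("datafile.txt", "r", encoding="utf8") as f:
        for line in filter(pattern.search, f):
            print(line.rstrip())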

    You could also try Cython or similar for the hot inner loops; see e.g. https://books.google.de/books?id=QSsOBQAAQBAJ&dq=high+perf+python&hl=en for details on that.
