Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.
Is there any faster way to do this than the one I am using below?
Have you considered indexing your file? The way search engines work is by creating a mapping from words to their locations in the file. Say you have this file:
Foo bar baz dar. Dar bar haa.
You create an index that looks like this:
{
    "foo": {0},
    "bar": {4, 21},
    "baz": {8},
    "dar": {12, 17},
    "haa": {25},
}
A hash table index can be looked up in O(1), so it's freaking fast.
And when someone searches for the query "bar baz", you first break the query into its constituent words, ["bar", "baz"], then look up {4, 21} and {8}, and use those positions to jump right to the places where the queried text could possibly exist.
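To make the idea concrete, here is a minimal in-memory sketch of such an index using a plain Python dict (the helper names and the phrase-verification step are just for illustration; a real engine keeps the index on disk and handles tokenization, updates and huge files properly):

import re
from collections import defaultdict

def build_index(text):
    # map each lowercased word to the set of offsets where it starts
    index = defaultdict(set)
    for match in re.finditer(r"\w+", text):
        index[match.group().lower()].add(match.start())
    return index

def count_phrase(text, index, phrase):
    # look up the offsets of the first word, then verify the whole phrase there
    first = phrase.lower().split()[0]
    return sum(1 for pos in index.get(first, ())
               if text.lower().startswith(phrase.lower(), pos))

text = "Foo bar baz dar. Dar bar haa."
index = build_index(text)
print(count_phrase(text, index, "bar baz"))  # -> 1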
There are out-of-the-box solutions for indexed search as well, for example Solr or ElasticSearch.
We're talking about a simple count of a specific substring within a rather large data stream. The task is almost certainly I/O bound, but very easily parallelised. The first layer is the raw read speed; we can reduce the amount read by using compression, or distribute the transfer rate by storing the data in multiple places. Then we have the search itself; substring searches are a well-known problem, again I/O limited. If the data set comes from a single disk, pretty much any optimisation is moot, as there's no way the disk outpaces even a single core.
Assuming we do have chunks, which might for instance be the separate blocks of a bzip2 file (if we use a threaded decompressor), stripes in a RAID, or distributed nodes, we have much to gain from processing them individually. Each chunk is searched for needle, then joints can be formed by taking len(needle)-1 from the end of one chunk and the beginning of the next, and searching within those.
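As a rough, sequential sketch of that chunk-plus-joint scheme (the chunk size is arbitrary and count_in_file is just an illustrative name), the per-chunk counting and the border check could look like this:

import re

def count_in_file(filename, needle, chunksize=1 << 20):
    needle = needle.encode()                 # search raw bytes
    pattern = re.compile(re.escape(needle))  # fixed-string search
    borderlen = len(needle) - 1
    total = 0
    prev_tail = b""
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            # occurrences that lie entirely inside this chunk
            total += len(pattern.findall(chunk))
            # an occurrence spanning the border must sit inside
            # tail-of-previous + head-of-current (2*len(needle)-2 bytes),
            # so there can be at most one of them per border
            if borderlen > 0 and pattern.search(prev_tail + chunk[:borderlen]):
                total += 1
            prev_tail = chunk[-borderlen:] if borderlen > 0 else b""
    return total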
A quick benchmark demonstrates that the regular expression state machines operate faster than the usual in operator:
>>> timeit.timeit("x.search(s)", "s='a'*500000; import re; x=re.compile('foobar')", number=20000)
17.146117210388184
>>> timeit.timeit("'foobar' in s", "s='a'*500000", number=20000)
24.263535976409912
>>> timeit.timeit("n in s", "s='a'*500000; n='foobar'", number=20000)
21.562405109405518
Another step of optimization we can perform, given that we have the data in a file, is to mmap it instead of using the usual read operations. This permits the operating system to use the disk buffers directly. It also allows the kernel to satisfy multiple read requests in arbitrary order without making extra system calls, which lets us exploit things like an underlying RAID when operating in multiple threads.
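For illustration, a minimal single-threaded version of the mmap route might look like this (count_mmap is an invented name, and the sketch ignores edge cases such as an empty file); re can scan the mapped buffer directly when given a bytes pattern, and a length of 0 maps the whole file:

import mmap
import re

def count_mmap(filename, needle):
    pattern = re.compile(re.escape(needle.encode()))
    with open(filename, "rb") as f:
        # length=0 maps the entire file; the OS pages data in as it is scanned
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(1 for _ in pattern.finditer(mm))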
Here's a quickly tossed together prototype. A few things could obviously be improved, such as distributing the chunk processes if we have a multinode cluster, doing the tail+head check by passing one to the neighboring worker (an order which is not known in this implementation) instead of sending both to a special worker, and implementing an interthread limited queue (pipe) class instead of matching semaphores. It would probably also make sense to move the worker threads outside of the main thread function, since the main thread keeps altering its locals.
from mmap import mmap, ALLOCATIONGRANULARITY, ACCESS_READ
from re import compile, escape
from threading import Semaphore, Thread
from collections import deque

def search(needle, filename):
    # Might want chunksize=RAID block size, threads
    chunksize=ALLOCATIONGRANULARITY*1024
    threads=32
    # Read chunk allowance
    allocchunks=Semaphore(threads)  # should maybe be larger
    chunkqueue=deque()  # Chunks mapped, read by workers
    chunksready=Semaphore(0)
    headtails=Semaphore(0)  # edges between chunks into special worker
    headtailq=deque()
    sumq=deque()  # worker final results
    # Note: although we do push and pop at differing ends of the
    # queues, we do not actually need to preserve ordering.
    def headtailthread():
        # Since head+tail is 2*len(needle)-2 long,
        # it cannot contain more than one needle
        htsum=0
        matcher=compile(escape(needle))
        heads={}
        tails={}
        while True:
            headtails.acquire()
            try:
                pos,head,tail=headtailq.popleft()
            except IndexError:
                break  # semaphore signaled without data, end of stream
            try:
                prevtail=tails.pop(pos-chunksize)
                if matcher.search(prevtail+head):
                    htsum+=1
            except KeyError:
                heads[pos]=head
            try:
                nexthead=heads.pop(pos+chunksize)
                if matcher.search(tail+nexthead):
                    htsum+=1
            except KeyError:
                tails[pos]=tail
        # No need to check spill tail and head as they are shorter than needle
        sumq.append(htsum)
    def chunkthread():
        threadsum=0
        # escape special characters to achieve fixed string search
        matcher=compile(escape(needle))
        borderlen=len(needle)-1
        while True:
            chunksready.acquire()
            try:
                pos,chunk=chunkqueue.popleft()
            except IndexError:  # End of stream
                break
            # Let the re module do the heavy lifting
            threadsum+=len(matcher.findall(chunk))
            if borderlen>0:
                # Extract the end pieces for checking borders
                head=chunk[:borderlen]
                tail=chunk[-borderlen:]
                headtailq.append((pos,head,tail))
                headtails.release()
            chunk.close()
            allocchunks.release()  # let main thread allocate another chunk
        sumq.append(threadsum)
    with open(filename,'rb') as infile:
        htt=Thread(target=headtailthread)
        htt.start()
        chunkthreads=[]
        for i in range(threads):
            t=Thread(target=chunkthread)
            t.start()
            chunkthreads.append(t)
        pos=0
        fileno=infile.fileno()
        # Find the file size so the last chunk is not mapped past the end
        infile.seek(0,2)
        filesize=infile.tell()
        while pos<filesize:
            allocchunks.acquire()
            # Last chunk of the file may be shorter than chunksize
            chunk=mmap(fileno, min(chunksize,filesize-pos), access=ACCESS_READ, offset=pos)
            chunkqueue.append((pos,chunk))
            chunksready.release()
            pos+=chunksize
        # File ended, finish all chunks
        for t in chunkthreads:
            chunksready.release()  # wake thread so it finishes
        for t in chunkthreads:
            t.join()  # wait for thread to finish
        headtails.release()  # post event to finish border checker
        htt.join()
    # All threads finished, collect our sum
    return sum(sumq)

if __name__=="__main__":
    from sys import argv
    print "Found string %d times"%search(*argv[1:])
Also, modifying the whole thing to use some mapreduce routine (map chunks to counts, heads and tails, reduce by summing counts and checking tail+head parts) is left as an exercise.
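For a rough idea of what that exercise could look like on a single machine with multiprocessing.Pool (the names and chunk size are invented for the example): the map phase turns each chunk into a count plus its head and tail, and the reduce phase sums the counts and checks the tail+head joints:

import os
import re
from multiprocessing import Pool

def map_chunk(args):
    # map phase: one chunk -> (count inside the chunk, its head, its tail)
    filename, offset, length, needle = args
    pattern = re.compile(re.escape(needle))
    border = len(needle) - 1
    with open(filename, "rb") as f:
        f.seek(offset)
        chunk = f.read(length)
    return len(pattern.findall(chunk)), chunk[:border], chunk[len(chunk) - border:]

def mapreduce_count(filename, needle, chunksize=1 << 26):
    needle = needle.encode()
    size = os.path.getsize(filename)
    tasks = [(filename, off, chunksize, needle) for off in range(0, size, chunksize)]
    with Pool() as pool:
        results = pool.map(map_chunk, tasks)   # one map task per chunk, in order
    # reduce phase: sum the counts, then check each tail+head joint for
    # occurrences that straddle a chunk border (at most one per border)
    pattern = re.compile(re.escape(needle))
    total = sum(count for count, _, _ in results)
    for (_, _, tail), (_, head, _) in zip(results, results[1:]):
        if pattern.search(tail + head):
            total += 1
    return total

# usage (the __main__ guard matters on platforms that spawn worker processes):
# if __name__ == "__main__":
#     print(mapreduce_count("big.txt", "foobar"))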
Edit: Since it seems this search will be repeated with varying needles, an index would be much faster, being able to skip searches of sections that are known not to match. One possibility is making a map of which blocks contain any occurrence of various n-grams (accounting for block borders by allowing the n-grams to overlap into the next block); those maps can then be combined to find more complex conditions, before the blocks of original data need to be loaded. There are certainly databases that do this; look for full text search engines.
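A bare-bones sketch of that block map (trigrams, 1 MiB blocks and the function names are all picked arbitrarily for illustration): candidate_blocks returns the only blocks in which an occurrence of the needle could start, so everything else can be skipped before any of the original data is loaded:

from collections import defaultdict

BLOCKSIZE = 1 << 20   # 1 MiB blocks; size chosen arbitrarily for this sketch
N = 3                 # index trigrams

def build_block_index(filename):
    # Map every n-gram to the set of blocks it occurs in; n-grams spanning a
    # block border are attributed to the later block via the carried-over tail.
    index = defaultdict(set)
    with open(filename, "rb") as f:
        block_no, prev_tail = 0, b""
        while True:
            block = f.read(BLOCKSIZE)
            if not block:
                break
            data = prev_tail + block
            for i in range(len(data) - N + 1):
                index[data[i:i + N]].add(block_no)
            prev_tail = block[-(N - 1):]
            block_no += 1
    return index

def candidate_blocks(index, needle):
    # Blocks where an occurrence of needle could start; only these (plus a
    # len(needle)-1 joint into the following block) ever need to be scanned.
    needle = needle.encode()
    grams = [needle[i:i + N] for i in range(len(needle) - N + 1)]
    candidates = None   # stays None if the needle is shorter than N (no filtering)
    for g in grams:
        blocks = index.get(g, set())
        # a border-spanning occurrence may have this gram filed under the next block
        blocks = blocks | {b - 1 for b in blocks}
        candidates = blocks if candidates is None else candidates & blocks
    return candidates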