Read large file in parallel?

被撕碎了的回忆 2020-12-05 04:50

I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example.

6 Answers
  •  感情败类
    2020-12-05 05:37

    It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.

    Because the OP mentioned that the lists on each input line would in practice be longer than two elements, I made a slightly more realistic input file using:

    paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
      head -1000000 > largefile.txt
    
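    If coreutils aren't handy, a pure-Python equivalent of that generator (writing the same 1,000,000 tab-separated lines to the same file name) would be:

    # Pure-Python equivalent of the paste/seq pipeline above: 1,000,000 lines,
    # each holding three tab-separated increasing integers.
    with open("largefile.txt", "w") as f:
        for i in range(1, 1_000_001):
            f.write(f"{i}\t{i + 1}\t{i + 2}\n")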

    After profiling the original code, I found the slowest part of the process to be the line-splitting routine (.split() took roughly twice as long as .append() on my machine):

    1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
    1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}
    
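    A baseline along these lines produces a per-call table like the one above when run under cProfile (exact numbers will differ; the load function below is a reconstruction of the kind of loop being profiled, not the OP's actual code, since the original example isn't reproduced in full):

    import collections
    import cProfile
    import sys

    def load(path):
        # Hypothetical serial baseline: one split and one append per input line.
        d = collections.defaultdict(list)
        with open(path) as f:
            for line in f:
                keys = line.split()
                d[keys[0]].append(keys[1:])
        return d

    if __name__ == '__main__':
        # cProfile.run executes the statement in __main__'s namespace, so load and
        # sys are visible there; sorting by tottime puts split/append near the top.
        cProfile.run('load(sys.argv[1])', sort='tottime')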

    So I factored the splitting out into a separate function and used a pool to distribute the work of splitting the fields:

    import collections
    import multiprocessing as mp
    import sys

    d = collections.defaultdict(list)

    def split(l):
        # Workers only do the splitting; the dict is still built in the parent.
        return l.split()

    if __name__ == '__main__':
        # The guard keeps this runnable under the spawn start method (Windows/macOS).
        pool = mp.Pool(processes=4)
        for keys in pool.map(split, open(sys.argv[1])):
            d[keys[0]].append(keys[1:])

    Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:

    $ time python process.py smallfile.txt 
    real    0m7.170s
    user    0m6.884s
    sys     0m0.260s
    

    versus the parallel version:

    $ time python process-mp.py smallfile.txt 
    real    0m16.655s
    user    0m24.688s
    sys     0m1.380s
    

    Because the .map() call basically has to serialize (pickle) each input line, send it to a worker process, and then deserialize (unpickle) the return value coming back, using a pool this way ends up much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute the work.
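
    As a rough illustration of that serialization cost (a standalone micro-measurement, not a benchmark of Pool internals, and the figures will vary by machine), pickling each line on the way out and unpickling the split result on the way back already costs several times more than the split itself:

    import pickle
    import timeit

    line = "1\t2\t3\n"

    def round_trip():
        # Roughly what each Pool.map task pays on top of the work itself:
        # pickle the argument to ship it to the worker, then pickle/unpickle
        # the split result to get it back to the parent.
        arg = pickle.loads(pickle.dumps(line))
        return pickle.loads(pickle.dumps(arg.split()))

    n = 1_000_000
    t_split = timeit.timeit(lambda: line.split(), number=n)
    t_ipc = timeit.timeit(round_trip, number=n)
    print(f"split only: {t_split:.2f}s   with pickle round trips: {t_ipc:.2f}s")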

    To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
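
    A sketch of that block-based approach might look like the following (untested against the timings above; the 100,000-line block size and four workers are arbitrary choices, and chunking by a fixed number of lines in the parent sidesteps the variable line length problem rather than solving byte-offset splitting):

    import collections
    import multiprocessing as mp
    import sys

    def split_block(lines):
        # Split every line in one block and return (key, remaining-fields) pairs.
        out = []
        for line in lines:
            keys = line.split()
            out.append((keys[0], keys[1:]))
        return out

    def read_blocks(path, lines_per_block=100_000):
        # Yield the file as large lists of lines, so each task covers a whole
        # block rather than a single line.
        block = []
        with open(path) as f:
            for line in f:
                block.append(line)
                if len(block) >= lines_per_block:
                    yield block
                    block = []
        if block:
            yield block

    if __name__ == '__main__':
        d = collections.defaultdict(list)
        with mp.Pool(processes=4) as pool:
            # Each pickle round trip now covers 100,000 lines instead of one,
            # so the IPC overhead is amortized across the whole block.
            for pairs in pool.imap(split_block, read_blocks(sys.argv[1])):
                for key, rest in pairs:
                    d[key].append(rest)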
