It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.
Because the OP mentioned that the lists on each input line would in practice be longer than two elements, I made a slightly more realistic input file using:
paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
head -1000000 > largefile.txt
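The result is a tab-separated file whose first few lines look like this:

1	2	3
2	3	4
3	4	5

i.e. one million lines with three whitespace-separated fields each.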
After profiling the original code, I found the slowest portion of the process to be the line-splitting routine (.split() took roughly twice as long as .append() on my machine):
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}
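For reference, those numbers come from running cProfile against a single-process loop along these lines (a sketch pieced together from the question, not necessarily the OP's exact code):

import sys
import collections

# Baseline: split each line and file the trailing fields under the first field.
d = collections.defaultdict(list)
for line in open(sys.argv[1]):
    keys = line.split()
    d[keys[0]].append(keys[1:])

Something like python -m cProfile -s tottime process.py largefile.txt produces that per-call breakdown.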
So I factored the split out into its own function and used a pool to distribute the work of splitting the fields:
import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

# Farm the splitting out to the pool: each worker turns one line
# into its list of fields.
def split(l):
    return l.split()

pool = mp.Pool(processes=4)
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])
Unfortunately, adding the pool slowed things down by more than 2x. The original version's timing looked like this:
$ time python process.py smallfile.txt
real 0m7.170s
user 0m6.884s
sys 0m0.260s
versus the parallel version:
$ time python process-mp.py smallfile.txt
real 0m16.655s
user 0m24.688s
sys 0m1.380s
Because the .map() call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.
To really speed this up across cores, my guess is that you'd need to read the input in large chunks of some fixed block size, send each whole block to a worker process, and get the split lists back (though it's still unknown how much the deserialization on the return trip will cost you). Reading the input in fixed-size blocks could be tricky here, though, since the lines aren't necessarily all the same length, so you'd have to make sure the block boundaries fall on line boundaries.
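A rough, unbenchmarked sketch of what that could look like, chunking on whole lines rather than a byte count to sidestep the variable-line-length issue (the block size, the helper names read_blocks and split_block, and the use of imap are all arbitrary choices of mine):

import sys
import collections
import multiprocessing as mp

BLOCK_LINES = 100000  # lines per block handed to a worker; tune to taste

def split_block(lines):
    # Split every line in the block; the whole list of lists comes back
    # to the parent in a single pickle.
    return [l.split() for l in lines]

def read_blocks(path, n):
    # Yield blocks of whole lines so no line is ever cut in half.
    block = []
    with open(path) as f:
        for line in f:
            block.append(line)
            if len(block) >= n:
                yield block
                block = []
    if block:
        yield block

if __name__ == '__main__':
    d = collections.defaultdict(list)
    with mp.Pool(processes=4) as pool:
        for fields in pool.imap(split_block, read_blocks(sys.argv[1], BLOCK_LINES)):
            for keys in fields:
                d[keys[0]].append(keys[1:])

Whether this actually beats the single-process version comes down to how expensive pickling the returned lists is, which is exactly the open question above.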