Read large file in parallel?

被撕碎了的回忆 2020-12-05 04:50

I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example.

6 Answers
  •  感情败类
    2020-12-05 05:37

    It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.

    Because the OP mentioned that the lists on each input line would in practice be longer than two elements, I made a slightly more realistic input file using:

    paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
      head -1000000 > largefile.txt
    
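    If coreutils aren't handy, a pure-Python equivalent of that generator (writing the same 1,000,000 tab-separated lines to the same file name) would be:

    # Pure-Python equivalent of the paste/seq pipeline above: 1,000,000 lines,
    # each holding three tab-separated increasing integers.
    with open("largefile.txt", "w") as f:
        for i in range(1, 1_000_001):
            f.write(f"{i}\t{i + 1}\t{i + 2}\n")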

    After profiling the original code, I found the slowest part of the process to be the line-splitting routine (.split() took roughly twice as long as .append() on my machine):

    1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
    1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}
    
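    A baseline along these lines produces a per-call table like the one above when run under cProfile (exact numbers will differ; the load function below is a reconstruction of the kind of loop being profiled, not the OP's actual code, since the original example isn't reproduced in full):

    import collections
    import cProfile
    import sys

    def load(path):
        # Hypothetical serial baseline: one split and one append per input line.
        d = collections.defaultdict(list)
        with open(path) as f:
            for line in f:
                keys = line.split()
                d[keys[0]].append(keys[1:])
        return d

    if __name__ == '__main__':
        # cProfile.run executes the statement in __main__'s namespace, so load and
        # sys are visible there; sorting by tottime puts split/append near the top.
        cProfile.run('load(sys.argv[1])', sort='tottime')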

    So I factored the splitting out into a separate function and used a pool to distribute the work of splitting the fields:

    import collections
    import multiprocessing as mp
    import sys

    d = collections.defaultdict(list)

    def split(l):
        # Workers only do the splitting; the dict is still built in the parent.
        return l.split()

    if __name__ == '__main__':
        # The guard keeps this runnable under the spawn start method (Windows/macOS).
        pool = mp.Pool(processes=4)
        for keys in pool.map(split, open(sys.argv[1])):
            d[keys[0]].append(keys[1:])

    Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:

    $ time python process.py smallfile.txt 
    real    0m7.170s
    user    0m6.884s
    sys     0m0.260s
    

    versus the parallel version:

    $ time python process-mp.py smallfile.txt 
    real    0m16.655s
    user    0m24.688s
    sys     0m1.380s
    

    Because the .map() call basically has to serialize (pickle) each input line, send it to a worker process, and then deserialize (unpickle) the return value coming back, using a pool this way ends up much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute the work.
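
    As a rough illustration of that serialization cost (a standalone micro-measurement, not a benchmark of Pool internals, and the figures will vary by machine), pickling each line on the way out and unpickling the split result on the way back already costs several times more than the split itself:

    import pickle
    import timeit

    line = "1\t2\t3\n"

    def round_trip():
        # Roughly what each Pool.map task pays on top of the work itself:
        # pickle the argument to ship it to the worker, then pickle/unpickle
        # the split result to get it back to the parent.
        arg = pickle.loads(pickle.dumps(line))
        return pickle.loads(pickle.dumps(arg.split()))

    n = 1_000_000
    t_split = timeit.timeit(lambda: line.split(), number=n)
    t_ipc = timeit.timeit(round_trip, number=n)
    print(f"split only: {t_split:.2f}s   with pickle round trips: {t_ipc:.2f}s")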

    To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
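
    A sketch of that block-based approach might look like the following (untested against the timings above; the 100,000-line block size and four workers are arbitrary choices, and chunking by a fixed number of lines in the parent sidesteps the variable line length problem rather than solving byte-offset splitting):

    import collections
    import multiprocessing as mp
    import sys

    def split_block(lines):
        # Split every line in one block and return (key, remaining-fields) pairs.
        out = []
        for line in lines:
            keys = line.split()
            out.append((keys[0], keys[1:]))
        return out

    def read_blocks(path, lines_per_block=100_000):
        # Yield the file as large lists of lines, so each task covers a whole
        # block rather than a single line.
        block = []
        with open(path) as f:
            for line in f:
                block.append(line)
                if len(block) >= lines_per_block:
                    yield block
                    block = []
        if block:
            yield block

    if __name__ == '__main__':
        d = collections.defaultdict(list)
        with mp.Pool(processes=4) as pool:
            # Each pickle round trip now covers 100,000 lines instead of one,
            # so the IPC overhead is amortized across the whole block.
            for pairs in pool.imap(split_block, read_blocks(sys.argv[1])):
                for key, rest in pairs:
                    d[key].append(rest)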
