read multiple files using multiprocessing

离开以前 2020-12-10 18:23

I need to read some very large text files (100+ MB), process every line with a regex, and store the data into a structure. My structure inherits from defaultdict, it has a read
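A minimal sketch of the kind of pipeline being described, assuming a hypothetical "key value" line format, hypothetical file names, and a plain-dict merge step (the asker's actual regex and structure are not shown):

    import re
    from collections import defaultdict
    from multiprocessing import Pool

    LINE_RE = re.compile(r"^(\S+)\s+(\d+)$")  # hypothetical "key value" line format

    def parse_file(path):
        # Parse one file into a plain dict so the result pickles cheaply.
        container = defaultdict(list)
        with open(path) as f:
            for line in f:
                m = LINE_RE.match(line)
                if m:
                    container[m.group(1)].append(int(m.group(2)))
        return dict(container)

    if __name__ == "__main__":
        files = ["log1.txt", "log2.txt", "log3.txt"]  # hypothetical paths
        with Pool() as pool:
            partials = pool.map(parse_file, files)
        merged = defaultdict(list)
        for part in partials:
            for key, values in part.items():
                merged[key].extend(values)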

3 Answers
  • 2020-12-10 18:41

    Multiprocessing is better suited to CPU- or memory-bound work, since the seek time of rotational drives kills performance when you switch between files. Either load your log files onto a fast flash drive or some sort of memory disk (physical or virtual), or give up on multiprocessing.

  • 2020-12-10 18:46

    You're probably hitting two problems.

    One of them was mentioned: you're reading multiple files at once. Those reads will end up being interleaved, causing disk thrashing. You want to read whole files at once, and then only multithread the computation on the data.
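    A hedged sketch of that approach, assuming a hypothetical process_text() computation and file list: the main process does the sequential whole-file reads, and only the in-memory computation is handed to the pool (note that the text still has to be pickled to each worker).

        from multiprocessing import Pool

        def process_text(text):
            # Placeholder for the CPU-bound work on data that is already in memory.
            return sum(1 for line in text.splitlines() if line)

        if __name__ == "__main__":
            files = ["log1.txt", "log2.txt"]  # hypothetical paths
            # Sequential whole-file reads: no seeking back and forth between files.
            texts = [open(path).read() for path in files]
            with Pool(processes=2) as pool:
                results = pool.map(process_text, texts)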

    Second, you're hitting the overhead of Python's multiprocessing module. It's not actually using threads, but instead starting multiple processes and serializing the results through a pipe. That's very slow for bulk data; in fact, it seems to be slower than the work you're doing in the thread (at least in this example). This is the real-world cost of the GIL: it forces you into separate processes, and moving bulk data between them is expensive.

    If I modify do() to return None instead of container.items() to disable the extra data copy, this example is faster than a single thread, as long as the files are already cached:

    Two threads: 0:00.36elapsed 168%CPU

    One thread (replace pool.map with map): 0:00.52elapsed 98%CPU
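    A rough sketch of that comparison, assuming a hypothetical do() body, per-line work, and timing harness (only the do() / container.items() names come from the answer): the single flag controls whether the worker's container gets pickled back through the pipe.

        import time
        from collections import defaultdict
        from multiprocessing import Pool

        def do(path, return_items=True):
            container = defaultdict(int)
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    if fields:
                        container[fields[0]] += 1  # hypothetical per-line work
            # Returning the items forces them to be pickled back to the parent process.
            return list(container.items()) if return_items else None

        if __name__ == "__main__":
            files = ["log1.txt", "log2.txt"]  # hypothetical paths
            for return_items in (True, False):
                start = time.time()
                with Pool(2) as pool:
                    pool.starmap(do, [(path, return_items) for path in files])
                print("return items:", return_items, "elapsed:", time.time() - start)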

    Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.

  • 2020-12-10 18:56

    You're creating a pool with as many workers as files. That may be too many. Usually, I aim to have the number of workers around the same as the number of cores.

    The simple fact is that your final step is going to be a single process merging all the results together. There is no avoiding this, given your problem description. This is known as a barrier synchronization: all tasks have to reach the same point before any can proceed.

    You should probably run this program multiple times, or in a loop, passing a different value to multiprocessing.Pool() each time, starting at 1 and going to the number of cores. Time each run, and see which worker count does best.

    The result will depend on how CPU-intensive (as opposed to disk-intensive) your task is. I would not be surprised if 2 were best if your task is about half CPU and half disk, even on an 8-core machine.
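    A minimal sketch of that timing loop, assuming a hypothetical work() function and file list (substitute the real per-file processing):

        import time
        from multiprocessing import Pool, cpu_count

        def work(path):
            # Placeholder for the real per-file processing.
            with open(path) as f:
                return sum(1 for _ in f)

        if __name__ == "__main__":
            files = ["log1.txt", "log2.txt", "log3.txt"]  # hypothetical paths
            for workers in range(1, cpu_count() + 1):
                start = time.time()
                with Pool(workers) as pool:
                    pool.map(work, files)
                print(workers, "worker(s):", round(time.time() - start, 2), "seconds")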
