Read multiple files using multiprocessing

离开以前 2020-12-10 18:23

I need to read some very large text files (100+ MB), process every line with a regex, and store the data in a structure. My structure inherits from defaultdict, it has a read

3 answers

暖寄归人 (original poster)

2020-12-10 18:46

You're probably hitting two problems.

One of them was already mentioned: you're reading multiple files at once. Those reads end up interleaved, which causes disk thrashing. You want to read each file in full, one at a time, and then parallelize only the computation on the data.
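A minimal sketch of that split, assuming a simple word-counting regex in place of the real per-line processing. The names WORD_RE, count_words, read_whole and process_files are hypothetical and not from the question. The files are read sequentially in the parent process, and only the regex work goes through the mapping function:

```python
import re
import sys
from collections import Counter
from multiprocessing import Pool

# Hypothetical pattern standing in for the real per-line regex.
WORD_RE = re.compile(r"\w+")


def count_words(text):
    """CPU-bound step: run the regex over text that is already in memory."""
    return Counter(WORD_RE.findall(text))


def read_whole(path):
    """I/O step: read one file completely before touching the next one."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


def process_files(paths, map_fn=map):
    """Read the files one after another, then map the computation over them.

    map_fn defaults to the builtin map (single process); pass pool.map
    to spread the regex work over worker processes instead.
    """
    texts = [read_whole(p) for p in paths]  # sequential reads, no interleaving
    total = Counter()
    for partial in map_fn(count_words, texts):
        total.update(partial)
    return total


if __name__ == "__main__":
    paths = sys.argv[1:]
    if paths:
        with Pool(2) as pool:
            print(process_files(paths, pool.map).most_common(5))
```

Note that passing the file contents to the workers is itself a copy through a pipe, so this only pays off when the per-file computation clearly outweighs that transfer.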

Second, you're hitting the overhead of Python's multiprocessing module. It doesn't actually use threads; it starts multiple processes and serializes the results back through a pipe. That's very slow for bulk data. In fact, it seems to be slower than the work you're doing in each worker (at least in the example). This is the real-world cost of the GIL: threads can't run Python code in parallel, so you're pushed to separate processes and pay for copying the data between them.

If I modify do() to return None instead of container.items() to disable the extra data copy, this example is faster than a single thread, as long as the files are already cached:

Two threads: 0.36 s elapsed, 168% CPU

One thread (replace pool.map with map): 0.52 s elapsed, 98% CPU
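To get a feel for how much the returned data matters, here is a rough sketch that compares the pickled size of a full result with that of a compact summary. The functions build_index, do_full, do_summary and pickled_size are hypothetical stand-ins, not the do() from the question, and the container layout (word to list of line numbers) is an assumption:

```python
import pickle
from collections import defaultdict


def build_index(lines):
    """Hypothetical container: map each word to the line numbers it occurs on."""
    container = defaultdict(list)
    for number, line in enumerate(lines):
        for word in line.split():
            container[word].append(number)
    return container


def do_full(lines):
    """Return everything, as a worker returning container.items() would."""
    return list(build_index(lines).items())


def do_summary(lines):
    """Return only a small summary: the occurrence count per word."""
    return {word: len(hits) for word, hits in build_index(lines).items()}


def pickled_size(obj):
    """Bytes that multiprocessing would have to push through the pipe."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


if __name__ == "__main__":
    sample = ["alpha beta alpha"] * 100_000
    print("full result:", pickled_size(do_full(sample)), "bytes")
    print("summary:    ", pickled_size(do_summary(sample)), "bytes")
```

If the workers can reduce their output to something like the summary before returning, far less data has to be serialized, which is the same effect as returning None in the timing above.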

Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.
