read multiple files using multiprocessing

离开以前 2020-12-10 18:23

I need to read some very large text files (100+ MB), process every line with a regex, and store the data in a structure. My structure inherits from defaultdict, it has a read

3 answers
  •  暖寄归人
    2020-12-10 18:46

    You're probably hitting two problems.

    One of them was mentioned: you're reading multiple files at once. Those reads will end up being interleaved, causing disk thrashing. You want to read each file in full, one file at a time, and only then multithread the computation on the data.
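A minimal sketch of that split, assuming the per-file work can be expressed as a function of the file's text. The names `read_all`, `count_words`, `process` and the `\w+` pattern are hypothetical stand-ins, not the asker's code:

```python
import re
from collections import Counter

# Hypothetical pattern; the asker's real regex is not shown in the question.
WORD_RE = re.compile(r"\w+")


def read_all(paths):
    """Read each file completely, one after another, so reads never interleave."""
    contents = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            contents.append(fh.read())
    return contents


def count_words(text):
    """CPU-bound step: apply the regex to text that is already in memory."""
    return Counter(WORD_RE.findall(text))


def process(paths, map_fn=map):
    """Do the I/O sequentially, then hand only the computation to map_fn."""
    texts = read_all(paths)
    total = Counter()
    for partial in map_fn(count_words, texts):
        total.update(partial)
    return total
```

With the default `map_fn=map` everything runs in one process. Passing `pool.map` from a `multiprocessing.Pool` (created under an `if __name__ == "__main__":` guard) parallelizes only the regex step, though the text and the results still have to be pickled between processes, which is the overhead described next.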

    Second, you're hitting the overhead of Python's multiprocessing module. It doesn't use threads; it starts multiple processes and serializes the results back through a pipe. That's very slow for bulk data. In fact, it seems to be slower than the work you're doing in each worker (at least in the example). This is the real-world problem caused by the GIL.

    If I modify do() to return None instead of container.items() to disable the extra data copy, this example is faster than a single thread, as long as the files are already cached:

    Two threads: 0.36 s elapsed, 168% CPU

    One thread (replace pool.map with map): 0.52 s elapsed, 98% CPU
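To get a feel for why returning `container.items()` is expensive, the sketch below compares the pickled size of a full result with that of a small summary. It assumes a made-up `build_container` in place of the real `do()` body, which is not shown in the question; `do_full`, `do_summary` and `payload_size` are likewise hypothetical names:

```python
import pickle
from collections import defaultdict


def build_container(n):
    """Hypothetical stand-in for the parsing work: n records in 100 buckets."""
    container = defaultdict(list)
    for i in range(n):
        container[i % 100].append(f"line {i}")
    return container


def do_full(n):
    """Return every item, as the original do() does with container.items()."""
    return list(build_container(n).items())


def do_summary(n):
    """Return only a small summary: (number of keys, number of records)."""
    container = build_container(n)
    return len(container), sum(len(v) for v in container.values())


def payload_size(result):
    """Bytes that would have to cross the pipe for this return value."""
    return len(pickle.dumps(result))
```

Comparing `payload_size(do_full(n))` with `payload_size(do_summary(n))` for a realistic `n` shows how much data each worker would push through the pipe. If the parent really needs the full structure, this cost cannot be avoided, only reduced.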

    Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.
