400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task

情书的邮戳 2020-12-10 08:35

Experimental Code

Here is the experimental code, which launches a specified number of worker processes and then a specified number of worker threads within each process.
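A minimal sketch of such a harness, assuming the I/O-bound task can be simulated with `time.sleep` (the function names and the scaled-down thread counts are illustrative, not the asker's original code):

```python
# Sketch: launch `num_procs` worker processes, each running
# `threads_per_proc` threads on a simulated I/O-bound task.
import multiprocessing
import threading
import time


def io_task():
    # Stand-in for an I/O wait (e.g. a network request); sleep releases the GIL.
    time.sleep(0.1)


def worker_process(threads_per_proc):
    threads = [threading.Thread(target=io_task) for _ in range(threads_per_proc)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run(num_procs, threads_per_proc):
    procs = [
        multiprocessing.Process(target=worker_process, args=(threads_per_proc,))
        for _ in range(num_procs)
    ]
    start = time.time()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.time() - start


if __name__ == "__main__":
    # Scaled down from the 400 threads in the question title.
    print("4 procs x 10 threads:", run(4, 10))
    print("2 procs x 20 threads:", run(2, 20))
```

Timing the two `run` calls with the same total thread count is the comparison the question is about.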

1 answer
  • 2020-12-10 09:06

    Your task is I/O-bound rather than CPU-bound: the threads spend most of their time asleep, waiting for network data and the like, rather than using the CPU.

    So adding more threads than CPUs works here as long as I/O is still the bottleneck. The effect will only subside once there are so many threads that enough of them are ready at a time to start actively competing for CPU cycles (or when your network bandwidth is exhausted, whichever comes first).


    As for why 20 threads per process is faster than 100 threads per process: this is most likely due to CPython's GIL. Python threads in the same process need to wait not only for I/O but for each other, too.
    When performing I/O, the Python machinery:

    1. Converts all Python objects involved into C objects (in many cases, this can be done without physically copying the data)
    2. Releases the GIL
    3. Performs the I/O in C (which involves waiting on it for an arbitrary amount of time)
    4. Reacquires the GIL
    5. Converts the result to a Python object, if applicable

    If there are enough threads in the same process, it becomes increasingly likely that another one is holding the GIL when step 4 is reached, causing an additional random delay.
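    Step 2 is why threads help at all here: blocking I/O calls release the GIL, so waits within one process overlap. A small demonstration, using `time.sleep` as a stand-in for a blocking socket read:

    ```python
    # Blocking I/O releases the GIL, so threads in one process overlap their waits.
    import threading
    import time

    def blocking_io():
        time.sleep(0.5)  # stands in for e.g. a socket recv; releases the GIL

    threads = [threading.Thread(target=blocking_io) for _ in range(10)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    # Ten overlapping 0.5 s waits complete in roughly 0.5 s, not 5 s.
    print(f"{elapsed:.2f}")
    ```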


    Now, when it comes to lots of processes, other factors come into play, such as memory swapping, since unlike threads, processes running the same code don't share memory. (I'm fairly sure there are other sources of delay from many processes, as opposed to threads, competing for resources, but I can't point to one off the top of my head.) That's why the performance becomes unstable.
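    The "processes don't share memory" point can be seen directly: a child process's writes to a module-level object are invisible to the parent, whereas a thread's are not. A minimal sketch (the `counter` and `bump` names are illustrative):

    ```python
    # Threads share the parent's memory; child processes get their own copy.
    import multiprocessing
    import threading

    counter = {"n": 0}

    def bump():
        counter["n"] += 1

    if __name__ == "__main__":
        t = threading.Thread(target=bump)
        t.start(); t.join()

        p = multiprocessing.Process(target=bump)
        p.start(); p.join()

        # Only the thread's increment is visible here; the process
        # incremented its own copy of `counter`.
        print(counter["n"])  # prints 1
    ```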
