Here is the experimental code that can launch a specified number of worker processes and then launch a specified number of worker threads within
Your task is I/O-bound rather than CPU-bound: the threads spend most of their time asleep, waiting for network data and the like, rather than using the CPU.
So adding more threads than CPUs works here as long as I/O is still the bottleneck. The effect will only subside once there are so many threads that enough of them are ready at a time to start actively competing for CPU cycles (or when your network bandwidth is exhausted, whichever comes first).
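To see this effect in isolation, here is a minimal sketch rather than the benchmark from the question. It assumes `time.sleep` is an acceptable stand-in for a blocking network call; the names `fake_io_task` and `run_batch` are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(delay):
    """Hypothetical stand-in for a network request: just blocks for `delay` seconds."""
    time.sleep(delay)  # the thread sleeps here and uses no CPU
    return delay


def run_batch(n_tasks, n_threads, delay=0.05):
    """Run n_tasks simulated I/O calls on n_threads threads; return wall time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() forces all results to be collected before the timer stops
        list(pool.map(fake_io_task, [delay] * n_tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    for n_threads in (1, 4, 16, 64):
        print(f"{n_threads:3d} threads: {run_batch(64, n_threads):.3f} s")
```

Since the tasks only wait, the wall time should keep dropping well past the number of CPU cores. Real requests would eventually hit the CPU or bandwidth limits described above, which this simulation does not model.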
As for why 20 threads per process is faster than 100 threads per process: this is most likely due to CPython's GIL. Python threads in the same process need to wait not only for I/O but for each other, too.
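When dealing with I/O, Python machinery:

1. Converts the Python objects involved into C objects (often without physically copying the data)
2. Releases the GIL
3. Performs the I/O in C (which involves waiting for an arbitrary time)
4. Reacquires the GIL
5. Converts the result into a Python object, if applicable
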
If there are enough threads in the same process, it becomes increasingly likely that another one is active when step 4 is reached (the point where the thread has to reacquire the GIL), causing an additional random delay.
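The extra delay can be made visible with a rough sketch like the one below. It is an illustration under assumptions, not a measurement of the original workload: `time.sleep` stands in for blocking I/O (it also releases the GIL while waiting), the helper names are hypothetical, and the competing threads are pure busy loops, which is a harsher case than threads that mostly wait.

```python
import threading
import time


def wakeup_overshoot(delay=0.01, rounds=20):
    """Average number of seconds by which time.sleep(delay) returns late."""
    total = 0.0
    for _ in range(rounds):
        start = time.perf_counter()
        time.sleep(delay)  # GIL is released while sleeping, reacquired on wake-up
        total += time.perf_counter() - start - delay
    return total / rounds


def overshoot_with_busy_threads(n_busy, delay=0.01, rounds=20):
    """Measure the same overshoot while n_busy other threads compete for the GIL."""
    stop = threading.Event()

    def spin():
        # Pure-Python busy loop: holds the GIL whenever it is running
        while not stop.is_set():
            pass

    workers = [threading.Thread(target=spin, daemon=True) for _ in range(n_busy)]
    for w in workers:
        w.start()
    try:
        return wakeup_overshoot(delay, rounds)
    finally:
        stop.set()
        for w in workers:
            w.join()


if __name__ == "__main__":
    print(f"no competing threads: {wakeup_overshoot() * 1000:.2f} ms late")
    print(f"4 competing threads:  {overshoot_with_busy_threads(4) * 1000:.2f} ms late")
```

On a standard (GIL-enabled) CPython build, the second figure is typically larger, because the waking thread has to wait for a running thread to give up the GIL. The exact numbers depend on the machine and on `sys.getswitchinterval()`.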
Now, when it comes to lots of processes, other factors come into play, such as memory swapping: unlike threads, processes running the same code don't share memory. (I'm pretty sure there are other delays that come from lots of processes, as opposed to threads, competing for resources, but I can't name them off the top of my head.) That's why the performance becomes unstable.