I have what seems to be a very simple parallel for loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets.
You spotted the timing error. There is still no speedup because both of your test cases are heavily memory bound. On typical consumer hardware all cores share one memory bus, so using more threads does not give you more bandwidth and, since bandwidth is the bottleneck, it does not give you any speedup either. This will probably change if you reduce the problem size so that it fits into cache, and it will certainly change if you increase the number of calculations per data element, for example by computing the reduction of exp(vec[i]) or 1/vec[i].

For the memset: a single thread can already saturate the memory bus, so you will never see a speedup there. (The only exception is if additional threads give you access to a second memory bus, as on some multi-socket architectures.)

One remark regarding the reduction: it is most probably not implemented with a lock, which would be horribly inefficient, but with an addition tree, whose cost grows only logarithmically with the number of threads.
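To see the difference between a memory-bound and a more compute-heavy loop yourself, something along these lines could serve as a starting point. It is only a sketch, assuming OpenMP and C++11. The names `zero_fill`, `sum_exp` and `seconds` are made up for illustration and are not taken from your code.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Memory-bound case: one store per element and almost no arithmetic,
// so extra threads mostly compete for the same memory bandwidth.
void zero_fill(std::vector<int>& vec) {
    const long long n = static_cast<long long>(vec.size());
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        vec[i] = 0;
    }
}

// More work per element: exp() costs far more than the load feeding it,
// so this loop has a better chance of scaling with the thread count.
double sum_exp(const std::vector<double>& vec) {
    const long long n = static_cast<long long>(vec.size());
    double sum = 0.0;
    // The reduction clause gives each thread a private partial sum and
    // combines the partial sums when the loop finishes.
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < n; ++i) {
        sum += std::exp(vec[i]);
    }
    return sum;
}

// Hypothetical helper: wall-clock seconds taken by a callable.
// steady_clock measures elapsed time, not CPU time summed over threads.
template <typename F>
double seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Compile with OpenMP enabled (for example `-fopenmp` on GCC or Clang), wrap each call as `seconds([&] { zero_fill(v); })`, and vary `OMP_NUM_THREADS` between runs. Whether `sum_exp` actually scales on your machine depends on the hardware and the array size, so treat the numbers as a measurement and not as a guarantee.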