This question is about the same program I previously asked about. To recap, I have a program with a loop structure like this:
for (int i1 = 0; i1 < N; i1+
Even though threads don't access the same elements of the array at the same, the whole array may sit in a few memory pages. When one core/processor writes to that page, it has to invalidate its cache for all other processors.
Avoid having many threads working over the same memory space. Allocate separate data for each thread to work upon, then join them together when the calculation finishes.