This question is about the same program I previously asked about. To recap, I have a program with a loop structure like this:
for (int i1 = 0; i1 < N; i1+
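You are seeing cache line bouncing: several threads keep writing to the same cache lines, so each line is repeatedly transferred from one core's cache to another. I'm also really surprised that you don't get wrong results, given the race conditions on the histogram buckets.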
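One common way to avoid both problems is to give each thread its own private histogram and merge them once at the end. The following is only a minimal sketch of that idea, not your actual loop (which isn't fully shown here): the function name `parallel_histogram`, the use of `std::thread`, and the assumption that the input is already a list of bucket indices are all mine.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not from the question): counts how often each bucket
// index occurs in `data`. Assumes every value in `data` is < num_buckets.
std::vector<std::uint64_t> parallel_histogram(const std::vector<std::size_t>& data,
                                              std::size_t num_buckets,
                                              unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::vector<std::uint64_t>> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Private histogram: only this thread ever writes to it, so there
            // is no data race and no shared bucket for the cores to fight over.
            std::vector<std::uint64_t> local(num_buckets, 0);
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) ++local[data[i]];
            partial[t] = std::move(local);  // publish once, when the work is done
        });
    }
    for (auto& w : workers) w.join();

    // Single-threaded merge after all workers have finished.
    std::vector<std::uint64_t> total(num_buckets, 0);
    for (const auto& p : partial)
        for (std::size_t b = 0; b < num_buckets; ++b) total[b] += p[b];
    return total;
}
```

The merge costs `num_threads * num_buckets` additions, which is usually negligible next to the main loop. If the histogram is too large to duplicate per thread, atomic increments would fix the race but not the bouncing.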