Increasing the number of threads does not decrease the run time

Posted by 烂漫一生 on 2019-12-02 10:18:29

This is a bad approach to implementing reduction using shared arrays. The successive elements of sum are too close to each other and therefore reside in the same cache line. On cache-coherent architectures like x86/x64, this leads to a problem known as false sharing. The following simple modification will get rid of it:

double sum[8*NUM_THREADS];

#pragma omp parallel
{
    ...
    for (i = id, sum[8*id] = 0.0; i < num_steps; i = i + nthrds) {
        ...
        sum[8*id] += 4.0 / (1.0 + x*x);
    }
}
for (i = 0, pi = 0.0; i < nthreads; i++) {
    pi += sum[8*i] * step;
}

Only the relevant changes are shown. The idea is simple: instead of having threads access successive elements of sum, make them access every 8th element. This guarantees that no two threads touch the same cache line, since on most modern CPUs a cache line is 64 bytes long, which corresponds to 64 / sizeof(double) = 8 array elements.

Edit: my mistake, should have watched the video in the first place. False sharing is explained just after the results from running the code are shown. If you don't get any speedup in your case, that's probably because newer CPU generations handle false sharing better.
