increase number of threads decrease time

Submitted by 时光毁灭记忆、已成空白 on 2019-12-04 05:35:51

Question


I'm a newbie in OpenMP, starting with the tutorial from the official OpenMP page: https://www.youtube.com/playlist?list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG

That tutorial includes a hello-world program that calculates pi by approximating an integral. I simply wrote the code below following the instructions, but its running time increases as I increase the number of threads by changing NUM_THREADS. In the video, the time goes down.

I'm running the program on a remote server with 64 CPUs that have 8 cores each.

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000;
double step;

#define NUM_THREADS 2 

int main()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    double start_t;

    step = 1.0 / (double) num_steps;

    omp_set_num_threads(NUM_THREADS);

    start_t = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;

        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++) {
        pi += sum[i] * step;
    }
    printf("%f\n", omp_get_wtime() - start_t);
}

Answer 1:


This is a bad approach to implementing reduction using shared arrays. The successive elements of sum are too close to each other and therefore reside in the same cache line. On cache-coherent architectures like x86/x64, this leads to a problem known as false sharing. The following simple modification will get rid of it:

double sum[8*NUM_THREADS];

#pragma omp parallel
{
    ...
    for (i = id, sum[8*id] = 0.0; i < num_steps; i = i + nthrds) {
        ...
        sum[8*id] += 4.0 / (1.0 + x*x);
    }
}
for (i = 0, pi = 0.0; i < nthreads; i++) {
    pi += sum[8*i] * step;
}

Only the relevant changes are shown. The idea is simple: instead of having threads access successive elements of sum, make them access every 8th element. This guarantees that no two threads share a cache line, since on most modern CPUs a cache line is 64 bytes long, which corresponds to 64 / sizeof(double) = 8 array elements.
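For completeness, here is one way the padded version could look as a full program, assembled from the question's code and the change above. This is only a sketch: the PAD constant, the pi printout and the return statement are additions for illustration, and the factor of 8 assumes a 64-byte cache line.

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 2
#define PAD 8   /* assumed 64-byte cache line: 64 / sizeof(double) = 8 doubles */

static long num_steps = 100000;
double step;

int main()
{
    int i, nthreads;
    double pi = 0.0;
    double sum[NUM_THREADS * PAD];   /* one cache line per thread slot */
    double start_t;

    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    start_t = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;

        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;

        /* accumulate into slot PAD*id so no two threads touch the same cache line */
        for (i = id, sum[id * PAD] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id * PAD] += 4.0 / (1.0 + x * x);
        }
    }
    for (i = 0; i < nthreads; i++)
        pi += sum[i * PAD] * step;

    printf("pi = %f, time = %f s\n", pi, omp_get_wtime() - start_t);
    return 0;
}

With gcc, something like gcc -O2 -fopenmp pi_padded.c should build it (the file name is just an example).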

Edit: my mistake, I should have watched the video in the first place. False sharing is explained there just after the results from running the code are shown. If you don't get any speedup in your case, that's probably because newer CPU generations handle false sharing better.
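As a side note, the shared array and the manual padding can be avoided altogether with OpenMP's reduction clause, which gives every thread a private accumulator and combines the copies when the loop ends. A minimal sketch along those lines, not part of the original answer:

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;

int main()
{
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    long i;

    /* reduction(+:sum) gives each thread a private copy of sum and adds the
       copies together at the end, so no shared array is involved at all */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi = %f\n", sum * step);
    return 0;
}

Because the parallel for divides the iterations among the threads itself, the manual round-robin indexing over num_steps is no longer needed either.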



Source: https://stackoverflow.com/questions/22582182/increase-number-of-threads-decrease-time
