Why does OpenMP fail to sum these numbers?

Submitted by 自闭症网瘾萝莉.ら on 2019-12-11 04:16:43

Question


Consider the following minimal C code example. When compiled and run with export OMP_NUM_THREADS=4 && gcc -fopenmp minimal2.c && ./a.out (Homebrew GCC 5.2.0 on OS X 10.11), it usually behaves correctly, i.e. it prints seven lines with the same number. But sometimes this happens:

[ ] bsum=1.893293142303100e+03
[1] asum=1.893293142303100e+03
[2] asum=1.893293142303100e+03
[0] asum=1.893293142303100e+03
[3] asum=3.786586284606200e+03
[ ] bsum=1.893293142303100e+03
[ ] asum=3.786586284606200e+03
equal: 0

It looks like a race condition, but my code seems fine to me. What am I doing wrong?

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef _OPENMP
#include <omp.h>
#define ID omp_get_thread_num()
#else
#define ID 0
#endif
#define N 1400

double a[N];

double verify() {
    int i;
    double bsum = 0.0;
    for (i = 0; i < N; i++) {
        bsum += a[i] * a[i];
    }
    fprintf(stderr, "[ ] bsum=%.15e\n", bsum);
    return bsum;
}

int main(int argc, char *argv[]) {
    int i;
    double asum = 0.0, bsum;
    srand((unsigned int)time(NULL));
    //srand(1445167001); // fails on my machine
    for (i = 0; i < N; i++) {
        a[i] = 2 * (double)rand()/(double)RAND_MAX;
    }
    bsum = verify();
    #pragma omp parallel shared(asum)
    {
        #pragma omp for reduction(+: asum)
        for (i = 0; i < N; i++) {
            asum += a[i] * a[i];
        }
        fprintf(stderr, "[%d] asum=%.15e\n", ID, asum);
    }
    bsum = verify();
    fprintf(stderr, "[ ] asum=%.15e\n", asum);
    return 0;
}

EDIT: Gilles brought to my attention that differences beginning at the 15th significant digit are normal, as I had overestimated the precision of a double. I also cannot reproduce the faulty behavior (a result of 2x the correct number) on a Debian machine, so this might be specific to Homebrew GCC or to the Mac.

I ran into a similar issue here, but the two do not seem to be related (at least in my eyes), so I started this as a separate question.


Answer 1:


I strongly suspect that this is because floating-point addition is not associative. OpenMP may add the products in a different order than the serial loop does, which yields slightly different results.

The OpenMP 4.0 spec, section 1.3 Execution Model says:

For example, a serial addition reduction may have a different pattern of addition associations than a parallel reduction. These different associations may change the results of floating-point addition.

See "OpenMP parallel for reduction delivers wrong results" for a suggested solution.



Source: https://stackoverflow.com/questions/33190809/why-does-openmp-fail-to-sum-these-numbers
