OpenMP embarrassingly parallel for loop, no speedup

故里飘歌  2021-01-05 08:41

I have what seems to be a very simple parallel for loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop runs.
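
The code from the question is not preserved in this copy; a minimal sketch consistent with the description (the array size, pass count, and the clock() timing are assumptions) looks like this:

    #include <ctime>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> vec(100000000, 1);  // assumed size
        clock_t t = clock();
        for(int c = 0; c < 50; c++) {        // assumed pass count
            #pragma omp parallel for
            for(int i = 0; i < (int)vec.size(); i++)
                vec[i] = 0;                  // write zeros to the integer array
        }
        t = clock() - t;
        // The timing is the trap the answer points out: on Linux, clock() reports
        // CPU time summed over all threads, so it grows as threads are added.
        printf("time %.2f s\n", (double)t / CLOCKS_PER_SEC);
        return 0;
    }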

3 Answers
  •  情歌与酒
    2021-01-05 08:49

    Besides your error in using the clock function on Linux (clock() returns CPU time summed across all of a process's threads, not wall-clock time, so it increases as you add threads), the rest of your question can be answered by reading these questions/answers:

    in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987

    measuring-memory-bandwidth-from-the-dot-product-of-two-arrays

    memset-in-parallel-with-threads-bound-to-each-physical-core
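
    The fix for the timing itself is to use omp_get_wtime(), which returns elapsed wall-clock seconds; this is the same idiom used in the full listing further down:

    double dtime = -omp_get_wtime();
    // ... region to time ...
    dtime += omp_get_wtime();  // wall-clock seconds, independent of the thread count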

    So you should see a significant benefit from multiple threads with memset and with doing a reduction, even on a single-socket system. I have written my own tool to measure bandwidth. Below are some of the results from my i5-4250U (Haswell) with 2 cores (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.

    one thread

    C standard library
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.80       6.68       0.00        inf %
    memcpy:              1.00       1.35       7.93       0.00        inf %
    
    Agner Fog's asmlib
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.71       7.53       0.00        inf %
    memcpy:              1.00       0.93      11.51       0.00        inf %
    
    my_memset   
                         0.50       0.71       7.53       0.00        inf %
    
    
    FMA3 reduction tests
                           GB    time(s)       GB/s     GFLOPS   efficiency
    vsum:                0.50       0.53      10.08       2.52        inf %
    vmul:                0.50       0.68       7.93       1.98        inf %
    vtriad:              0.50       0.70       7.71       3.85        inf %
    dot:                 1.00       1.08       9.93       2.48        inf %
    

    two threads

    C standard library
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.64       8.33       0.00        inf %
    memcpy:              1.00       1.10       9.76       0.00        inf %
    
    Agner Fog's asmlib
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.36      14.98       0.00        inf %
    memcpy:              1.00       0.66      16.30       0.00        inf %
    
    my_memset
                         0.50       0.36      15.03       0.00        inf %
    
    
    FMA3 tests
    standard sum tests with OpenMP: 2 threads
                           GB    time(s)       GB/s     GFLOPS   efficiency
    vsum:                0.50       0.41      13.03       3.26        inf %
    vmul:                0.50       0.39      13.67       3.42        inf %
    vtriad:              0.50       0.44      12.20       6.10        inf %
    dot:                 1.00       0.97      11.11       2.78        inf %
    

    Here is my custom memset function (I have several other tests like this).

    void my_memset(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);
        #pragma omp parallel for
        for(i=0; i<(int)(n/4); i++)
            _mm_stream_si128((__m128i*)&s[4*i], v);  // non-temporal (streaming) store; assumes n is a multiple of 4 and s is 16-byte aligned
    }
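
    The _mm_stream_si128 intrinsic issues a non-temporal (streaming) store, which bypasses the cache and avoids the read-for-ownership that an ordinary store triggers, so a pure fill writes each cache line without first reading it.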

    Edit:

    You should compile with -O3 and -ffast-math. Define the sum outside of the outer loop and then print it out so GCC does not optimize it away. GCC won't auto-vectorize a reduction because floating-point arithmetic is not associative, and vectorizing the loop could break IEEE floating-point rules. Using -ffast-math lets GCC treat floating-point arithmetic as associative, which allows it to vectorize the reduction. It should be pointed out that an OpenMP reduction already assumes the floating-point arithmetic is associative, so it already breaks IEEE floating-point rules.

    double sum = 0;
    tic();
    for(int c = 0; c < COUNT; ++ c) { 
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();
    printf("sum %f\n", sum);
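
    (tic() and toc() are timing helpers, presumably from the bandwidth tool mentioned above, not standard C functions; the omp_get_wtime() pattern in the full listing below does the same job.)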
    

    Edit:

    I tested your code and made some modifications. I get faster times with the reduction and with memset when using multiple threads:

    max threads: 4
    serial reduction
    dtime 1.86, sum 705032704
    parallel reduction
    dtime 1.39 s, sum 705032704
    serial memset
    dtime 2.95 s
    parallel memset
    dtime 2.44 s
    serial my_memset
    dtime 2.66 s
    parallel my_memset
    dtime 1.35 s
    

    Here is the code I used (compiled with g++ foo.cpp -fopenmp -O3 -ffast-math):

    #include <iostream>
    #include <vector>
    #include <cstdio>
    #include <omp.h>
    
    #include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_stream_si128
    
    void my_memset(int *s, int c, size_t n) {
        __m128i v = _mm_set1_epi32(c);
        for(int i=0; i<(int)(n/4); i++) _mm_stream_si128((__m128i*)&s[4*i], v);
    }
    
    void my_memset_omp(int *s, int c, size_t n) {
        __m128i v = _mm_set1_epi32(c);
        #pragma omp parallel for
        for(int i=0; i<(int)(n/4); i++) _mm_stream_si128((__m128i*)&s[4*i], v);
    }
    
    int main() {
        const size_t sz = 100000000;  // sz and COUNT reconstructed: 50 passes over 1e8 ones gives the printed (int-overflowed) sum 705032704
        const int COUNT = 50;
        std::vector<int> vec(sz, 1);
    
        std::cout << "max threads: " << omp_get_max_threads()<< std::endl;
    
        std::cout << "serial reduction" << std::endl;
        double dtime;
        int sum;
    
        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                sum += vec[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f, sum %d\n", dtime, sum);
    
        int *const ptr = vec.data();
        const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int
    
        std::cout << "parallel reduction" << std::endl;
    
    
        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none) reduction(+:sum)
            for(int i = 0; i < sz_i; ++ i)
                sum += ptr[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s, sum %d\n", dtime, sum);
    
        std::cout << "serial memset" << std::endl;
    
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                vec[i] = 0;
        }   
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "parallel memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none)
            for(int i = 0; i < sz_i; ++ i)
                ptr[i] = 0;
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "serial my_memset" << std::endl;
    
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);
    
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "parallel my_memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        return 0;
    }
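
    To compare thread counts without recompiling, set the standard OMP_NUM_THREADS environment variable when running:

    g++ foo.cpp -fopenmp -O3 -ffast-math -o foo
    OMP_NUM_THREADS=1 ./foo
    OMP_NUM_THREADS=4 ./foo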
    
