OpenMP embarrassingly parallel for loop, no speedup

故里飘歌  2021-01-05 08:41

I have what seems to be a very simple parallel for loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop runs.
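
The code from the question is not preserved in this copy; a minimal sketch consistent with the description (the array size, pass count, and the clock() timing are assumptions) looks like this:

    #include <ctime>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> vec(100000000, 1);  // assumed size
        clock_t t = clock();
        for(int c = 0; c < 50; c++) {        // assumed pass count
            #pragma omp parallel for
            for(int i = 0; i < (int)vec.size(); i++)
                vec[i] = 0;                  // write zeros to the integer array
        }
        t = clock() - t;
        // The timing is the trap the answer points out: on Linux, clock() reports
        // CPU time summed over all threads, so it grows as threads are added.
        printf("time %.2f s\n", (double)t / CLOCKS_PER_SEC);
        return 0;
    }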

3 Answers
  •  情歌与酒
    2021-01-05 08:49

    Besides your error in using the clock function on Linux (clock() returns CPU time summed across all of a process's threads, not wall-clock time, so it increases as you add threads), the rest of your question can be answered by reading these questions/answers:

    in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987

    measuring-memory-bandwidth-from-the-dot-product-of-two-arrays

    memset-in-parallel-with-threads-bound-to-each-physical-core
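
    The fix for the timing itself is to use omp_get_wtime(), which returns elapsed wall-clock seconds; this is the same idiom used in the full listing further down:

    double dtime = -omp_get_wtime();
    // ... region to time ...
    dtime += omp_get_wtime();  // wall-clock seconds, independent of the thread count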

    So you should see a significant benefit from multiple threads with memset and with doing a reduction, even on a single-socket system. I have written my own tool to measure bandwidth. Below are some of the results from my i5-4250U (Haswell) with 2 cores (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.

    one thread

    C standard library
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.80       6.68       0.00        inf %
    memcpy:              1.00       1.35       7.93       0.00        inf %
    
    Agner Fog's asmlib
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.71       7.53       0.00        inf %
    memcpy:              1.00       0.93      11.51       0.00        inf %
    
    my_memset   
                         0.50       0.71       7.53       0.00        inf %
    
    
    FMA3 reduction tests
                           GB    time(s)       GB/s     GFLOPS   efficiency
    vsum:                0.50       0.53      10.08       2.52        inf %
    vmul:                0.50       0.68       7.93       1.98        inf %
    vtriad:              0.50       0.70       7.71       3.85        inf %
    dot:                 1.00       1.08       9.93       2.48        inf %
    

    two threads

    C standard library
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.64       8.33       0.00        inf %
    memcpy:              1.00       1.10       9.76       0.00        inf %
    
    Agner Fog's asmlib
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.36      14.98       0.00        inf %
    memcpy:              1.00       0.66      16.30       0.00        inf %
    
    my_memset
                         0.50       0.36      15.03       0.00        inf %
    
    
    FMA3 tests
    standard sum tests with OpenMP: 2 threads
                           GB    time(s)       GB/s     GFLOPS   efficiency
    vsum:                0.50       0.41      13.03       3.26        inf %
    vmul:                0.50       0.39      13.67       3.42        inf %
    vtriad:              0.50       0.44      12.20       6.10        inf %
    dot:                 1.00       0.97      11.11       2.78        inf %
    

    Here is my custom memset function (I have several other tests like this).

    void my_memset(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);
        #pragma omp parallel for
        for(i=0; i<(int)(n/4); i++)
            _mm_stream_si128((__m128i*)&s[4*i], v);  // non-temporal (streaming) store; assumes n is a multiple of 4 and s is 16-byte aligned
    }
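
    The _mm_stream_si128 intrinsic issues a non-temporal (streaming) store, which bypasses the cache and avoids the read-for-ownership that an ordinary store triggers, so a pure fill writes each cache line without first reading it.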

    Edit:

    You should compile with -O3 and -ffast-math. Define the sum outside of the outer loop and then print it out so GCC does not optimize it away. GCC won't auto-vectorize a reduction because floating-point arithmetic is not associative, and vectorizing the loop could break IEEE floating-point rules. Using -ffast-math lets GCC treat floating-point arithmetic as associative, which allows it to vectorize the reduction. It should be pointed out that an OpenMP reduction already assumes the floating-point arithmetic is associative, so it already breaks IEEE floating-point rules.

    double sum = 0;
    tic();
    for(int c = 0; c < COUNT; ++ c) { 
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();
    printf("sum %f\n", sum);
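
    (tic() and toc() are timing helpers, presumably from the bandwidth tool mentioned above, not standard C functions; the omp_get_wtime() pattern in the full listing below does the same job.)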
    

    Edit:

    I tested your code and made some modifications. I get faster times with the reduction and with memset when using multiple threads:

    max threads: 4
    serial reduction
    dtime 1.86, sum 705032704
    parallel reduction
    dtime 1.39 s, sum 705032704
    serial memset
    dtime 2.95 s
    parallel memset
    dtime 2.44 s
    serial my_memset
    dtime 2.66 s
    parallel my_memset
    dtime 1.35 s
    

    Here is the code I used (compiled with g++ foo.cpp -fopenmp -O3 -ffast-math):

    #include <iostream>
    #include <vector>
    #include <cstdio>
    #include <omp.h>
    
    #include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_stream_si128
    
    void my_memset(int *s, int c, size_t n) {
        __m128i v = _mm_set1_epi32(c);
        for(int i=0; i<(int)(n/4); i++) _mm_stream_si128((__m128i*)&s[4*i], v);
    }
    
    void my_memset_omp(int *s, int c, size_t n) {
        __m128i v = _mm_set1_epi32(c);
        #pragma omp parallel for
        for(int i=0; i<(int)(n/4); i++) _mm_stream_si128((__m128i*)&s[4*i], v);
    }
    
    int main() {
        const size_t sz = 100000000;  // sz and COUNT reconstructed: 50 passes over 1e8 ones gives the printed (int-overflowed) sum 705032704
        const int COUNT = 50;
        std::vector<int> vec(sz, 1);
    
        std::cout << "max threads: " << omp_get_max_threads()<< std::endl;
    
        std::cout << "serial reduction" << std::endl;
        double dtime;
        int sum;
    
        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                sum += vec[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f, sum %d\n", dtime, sum);
    
        int *const ptr = vec.data();
        const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int
    
        std::cout << "parallel reduction" << std::endl;
    
    
        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none) reduction(+:sum)
            for(int i = 0; i < sz_i; ++ i)
                sum += ptr[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s, sum %d\n", dtime, sum);
    
        std::cout << "serial memset" << std::endl;
    
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                vec[i] = 0;
        }   
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "parallel memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none)
            for(int i = 0; i < sz_i; ++ i)
                ptr[i] = 0;
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "serial my_memset" << std::endl;
    
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);
    
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "parallel my_memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        return 0;
    }
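
    To compare thread counts without recompiling, set the standard OMP_NUM_THREADS environment variable when running:

    g++ foo.cpp -fopenmp -O3 -ffast-math -o foo
    OMP_NUM_THREADS=1 ./foo
    OMP_NUM_THREADS=4 ./foo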
    
