OpenMP embarrassingly parallel for loop, no speedup

故里飘歌 2021-01-05 08:41

I have what seems to be a very simple parallel for loop that just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop runs.

3 Answers
  • 2021-01-05 08:49

    Besides your error in using the clock function on Linux, the rest of your question can be answered by reading these questions/answers:

    in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987

    measuring-memory-bandwidth-from-the-dot-product-of-two-arrays

    memset-in-parallel-with-threads-bound-to-each-physical-core

    So you should see a significant benefit from multiple threads with memset and with a reduction, even on a single-socket system. I have written my own tool to measure bandwidth for this. Below are some results from my i5-4250U (Haswell) with 2 cores (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.

    one thread

    C standard library
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.80       6.68       0.00        inf %
    memcpy:              1.00       1.35       7.93       0.00        inf %
    
    Agner Fog's asmlib
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.71       7.53       0.00        inf %
    memcpy:              1.00       0.93      11.51       0.00        inf %
    
    my_memset   
                         0.50       0.71       7.53       0.00        inf %
    
    
    FMA3 reduction tests
                           GB    time(s)       GB/s     GFLOPS   efficiency
    vsum:                0.50       0.53      10.08       2.52        inf %
    vmul:                0.50       0.68       7.93       1.98        inf %
    vtriad:              0.50       0.70       7.71       3.85        inf %
    dot                  1.00       1.08       9.93       2.48        inf %
    

    two threads

    C standard library
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.64       8.33       0.00        inf %
    memcpy:              1.00       1.10       9.76       0.00        inf %
    
    Agner Fog's asmlib
                           GB    time(s)       GB/s     GFLOPS   efficiency
    memset:              0.50       0.36      14.98       0.00        inf %
    memcpy:              1.00       0.66      16.30       0.00        inf %
    
    my_memset
                         0.50       0.36      15.03       0.00        inf %
    
    
    FMA3 tests
    standard sum tests with OpenMP: 2 threads
                           GB    time(s)       GB/s     GFLOPS   efficiency
    vsum:                0.50       0.41      13.03       3.26        inf %
    vmul:                0.50       0.39      13.67       3.42        inf %
    vtriad:              0.50       0.44      12.20       6.10        inf %
    dot                  1.00       0.97      11.11       2.78        inf %
    

    Here is my custom memset function (I have several other tests like this).

    // requires <emmintrin.h> (SSE2); s should be 16-byte aligned and n a multiple of 4
    void my_memset(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);   // broadcast c into all four 32-bit lanes
        #pragma omp parallel for
        for(i=0; i<n/4; i++) {
            _mm_stream_si128((__m128i*)&s[4*i], v);   // 16-byte non-temporal store
        }
    }
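
    The key design choice here is the non-temporal store: _mm_stream_si128 writes each 16-byte block straight to memory, bypassing the cache, so the core does not first have to read in the destination cache line the way an ordinary store does. The questions linked above discuss when streaming stores help and when they hurt.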
    

    Edit:

    You should compile with -O3 and -ffast-math. Define the sum outside of the outer loop and then print it out so GCC does not optimize it away. GCC won't auto-vectorize a reduction because floating-point arithmetic is not associative and vectorizing the loop could break IEEE floating-point rules. Using -ffast-math lets GCC treat floating-point arithmetic as associative, which allows it to vectorize the reduction. It should be pointed out that doing a reduction in OpenMP already assumes the floating-point arithmetic is associative, so it already breaks IEEE floating-point rules.

    double sum = 0;
    tic();
    for(int c = 0; c < COUNT; ++ c) { 
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();
    printf("sum %f\n", sum);
    

    Edit:

    I tested your code and made some modifications. I get faster times for both the reduction and the memset when using multiple threads:

    max threads: 4
    serial reduction
    dtime 1.86, sum 705032704
    parallel reduction
    dtime 1.39 s, sum 705032704
    serial memset
    dtime 2.95 s
    parallel memset
    dtime 2.44 s
    serial my_memset
    dtime 2.66 s
    parallel my_memset
    dtime 1.35 s
    

    Here is the code I used (g++ foo.cpp -fopenmp -O3 -ffast-math)

    #include <omp.h>
    #include <vector>
    #include <iostream>
    #include <ctime>
    #include <stdio.h>
    
    #include <emmintrin.h>  // SSE2 intrinsics: _mm_set1_epi32, _mm_stream_si128
    
    void my_memset(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);
        for(i=0; i<n/4; i++) {
            _mm_stream_si128((__m128i*)&s[4*i], v);
        }
    }
    
    void my_memset_omp(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);
        #pragma omp parallel for
        for(i=0; i<n/4; i++) {
            _mm_stream_si128((__m128i*)&s[4*i], v);
        }
    }
    
    int main(int argc, const char *argv[])
    {
        const int COUNT = 100;
        const size_t sz = 250000 * 200;
        std::vector<int> vec(sz, 1);
    
        std::cout << "max threads: " << omp_get_max_threads()<< std::endl;
    
        std::cout << "serial reduction" << std::endl;
        double dtime;
        int sum;
    
        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                sum += vec[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f, sum %d\n", dtime, sum);
    
        int *const ptr = vec.data();
    const int sz_i = int(sz); // older OpenMP implementations (pre-3.0) only allow a signed int loop variable in parallel for
    
        std::cout << "parallel reduction" << std::endl;
    
    
        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
        // note: newer GCC releases no longer predetermine const variables as shared,
        // so with default(none) you may need to add shared(ptr, sz_i) here
        #pragma omp parallel for default(none) reduction(+:sum)
            for(int i = 0; i < sz_i; ++ i)
                sum += ptr[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s, sum %d\n", dtime, sum);
    
        std::cout << "serial memset" << std::endl;
    
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                vec[i] = 0;
        }   
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "parallel memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none)
            for(int i = 0; i < sz_i; ++ i)
                ptr[i] = 0;
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "serial my_memset" << std::endl;
    
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);
    
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        std::cout << "parallel my_memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);
    
        return 0;
    }
    
  • 2021-01-05 08:52

    You are using std::clock, which reports CPU time used, not real (wall-clock) time. As such, the CPU time of every thread is added up, and the total will always be higher than the single-threaded time (due to overhead), as illustrated in the sketch below.

    http://en.cppreference.com/w/cpp/chrono/c/clock
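
    To see the difference, time the parallel loop with both std::clock and a wall-clock timer such as omp_get_wtime. A minimal sketch (illustrative sizes, not the asker's code):

    #include <omp.h>
    #include <ctime>
    #include <stdio.h>
    #include <vector>
    
    int main() {
        std::vector<int> vec(50000000, 1);   // illustrative size
        int *ptr = vec.data();
        const int n = int(vec.size());
    
        std::clock_t c0 = std::clock();   // CPU time, summed over all threads
        double w0 = omp_get_wtime();      // wall-clock time, what you actually wait for
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            ptr[i] = 0;
        printf("cpu %.3f s, wall %.3f s\n",
               double(std::clock() - c0) / CLOCKS_PER_SEC,
               omp_get_wtime() - w0);     // CPU time grows with thread count; wall time does not
        return 0;
    }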

  • 2021-01-05 09:05

    You spotted the timing error. There is still no speedup because both of your test cases are heavily memory bound: on typical consumer hardware all of the cores share one memory bus, so using more threads does not give you more bandwidth, and since bandwidth is the bottleneck there is no speedup.

    This will probably change if you shrink the problem so that it fits into the cache, and it will certainly change if you increase the amount of computation per data element, for example by computing the reduction of exp(vec[i]) or 1/vec[i] (see the sketch below).

    For the memset: a single thread can already saturate the memory bus, so you will never see a speedup there (unless more threads give you access to a second memory bus, as on some multi-socket architectures). One remark regarding the reduction: it is most probably not implemented with a lock, which would be horribly inefficient, but with per-thread partial sums combined in an addition tree, whose logarithmic depth costs comparatively little.
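
    A minimal sketch of that idea, using the 1/vec[i] variant mentioned above (illustrative code, not taken from the question; compile with g++ -O3 -ffast-math -fopenmp):

    #include <omp.h>
    #include <vector>
    #include <stdio.h>
    
    int main() {
        const size_t sz = 50000000;
        std::vector<double> vec(sz, 3.0);
        double *ptr = vec.data();
        const int n = int(sz);
    
        double sum = 0.0;
        double dtime = -omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / ptr[i];   // extra arithmetic per element: no longer purely bandwidth bound
        dtime += omp_get_wtime();
        printf("dtime %.2f s, sum %f\n", dtime, sum);   // should improve with more threads
        return 0;
    }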
