OpenMP: severe performance loss when calling shared references of dynamic arrays

Submitted by 两盒软妹~ on 2019-12-11 20:35:47

Question


I am writing a CFD simulation and want to parallelise my ~10^5-iteration loop (the lattice size), which is part of a member function. The OpenMP implementation is straightforward: I read entries of shared arrays, do calculations with thread-private quantities, and finally write to a shared array again. In every array I only access the element indexed by the loop variable, so I don't expect a race condition and I don't see any reason to flush. Testing the speedup of the code (the parallel part), I find that all but one CPU run at only ~70%. Does anybody have an idea how to improve this?

// 'MyClass' stands in for the (anonymised) lattice class; A..E and lenAr are data members.
void MyClass::funcPar(bool parallel){
#pragma omp parallel
{
    // thread-private temporaries (only 'one' is used in this stripped-down example)
    int one, two, three;
    double four, five;

    #pragma omp for
    for(int b=0; b<lenAr; b++){
        one = A[b]+B[b];   // read shared arrays A, B
        C[b] = one;        // write shared array C, element b only
        one += D[b];
        E[b] = one;
    }
}

}


Answer 1:


Several points, then test code, then discussion:

  1. 10^5 elements isn't that much if each item is an int. The overhead of launching multiple threads can outweigh the benefit (see the sketch after this list).
  2. Compiler optimizations can be hampered when OpenMP is used.
  3. When dealing with only a few operations per chunk of memory, loops can be memory bound (i.e. the CPU spends its time waiting for the requested memory to be delivered).
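
As an aside on point 1 (a sketch, not part of the benchmark below): OpenMP's if clause lets you skip spawning threads altogether when the loop is too small to pay for the overhead. A minimal illustration, assuming std::vector members analogous to the question's arrays; the 50000 threshold is an arbitrary placeholder to be tuned by measurement:

#include <vector>

// Sketch: only parallelise when the loop is large enough to amortise thread startup.
void addArrays(const std::vector<int>& A, const std::vector<int>& B,
               const std::vector<int>& D, std::vector<int>& C,
               std::vector<int>& E)
{
    const int n = static_cast<int>(A.size());
    #pragma omp parallel for if(n > 50000) schedule(static)
    for(int b = 0; b < n; b++){
        int one = A[b] + B[b];   // declared inside the loop body, hence automatically private
        C[b] = one;
        one += D[b];
        E[b] = one;
    }
}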

As promised, here's the code:

#include <iostream>
#include <chrono>
#include <cstdlib>    // srand, atoi
#include <ctime>      // time
#include <omp.h>      // omp_get_thread_num, omp_get_num_threads
#include <Eigen/Core>


Eigen::VectorXi A;
Eigen::VectorXi B;
Eigen::VectorXi D;
Eigen::VectorXi C;
Eigen::VectorXi E;
int size;

void regular()
{
    //#pragma omp parallel
    {
        int one;
//      #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

void parallel()
{
#pragma omp parallel
    {
        int one;
        #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

void vectorized()
{
    C = A+B;
    E = C+D;
}

void both()
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int vals = size / nthreads;
        int startInd = tid * vals;
        if(tid == nthreads - 1)
            vals += size - nthreads * vals;
        auto am = Eigen::Map<Eigen::VectorXi>(A.data() + startInd, vals);
        auto bm = Eigen::Map<Eigen::VectorXi>(B.data() + startInd, vals);
        auto cm = Eigen::Map<Eigen::VectorXi>(C.data() + startInd, vals);
        auto dm = Eigen::Map<Eigen::VectorXi>(D.data() + startInd, vals);
        auto em = Eigen::Map<Eigen::VectorXi>(E.data() + startInd, vals);
        cm = am+bm;
        em = cm+dm;
    }
}
int main(int argc, char* argv[])
{
    srand(time(NULL));
    size = 100000;
    int iterations = 10;
    if(argc > 1)
        size = atoi(argv[1]);
    if(argc > 2)
        iterations = atoi(argv[2]);
    std::cout << "Size: " << size << "\n";
    A = Eigen::VectorXi::Random(size);
    B = Eigen::VectorXi::Random(size);
    D = Eigen::VectorXi::Random(size);
    C = Eigen::VectorXi::Zero(size);
    E = Eigen::VectorXi::Zero(size);

    auto startReg = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        regular();
    auto endReg = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPar = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        parallel();
    auto endPar = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startVec = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        vectorized();
    auto endVec = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPVc = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        both();
    auto endPVc = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    std::cout << "Timings:\n";
    std::cout << "Regular:    " << std::chrono::duration_cast<std::chrono::microseconds>(endReg - startReg).count() / iterations << "\n";
    std::cout << "Parallel:   " << std::chrono::duration_cast<std::chrono::microseconds>(endPar - startPar).count() / iterations << "\n";
    std::cout << "Vectorized: " << std::chrono::duration_cast<std::chrono::microseconds>(endVec - startVec).count() / iterations << "\n";
    std::cout << "Both      : " << std::chrono::duration_cast<std::chrono::microseconds>(endPVc - startPVc).count() / iterations << "\n";

    return 0;
}

I used Eigen as a vector library to help make a point about optimizations, which I'll get to soon. The code was compiled in four different optimization modes:

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -I C:\usr\include source.cpp -o a.exe

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -O1 -I C:\usr\include source.cpp -o aO1.exe

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -O2 -I C:\usr\include source.cpp -o aO2.exe

g++ -fopenmp -std=c++11 -Wall -pedantic -pthread -O3 -I C:\usr\include source.cpp -o aO3.exe

using g++ (x86_64-posix-sjlj, built by strawberryperl.com project) 4.8.3 under Windows.
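
The benchmark binaries take the vector size and the number of iterations as optional command-line arguments (see main() above), for example:

aO3.exe 100000 100
aO3.exe 1000000 100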

Discussion

We'll start by looking at 10^5 vs 10^6 elements, averaged over 100 iterations, without optimizations. All timings below are in microseconds per call.

a.exe, 10^5 elements (without optimizations):

Timings:
Regular:    9300
Parallel:   2620
Vectorized: 2170
Both      : 910

a.exe, 10^6 elements (without optimizations):

Timings:
Regular:    93535
Parallel:   27191
Vectorized: 21831
Both      : 8600

Vectorization (SIMD) trumps OMP in terms of speedup. Combined, we get even better times.

Moving to -O1:

10^5:

Timings:
Regular:    780
Parallel:   300
Vectorized: 80
Both      : 80

10^6:

Timings:
Regular:    7340
Parallel:   2220
Vectorized: 1830
Both      : 1670

Same as without optimizations except that timings are much better.

Skipping ahead to -O3:

10^5:

Timings:
Regular:    380
Parallel:   130
Vectorized: 80
Both      : 70

10^6:

Timings:
Regular:    3080
Parallel:   1750
Vectorized: 1810
Both      : 1680

For 10^5 elements, vectorization still trumps. For 10^6, however, the OMP loop comes in slightly faster than the vectorized version.

In all of the tests, we got roughly a 2x-4x speedup from OMP.

Note: I originally ran the tests while another low-priority process was using all the cores. For some reason this affected mainly the parallel tests and not the others. Make sure you time things correctly.
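
If you want to keep runs comparable (a sketch, not from the original test code): fix the OpenMP thread count before timing, either via the OMP_NUM_THREADS environment variable or omp_set_num_threads(), and OpenMP's own wall-clock timer can serve as a cross-check against std::chrono:

#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(4);        // pin the thread count so runs are comparable
    double t0 = omp_get_wtime();   // OpenMP wall-clock timer
    #pragma omp parallel
    {
        // ... the timed work goes here ...
    }
    double t1 = omp_get_wtime();
    std::printf("elapsed: %f s with %d threads\n", t1 - t0, omp_get_max_threads());
    return 0;
}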

Conclusion

Your minimal code example does not behave as claimed. Issues such as memory access patterns can arise with more complex data. Add enough detail to accurately reproduce your problem (a Minimal, Complete, and Verifiable Example) to get better help.



Source: https://stackoverflow.com/questions/30886056/openmp-severe-perfomance-loss-when-calling-shared-references-of-dynamic-arrays
