Eigen & OpenMP: No parallelisation due to false sharing and thread overhead


Question


System Specification:

  1. Intel Xeon E7 v3 processor (4 sockets, 16 cores/socket, 2 threads/core)
  2. Eigen library and C++

The following is the serial implementation of the code snippet:

Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {

    Eigen::VectorXd row(nCols);
    for (int k=0; k<nCols; ++k) {
        row(k) = get_Matrix_Entry(j,k+nColStart);
    }
    return row;
}

double get_Matrix_Entry(int x , int y){
    return exp(-(x-y)*(x-y));
} 

I need to parallelise the get_Row part, as nCols can be as large as 10^6, so I tried the following techniques:

  1. Naive parallelisation:

    Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {  
        Eigen::VectorXd row(nCols);
    
        #pragma omp parallel for schedule(static,8)
        for (int k=0; k<nCols; ++k) {
            row(k) = get_Matrix_Entry(j, k+nColStart);
        }

        return row;
    }
    
  2. Strip Mining:

    Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
        int vec_len = 8;
        Eigen::VectorXd row(nCols);
        int cols = nCols;
        int rem = cols % vec_len;    // peel off the remainder so the main loop length is a multiple of vec_len
        if (rem != 0)
            cols -= rem;
    
        #pragma omp parallel for
        for (int ii=0; ii<cols; ii+=vec_len) {
            for (int i=ii; i<ii+vec_len; i++) {    // i declared inside so each thread has its own copy
                row(i) = get_Matrix_Entry(j, i+nColStart);
            }
        }
    
        for (int jj=cols; jj<nCols; jj++)    // handle the remainder serially
            row(jj) = get_Matrix_Entry(j, jj+nColStart);
    
        return row;
    }
    
  3. A technique found online to avoid false sharing (but see the note after this list):

    Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
        int cache_line_size = 8;    // 8 doubles = 64 bytes
        Eigen::MatrixXd row_m(nCols, cache_line_size);
    
        #pragma omp parallel for schedule(static,1)
        for (int k=0; k<nCols; ++k)
            row_m(k,0) = get_Matrix_Entry(j, k+nColStart);
    
        Eigen::VectorXd row(nCols);
        row = row_m.block(0, 0, nCols, 1);
    
        return row;
    }
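
Note that Eigen::MatrixXd is column-major by default, so the successive row_m(k,0) writes above are still adjacent in memory and do not actually land on separate cache lines. For the padding idea to work as intended, the scratch matrix would presumably need row-major storage so that each entry (k,0) starts its own cache line. A minimal sketch of that variant (get_Row_padded is illustrative, not one of the original attempts, and assumes 64-byte cache lines):

    Eigen::VectorXd get_Row_padded(const int j, const int nColStart, const int nCols) {
        const int doubles_per_line = 8;    // 64-byte cache line / sizeof(double)
        // row-major: each logical row occupies one full cache line
        Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>
            row_m(nCols, doubles_per_line);
    
        #pragma omp parallel for schedule(static,1)
        for (int k = 0; k < nCols; ++k)
            row_m(k, 0) = get_Matrix_Entry(j, k + nColStart);    // one write per cache line
    
        return row_m.col(0);    // strided copy back into a dense vector
    }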
    

OUTPUT:

None of the above techniques reduced the time taken to execute get_Row for large nCols; the naive parallelisation performed about the same as the other techniques (although all were better than serial). Are there any suggestions or methods that could help improve the time?

As requested by user Avi Ginsburg, here are some additional system details:

  • Compiler: g++ (GCC) version 4.4.7
  • Eigen library version: 3.3.2
  • Compiler flags used: "-c -fopenmp -Wall -march=native -O3 -funroll-all-loops -ffast-math -ffinite-math-only -I header", where header is the folder containing Eigen.
  • Output of gcc -march=native -Q --help=target (only some of the flags are shown):

    -mavx                [enabled]
    -mfancy-math-387     [enabled]
    -mfma                [disabled]
    -march=              core2

For a full description of the flags, please see this.


Answer 1:


Try rewriting your functions as a single expression and let Eigen vectorize itself, i.e.:

Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {

    Eigen::VectorXd row(nCols);

    row = (-( Eigen::VectorXd::LinSpaced(nCols, nColStart, nColStart + nCols - 1).array()
                      - double(j)).square()).exp().matrix();

    return row;
}

Make sure to use -mavx and -mfma (or -march=native) when compiling. This gives me a ~4x speedup on an i7 (I know you are trying to use 64/128 threads, but this is with a single thread).
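
For instance, a representative compile line (assembled from the flags given elsewhere in this post; adjust the file names and paths to your setup):

    g++ -std=c++11 -O3 -march=native -fopenmp -I header main.cpp -o main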

You can enable OpenMP for some further speedup by dividing the computation into segments:

Eigen::VectorXd get_Row_omp(const int j, const int nColStart, const int nCols) {

    Eigen::VectorXd row(nCols);

#pragma omp parallel
    {
        int num_threads = omp_get_num_threads();
        int tid = omp_get_thread_num();
        // each thread gets ceil(nCols / num_threads) elements,
        // clamped so the final segment stays within bounds
        int n_per_thread = nCols / num_threads;
        if (n_per_thread * num_threads < nCols) n_per_thread++;
        int start = tid * n_per_thread;
        int len = n_per_thread;
        if (start + len > nCols) len = nCols - start;

        // each thread writes its own contiguous segment of row
        if (start < nCols)
            row.segment(start, len) = (-(Eigen::VectorXd::LinSpaced(len,
                               nColStart + start, nColStart + start + len - 1)
                            .array() - double(j)).square()).exp().matrix();

    }
    return row;

}

For me (4 cores), I get an additional ~3.3x speedup when computing 10^8 elements, but I expect this to be lower for 10^6 elements and/or 64/128 cores (normalizing for the number of cores, of course).
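
To pin down the thread count when comparing runs (this is standard OpenMP, not specific to this code), you can set it before calling the parallel version:

    omp_set_num_threads(4);    // or set OMP_NUM_THREADS=4 in the environment
    Eigen::VectorXd r = get_Row_omp(5, 3, 1000000);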

Edit

I hadn't placed any checks to make sure that the OMP threads didn't go out of bounds, and I had mixed up the second and third arguments in the Eigen::VectorXd::LinSpaced call of the serial version. That probably accounted for any errors you had. Additionally, I've pasted the code that I used for testing below. I compiled with g++ -std=c++11 -fopenmp -march=native -O3; adapt to your needs.
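
For reference, Eigen's LinSpaced takes the size first and then the two range endpoints; this is the argument order that had been mixed up:

    // Eigen::VectorXd::LinSpaced(size, low, high)
    Eigen::VectorXd v = Eigen::VectorXd::LinSpaced(4, 0.0, 3.0);    // 0 1 2 3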

#include <Eigen/Core>
#include <cmath>
#include <iostream>
#include <omp.h>


double get_Matrix_Entry(int x, int y) {
        return exp(-(x - y)*(x - y));
}

Eigen::VectorXd get_RowOld(const int j, const int nColStart, const int nCols) {

        Eigen::VectorXd row(nCols);
        for (int k = 0; k<nCols; ++k) {
                row(k) = get_Matrix_Entry(j, k + nColStart);
        }
        return row;
}


Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {

        Eigen::VectorXd row(nCols);

        row = (-( Eigen::VectorXd::LinSpaced(nCols, nColStart, nColStart + nCols - 1).array() - double(j)).square()).exp().matrix();

        return row;
}

Eigen::VectorXd get_Row_omp(const int j, const int nColStart, const int nCols) {

        Eigen::VectorXd row(nCols);

#pragma omp parallel
        {
                int num_threads = omp_get_num_threads();
                int tid = omp_get_thread_num();
                // each thread gets ceil(nCols / num_threads) elements,
                // clamped so the final segment stays within bounds
                int n_per_thread = nCols / num_threads;
                if (n_per_thread * num_threads < nCols) n_per_thread++;
                int start = tid * n_per_thread;
                int len = n_per_thread;
                if (start + len > nCols) len = nCols - start;


#pragma omp critical
{
        std::cout << tid << "/" << num_threads << "\t" << n_per_thread << "\t" << start <<
                                                         "\t" << len << "\t" << start+len << "\n\n";
}

                if(start < nCols)
                        row.segment(start, len) = (-(Eigen::VectorXd::LinSpaced(len, nColStart + start, nColStart + start + len - 1).array() - double(j)).square()).exp().matrix();

        }
        return row;
}

int main()
{
        std::cout << EIGEN_WORLD_VERSION << '.' << EIGEN_MAJOR_VERSION << '.' << EIGEN_MINOR_VERSION << '\n';
        volatile int b = 3;
        int sz = 6553600;
        sz = 16;
        b = 6553500;
        b = 3;
        {
                auto beg = omp_get_wtime();
                auto r = get_RowOld(5, b, sz);
                auto end = omp_get_wtime();
                auto diff = end - beg;
                std::cout << r.rows() << "\t" << r.cols() << "\n";
//              std::cout << r.transpose() << "\n";
                std::cout << "Old: " << r.mean() << "\n" << diff << "\n\n";

                beg = omp_get_wtime();
                auto r2 = get_Row(5, b, sz);
                end = omp_get_wtime();
                diff = end - beg;
                std::cout << r2.rows() << "\t" << r2.cols() << "\n";
//              std::cout << r2.transpose() << "\n";
                std::cout << "Eigen:         " << (r2-r).cwiseAbs().sum() << "\t" << (r-r2).cwiseAbs().mean() << "\n" << diff << "\n\n";

                auto omp_beg = omp_get_wtime();
                auto r3 = get_Row_omp(5, b, sz);
                auto omp_end = omp_get_wtime();
                auto omp_diff = omp_end - omp_beg;
                std::cout << r3.rows() << "\t" << r3.cols() << "\n";
//              std::cout << r3.transpose() << "\n";
                std::cout << "OMP and Eigen: " << (r3-r).cwiseAbs().sum() << "\t" << (r - r3).cwiseAbs().mean() << "\n" << omp_diff << "\n";
        }

        return 0;

}


Source: https://stackoverflow.com/questions/41845724/eigen-openmp-no-parallelisation-due-to-false-sharing-and-thread-overhead
