Efficient parallelisation of a linear algebraic function in C++ OpenMP


If you use a compiler which corrects your bad loop nests and fuses loops to improve memory locality for non-parallel builds, OpenMP will likely disable those optimisations. As recommended by others, you should consider an optimised library such as MKL or ACML. The default gfortran BLAS typically provided with distros is not multithreaded.
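A quick way to check that last point ( a minimal sketch, not part of the original answer, assuming Armadillo is linked against OpenBLAS or MKL ): time a plain GEMM, then re-run the binary with, e.g., OPENBLAS_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=4 exported in the environment; if the timing does not change, the BLAS behind Armadillo is effectively single-threaded.

    // gemm_check.cpp -- build e.g.: g++ -O2 gemm_check.cpp -larmadillo
    // Times one large matrix product, which Armadillo delegates to the linked BLAS dgemm.
    #include <armadillo>
    #include <iostream>

    int main()
    {
        arma::mat A = arma::randu<arma::mat>( 2000, 2000 );
        arma::mat B = arma::randu<arma::mat>( 2000, 2000 );

        arma::wall_clock timer;
        timer.tic();
        arma::mat C = A * B;                    // -> BLAS dgemm
        std::cout << "GEMM took " << timer.toc() << " [s]\n";
        return 0;
    }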

The Art of HPC is exactly about this efficiency ( poor grants never get HPC-cluster quota )

  • so the first hope is that your process will never re-read data from a file

Why? This would be an HPC-killer:

I need to repeat this computation many thousands of times

Fair enough to say, this remark alone increases the need to completely review the approach and to re-design the future solution so that it does not rely on a few tricks, but indeed gains from your case-specific arrangement.

Last but not least - the [PARALLEL] scheduling is not needed, as a "just"-[CONCURRENT]-process scheduling is quite enough here. There is no need to orchestrate any explicit inter-process synchronisation or any message-passing, and the process can simply be orchestrated for the best performance possible.
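For illustration only ( a hedged sketch, not the original code; run_one_case() is a hypothetical stand-in for one complete, fully independent dBdt run ), such a "just"-[CONCURRENT] arrangement simply spreads the mutually independent repetitions over the cores, with no locks, atomics or barriers inside:

    // concurrent_runs.cpp -- compile with -fopenmp
    #include <cmath>
    #include <vector>

    // Hypothetical stand-in for one complete, fully independent dBdt evaluation run.
    static double run_one_case( int case_id )
    {
        return std::sin( static_cast<double>( case_id ) );  // placeholder work
    }

    int main()
    {
        const int n_cases = 10000;                  // "many thousands of times"
        std::vector<double> results( n_cases );

        // plain concurrent scheduling: no explicit inter-process synchronisation,
        // no message-passing, each iteration owns its own slot in results[]
        #pragma omp parallel for schedule( static )
        for ( int i = 0; i < n_cases; ++i )
            results[i] = run_one_case( i );
        return 0;
    }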


No "...quick glance at a bit of code..." will help

You first need to understand both your whole process and the hardware resources it will be executed on.

The CPU type will tell you the available instruction-set extensions for advanced tricks, while the L3- / L2- / L1-cache sizes plus the cache-line size will help you decide on the most cache-friendly re-use of cheap data-access ( not paying hundreds of [ns] per access if one can operate smarter, on just a few [ns], from a not-yet-evicted NUMA-core-local copy ).
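On Linux those numbers can be read straight from sysfs before deciding on any blocking / tiling scheme ( a Linux-specific sketch, not part of the original answer; the index0..index3 to L1d / L1i / L2 / L3 mapping is typical, not guaranteed ):

    // cache_info.cpp -- prints per-core cache level, size and cache-line size from sysfs
    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        const std::string base = "/sys/devices/system/cpu/cpu0/cache/";
        for ( int idx = 0; idx < 4; ++idx )         // index0..index3, typically L1d/L1i/L2/L3
        {
            std::ifstream level( base + "index" + std::to_string( idx ) + "/level" );
            std::ifstream size ( base + "index" + std::to_string( idx ) + "/size" );
            std::ifstream line ( base + "index" + std::to_string( idx ) + "/coherency_line_size" );
            std::string l, s, c;
            if ( ( level >> l ) && ( size >> s ) && ( line >> c ) )
                std::cout << "L" << l << ": " << s << ", cache-line " << c << " [B]\n";
        }
        return 0;
    }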


The Maths first, implementation next:

As given, dBdt = B % ( R - ( I * B ) ) + ( B * D ) - ( B * e )

On a closer look, anyone ought to be ready to realise the HPC / cache-alignment priorities and the wrong-looping traps:

dBdt = B % ( R - ( I * B ) )   ELEMENT-WISE OP B[s,n]-COLUMN-WISE
     +               ( B * D ) SUM.PRODUCT  OP B[s,n].ROW-WISE MUL-BY-D[n,n].COL
     -               ( B * e ) ELEMENT-WISE OP B[s,n].ROW-WISE MUL-BY-SCALAR

 ROW/COL-SUM.PRODUCT OP -----------------------------------------+++++++++++++++++++++++++++++++++++++++++++++
 ELEMENT-WISE OP ---------------------------------------------+  |||||||||||||||||||||||||||||||||||||||||||||
 ELEMENT-WISE OP ----------------------+                      |  |||||||||||||||||||||||||||||||||||||||||||||
                                       |                      |  |||||||||||||||||||||||||||||||||||||||||||||
                                       v                      v  vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
 dBdt[s,n]        =   B[s,n]           % /   R[s,n]           - /  I[s,s]                  . B[s,n]           \ \
     _________[n]         _________[n]   |       _________[n]   |      ________________[s]       _________[n]  | |
    |_|       |          |_|       |     |      |_|       |     |     |________________|        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |   =      | .       |   % |      | .       |   - |     |                |   .    | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
 [s]|_________|       [s]|_________|     |   [s]|_________|     |  [s]|________________|     [s]|_|_______|    | |
                                         \                      \                                             / /

                      B[s,n]              D[n,n]          
                          _________[n]        _________[n]
                         |_________|         | |       |   
                         | .       |         | |       |  
                         | .       |         | |       |  
                         | .       |         | |       |  
                  +      | .       |   .  [n]|_|_______|   
                         | .       |      
                         | .       |      
                         | .       |             
                      [s]|_________|      


                      B[s,n]                
                          _________[n]      
                         |_| . . . |        
                         | .       |        
                         | .       |        
                         | .       |        
                  -      | .       |    *  REGISTER_e
                         | .       |        
                         | .       |        
                         | .       |        
                      [s]|_________|        

Having this in mind, efficient HPC loops will look rather different:

Depending on the real CPU caches,
the loop may very efficiently co-process the naturally B-row-aligned ( B * D ) - ( B * e ) part
in a single phase, while the highest-re-use part, the element-wise, longest-pipeline B % ( R - ( I * B ) ), gets a chance to re-use ~ 1000 x ( n - 1 ) cache-hits on the B-column-aligned data, which ought to fit quite well into the L1-DATA-cache footprint, so achieving savings in the order of seconds just from cache-aligned loops.
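As an illustration of such a fused, cache-aligned traversal ( a hedged sketch in plain C++, assuming column-major / FORTRAN-order storage as Armadillo uses, with s = 700, n = 30 from the post; the tuned final code would still need profiling on the target CPU ):

    // dBdt(i,j) = B(i,j) * ( R(i,j) - (I*B)(i,j) ) + (B*D)(i,j) - B(i,j) * e
    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>;                   // column-major: A[i + rows*j]

    void dBdt_fused( const Mat& B, const Mat& R, const Mat& I,   // B,R: s x n, I: s x s
                     const Mat& D, double e,                     // D:   n x n, e scalar
                     Mat& dBdt, std::size_t s, std::size_t n )
    {
        for ( std::size_t j = 0; j < n; ++j )          // outer sweep, column by column
            for ( std::size_t i = 0; i < s; ++i )      // contiguous walk down column j
            {
                double IB = 0.0;                       // ( I * B )(i,j)
                for ( std::size_t k = 0; k < s; ++k )
                    IB += I[i + s*k] * B[k + s*j];     // column j of B is contiguous

                double BD = 0.0;                       // ( B * D )(i,j)
                for ( std::size_t k = 0; k < n; ++k )
                    BD += B[i + s*k] * D[k + n*j];     // column j of D is contiguous

                dBdt[i + s*j] = B[i + s*j] * ( R[i + s*j] - IB )
                              + BD
                              - B[i + s*j] * e;
            }
    }

In the ( I * B ) inner loop, column j of B is read contiguously while row i of I is walked with stride s; that strided access ( and the analogous one on B in the ( B * D ) loop ) is exactly where the duplicated, C-order copy of the data mentioned in Step 1 below could still pay off.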


After this cache-friendly loop-alignment is finished,

only then may distributed processing help, not before.

So, an experimentation plan setup:

Step 0: The Ground-Truth: ~ 0.13 [s] for dBdt[700,30] using Armadillo, over a 100-test-loop
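A minimal baseline sketch for this step ( dimensions from the post: B, R are [700,30], I is [700,700], D is [30,30], e a scalar; the values themselves are placeholders ):

    // baseline.cpp -- build e.g.: g++ -O2 baseline.cpp -larmadillo
    #include <armadillo>
    #include <iostream>

    int main()
    {
        const arma::uword s = 700, n = 30;              // from dBdt[700,30]
        arma::mat B = arma::randu<arma::mat>( s, n );
        arma::mat R = arma::randu<arma::mat>( s, n );
        arma::mat I = arma::randu<arma::mat>( s, s );
        arma::mat D = arma::randu<arma::mat>( n, n );
        const double e = 0.01;                          // placeholder scalar

        arma::mat dBdt( s, n );
        arma::wall_clock timer;
        timer.tic();
        for ( int run = 0; run < 100; ++run )           // the 100-test-loop
            dBdt = B % ( R - ( I * B ) ) + ( B * D ) - ( B * e );
        std::cout << "100 evaluations: " << timer.toc() << " [s]\n";
        return 0;
    }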

Step 1: The manual-serial: - test the rewards of the best cache-aligned code ( not the posted one, but a math-equivalent, cache-line-re-use optimised one, structured like the fused-loop sketch above -- there ought to be no more than 4x for(){...} code-blocks, 2-nested, having the remaining 2 inside, so as to meet the Linear Algebra rules without devastating the benefits of cache-line alignment ), with some residual potential to benefit yet a bit more in [PTIME] from using a duplicated [PSPACE] data-layout ( both a FORTRAN-order and a C-order copy, for the respective re-reading strategies ), as the matrices are so miniature in size and the L2- / L1-DATA-cache available per CPU-core enjoys cache sizes well grown in scale.

Step 2: The manual-omp( <= NUMA_cores - 1 ): - test if omp can indeed yield any "positive" Amdahl's-Law speedup ( beyond the omp-setup overhead costs ). A careful process-to-CPU-core affinity-mapping may help avoid cache-evictions introduced by non-HPC threads spoiling the cache-friendly layout on a configuration-"reserved" set of ( NUMA_cores - 1 ) cores, where all other ( non-HPC ) processes ought to be affinity-mapped onto the last ( shared ) CPU-core, thus helping the HPC-process cores retain their cache-lines un-evicted by any kernel/scheduler-injected non-HPC thread.
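A hedged sketch for this step ( the same fused loop nest as above, with only the outer, column-wise sweep distributed; thread placement is assumed to come from the standard OpenMP environment variables OMP_NUM_THREADS, OMP_PROC_BIND and OMP_PLACES, plus taskset / cgroups for keeping non-HPC processes on the shared core ):

    // dBdt_omp.cpp -- compile with -fopenmp
    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>;                   // column-major: A[i + rows*j]

    void dBdt_fused_omp( const Mat& B, const Mat& R, const Mat& I,
                         const Mat& D, double e,
                         Mat& dBdt, std::size_t s, std::size_t n )
    {
        #pragma omp parallel for schedule( static )    // columns are fully independent
        for ( std::ptrdiff_t j = 0; j < static_cast<std::ptrdiff_t>( n ); ++j )
            for ( std::size_t i = 0; i < s; ++i )
            {
                double IB = 0.0, BD = 0.0;
                for ( std::size_t k = 0; k < s; ++k ) IB += I[i + s*k] * B[k + s*j];
                for ( std::size_t k = 0; k < n; ++k ) BD += B[i + s*k] * D[k + n*j];
                dBdt[i + s*j] = B[i + s*j] * ( R[i + s*j] - IB ) + BD - B[i + s*j] * e;
            }
    }

With only n = 30 columns the parallel region is tiny, so the per-call fork/join overhead may easily dominate; comparing this against the Step-1 serial code on identical data is exactly what this step is meant to measure.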

( As seen in (2), there are arrangements, derived from HPC best-practices, that no compiler ( even a magic-wand equipped one ) would ever be able to implement, so do not hesitate to ask your PhD Tutor for a helping hand if your Thesis needs some HPC expertise, as it is not so easy to build on trial-and-error in this quite expensive experimental domain, and your primary domain is neither Linear Algebra nor ultimate CS-theoretic / HW-specific cache-strategy optimisations. )


Epilogue:

Using smart tools in an inappropriate way does not bring anything more than additional overheads ( task-splits/joins + memory-translations ( worse with atomic-locking ( worst with blocking / fence / barriers ) ) ).
