Efficient parallelisation of a linear algebraic function in C++ OpenMP


If you use a compiler which corrects your bad loop nests and fuses loops to improve memory locality for non-parallel builds, OpenMP will likely disable those optimisations. As recommended by others, you should consider an optimised library such as MKL or ACML. The default gfortran BLAS typically provided with distros is not multithreaded.
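A quick way to check that last point ( a minimal sketch, not part of the original answer, assuming Armadillo is linked against OpenBLAS or MKL ): time a plain GEMM, then re-run the binary with, e.g., OPENBLAS_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=4 exported in the environment; if the timing does not change, the BLAS behind Armadillo is effectively single-threaded.

    // gemm_check.cpp -- build e.g.: g++ -O2 gemm_check.cpp -larmadillo
    // Times one large matrix product, which Armadillo delegates to the linked BLAS dgemm.
    #include <armadillo>
    #include <iostream>

    int main()
    {
        arma::mat A = arma::randu<arma::mat>( 2000, 2000 );
        arma::mat B = arma::randu<arma::mat>( 2000, 2000 );

        arma::wall_clock timer;
        timer.tic();
        arma::mat C = A * B;                    // -> BLAS dgemm
        std::cout << "GEMM took " << timer.toc() << " [s]\n";
        return 0;
    }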

The Art of HPC is exactly about this efficiency ( poor grants never get HPC-cluster quota )

  • so the first hope is that your process will never re-read data from a file

Why? This would be an HPC-killer:

I need to repeat this computation many thousands of times

Fair enough to say, this remark alone increases the need to completely review the approach and to re-design the future solution so that it does not rely on a few tricks, but indeed gains from your case-specific arrangement.

Last but not least - the [PARALLEL] scheduling is not needed, as a "just"-[CONCURRENT]-process scheduling is quite enough here. There is no need to orchestrate any explicit inter-process synchronisation or any message-passing, and the process can simply be orchestrated for the best performance possible.
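For illustration only ( a hedged sketch, not the original code; run_one_case() is a hypothetical stand-in for one complete, fully independent dBdt run ), such a "just"-[CONCURRENT] arrangement simply spreads the mutually independent repetitions over the cores, with no locks, atomics or barriers inside:

    // concurrent_runs.cpp -- compile with -fopenmp
    #include <cmath>
    #include <vector>

    // Hypothetical stand-in for one complete, fully independent dBdt evaluation run.
    static double run_one_case( int case_id )
    {
        return std::sin( static_cast<double>( case_id ) );  // placeholder work
    }

    int main()
    {
        const int n_cases = 10000;                  // "many thousands of times"
        std::vector<double> results( n_cases );

        // plain concurrent scheduling: no explicit inter-process synchronisation,
        // no message-passing, each iteration owns its own slot in results[]
        #pragma omp parallel for schedule( static )
        for ( int i = 0; i < n_cases; ++i )
            results[i] = run_one_case( i );
        return 0;
    }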


No "...quick glance at a bit of code..." will help

You first need to understand both your whole process and the hardware resources it will be executed on.

The CPU type will tell you the available instruction-set extensions for advanced tricks, while the L3- / L2- / L1-cache sizes plus the cache-line size will help you decide on the most cache-friendly re-use of cheap data-access ( not paying hundreds of [ns] per access if one can operate smarter, on just a few [ns], from a not-yet-evicted NUMA-core-local copy ).
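On Linux those numbers can be read straight from sysfs before deciding on any blocking / tiling scheme ( a Linux-specific sketch, not part of the original answer; the index0..index3 to L1d / L1i / L2 / L3 mapping is typical, not guaranteed ):

    // cache_info.cpp -- prints per-core cache level, size and cache-line size from sysfs
    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        const std::string base = "/sys/devices/system/cpu/cpu0/cache/";
        for ( int idx = 0; idx < 4; ++idx )         // index0..index3, typically L1d/L1i/L2/L3
        {
            std::ifstream level( base + "index" + std::to_string( idx ) + "/level" );
            std::ifstream size ( base + "index" + std::to_string( idx ) + "/size" );
            std::ifstream line ( base + "index" + std::to_string( idx ) + "/coherency_line_size" );
            std::string l, s, c;
            if ( ( level >> l ) && ( size >> s ) && ( line >> c ) )
                std::cout << "L" << l << ": " << s << ", cache-line " << c << " [B]\n";
        }
        return 0;
    }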


The Maths first, implementation next:

As given, dBdt = B % ( R - ( I * B ) ) + ( B * D ) - ( B * e )

On a closer look, anyone ought to be ready to realise the HPC / cache-alignment priorities and the wrong-looping traps:

dBdt = B % ( R - ( I * B ) )   ELEMENT-WISE OP B[s,n]-COLUMN-WISE
     +               ( B * D ) SUM.PRODUCT  OP B[s,n].ROW-WISE MUL-BY-D[n,n].COL
     -               ( B * e ) ELEMENT-WISE OP B[s,n].ROW-WISE MUL-BY-SCALAR

 ROW/COL-SUM.PRODUCT OP -----------------------------------------+++++++++++++++++++++++++++++++++++++++++++++
 ELEMENT-WISE OP ---------------------------------------------+  |||||||||||||||||||||||||||||||||||||||||||||
 ELEMENT-WISE OP ----------------------+                      |  |||||||||||||||||||||||||||||||||||||||||||||
                                       |                      |  |||||||||||||||||||||||||||||||||||||||||||||
                                       v                      v  vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
 dBdt[s,n]        =   B[s,n]           % /   R[s,n]           - /  I[s,s]                  . B[s,n]           \ \
     _________[n]         _________[n]   |       _________[n]   |      ________________[s]       _________[n]  | |
    |_|       |          |_|       |     |      |_|       |     |     |________________|        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |   =      | .       |   % |      | .       |   - |     |                |   .    | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
    | .       |          | .       |     |      | .       |     |     |                |        | |       |    | |
 [s]|_________|       [s]|_________|     |   [s]|_________|     |  [s]|________________|     [s]|_|_______|    | |
                                         \                      \                                             / /

                      B[s,n]              D[n,n]          
                          _________[n]        _________[n]
                         |_________|         | |       |   
                         | .       |         | |       |  
                         | .       |         | |       |  
                         | .       |         | |       |  
                  +      | .       |   .  [n]|_|_______|   
                         | .       |      
                         | .       |      
                         | .       |             
                      [s]|_________|      


                      B[s,n]                
                          _________[n]      
                         |_| . . . |        
                         | .       |        
                         | .       |        
                         | .       |        
                  -      | .       |    *  REGISTER_e
                         | .       |        
                         | .       |        
                         | .       |        
                      [s]|_________|        

Having this in mind, efficient HPC loops will look rather different:

Depending on the real CPU caches,
the loop may very efficiently co-process the naturally B-row-aligned ( B * D ) - ( B * e ) part
in a single phase, while the highest-re-use part, the element-wise, longest-pipeline B % ( R - ( I * B ) ), gets a chance to re-use ~ 1000 x ( n - 1 ) cache-hits on the B-column-aligned data, which ought to fit quite well into the L1-DATA-cache footprint, so achieving savings in the order of seconds just from cache-aligned loops.
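As an illustration of such a fused, cache-aligned traversal ( a hedged sketch in plain C++, assuming column-major / FORTRAN-order storage as Armadillo uses, with s = 700, n = 30 from the post; the tuned final code would still need profiling on the target CPU ):

    // dBdt(i,j) = B(i,j) * ( R(i,j) - (I*B)(i,j) ) + (B*D)(i,j) - B(i,j) * e
    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>;                   // column-major: A[i + rows*j]

    void dBdt_fused( const Mat& B, const Mat& R, const Mat& I,   // B,R: s x n, I: s x s
                     const Mat& D, double e,                     // D:   n x n, e scalar
                     Mat& dBdt, std::size_t s, std::size_t n )
    {
        for ( std::size_t j = 0; j < n; ++j )          // outer sweep, column by column
            for ( std::size_t i = 0; i < s; ++i )      // contiguous walk down column j
            {
                double IB = 0.0;                       // ( I * B )(i,j)
                for ( std::size_t k = 0; k < s; ++k )
                    IB += I[i + s*k] * B[k + s*j];     // column j of B is contiguous

                double BD = 0.0;                       // ( B * D )(i,j)
                for ( std::size_t k = 0; k < n; ++k )
                    BD += B[i + s*k] * D[k + n*j];     // column j of D is contiguous

                dBdt[i + s*j] = B[i + s*j] * ( R[i + s*j] - IB )
                              + BD
                              - B[i + s*j] * e;
            }
    }

In the ( I * B ) inner loop, column j of B is read contiguously while row i of I is walked with stride s; that strided access ( and the analogous one on B in the ( B * D ) loop ) is exactly where the duplicated, C-order copy of the data mentioned in Step 1 below could still pay off.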


After this cache-friendly loop-alignment is finished,

only then may distributed processing help, not before.

So, an experimentation plan setup:

Step 0: The Ground-Truth: ~ 0.13 [s] for dBdt[700,30] using Armadillo, over a 100-test-loop
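A minimal baseline sketch for this step ( dimensions from the post: B, R are [700,30], I is [700,700], D is [30,30], e a scalar; the values themselves are placeholders ):

    // baseline.cpp -- build e.g.: g++ -O2 baseline.cpp -larmadillo
    #include <armadillo>
    #include <iostream>

    int main()
    {
        const arma::uword s = 700, n = 30;              // from dBdt[700,30]
        arma::mat B = arma::randu<arma::mat>( s, n );
        arma::mat R = arma::randu<arma::mat>( s, n );
        arma::mat I = arma::randu<arma::mat>( s, s );
        arma::mat D = arma::randu<arma::mat>( n, n );
        const double e = 0.01;                          // placeholder scalar

        arma::mat dBdt( s, n );
        arma::wall_clock timer;
        timer.tic();
        for ( int run = 0; run < 100; ++run )           // the 100-test-loop
            dBdt = B % ( R - ( I * B ) ) + ( B * D ) - ( B * e );
        std::cout << "100 evaluations: " << timer.toc() << " [s]\n";
        return 0;
    }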

Step 1: The manual-serial: - test the rewards of the best cache-aligned code ( not the posted one, but a math-equivalent, cache-line-re-use optimised one, structured like the fused-loop sketch above -- there ought to be no more than 4x for(){...} code-blocks, 2-nested, having the remaining 2 inside, so as to meet the Linear Algebra rules without devastating the benefits of cache-line alignment ), with some residual potential to benefit yet a bit more in [PTIME] from using a duplicated [PSPACE] data-layout ( both a FORTRAN-order and a C-order copy, for the respective re-reading strategies ), as the matrices are so miniature in size and the L2- / L1-DATA-cache available per CPU-core enjoys cache sizes well grown in scale.

Step 2: The manual-omp( <= NUMA_cores - 1 ): - test if omp can indeed yield any "positive" Amdahl's-Law speedup ( beyond the omp-setup overhead costs ). A careful process-to-CPU-core affinity-mapping may help avoid cache-evictions introduced by non-HPC threads spoiling the cache-friendly layout on a configuration-"reserved" set of ( NUMA_cores - 1 ) cores, where all other ( non-HPC ) processes ought to be affinity-mapped onto the last ( shared ) CPU-core, thus helping the HPC-process cores retain their cache-lines un-evicted by any kernel/scheduler-injected non-HPC thread.
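A hedged sketch for this step ( the same fused loop nest as above, with only the outer, column-wise sweep distributed; thread placement is assumed to come from the standard OpenMP environment variables OMP_NUM_THREADS, OMP_PROC_BIND and OMP_PLACES, plus taskset / cgroups for keeping non-HPC processes on the shared core ):

    // dBdt_omp.cpp -- compile with -fopenmp
    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>;                   // column-major: A[i + rows*j]

    void dBdt_fused_omp( const Mat& B, const Mat& R, const Mat& I,
                         const Mat& D, double e,
                         Mat& dBdt, std::size_t s, std::size_t n )
    {
        #pragma omp parallel for schedule( static )    // columns are fully independent
        for ( std::ptrdiff_t j = 0; j < static_cast<std::ptrdiff_t>( n ); ++j )
            for ( std::size_t i = 0; i < s; ++i )
            {
                double IB = 0.0, BD = 0.0;
                for ( std::size_t k = 0; k < s; ++k ) IB += I[i + s*k] * B[k + s*j];
                for ( std::size_t k = 0; k < n; ++k ) BD += B[i + s*k] * D[k + n*j];
                dBdt[i + s*j] = B[i + s*j] * ( R[i + s*j] - IB ) + BD - B[i + s*j] * e;
            }
    }

With only n = 30 columns the parallel region is tiny, so the per-call fork/join overhead may easily dominate; comparing this against the Step-1 serial code on identical data is exactly what this step is meant to measure.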

( As seen in (2), there are arrangements, derived from HPC best-practices, that no compiler ( even a magic-wand equipped one ) would ever be able to implement, so do not hesitate to ask your PhD Tutor for a helping hand if your Thesis needs some HPC expertise, as it is not so easy to build on trial-and-error in this quite expensive experimental domain, and your primary domain is neither Linear Algebra nor ultimate CS-theoretic / HW-specific cache-strategy optimisations. )


Epilogue:

Using smart tools in an inappropriate way does not bring anything more than additional overheads ( task-splits/joins + memory-translations ( worse with atomic-locking ( worst with blocking / fence / barriers ) ) ).
