OpenMP performance

伪装坚强ぢ 2020-12-13 14:33

Firstly, I know this [type of] question is frequently asked, so let me preface this by saying I've read as much as I can, and I still don't know what the deal is.

3 Answers
  •  猫巷女王i
    2020-12-13 15:31

    So after some fairly extensive profiling (thanks to this great post for info on gprof and time sampling with gdb), which involved writing a big wrapper function to generate production-level code for profiling, it became obvious that, for the vast majority of the time, when I aborted the running code with gdb and ran backtrace, the stack was in an STL call, manipulating a vector in some way.

    The code passes a few vectors into the parallel section as private variables, which seemed to work fine. However, after pulling out all the vectors and replacing them with arrays (and some other jiggery-pokery to make that work) I saw a significant speed up. With small, artificial data sets the speed up is near perfect (i.e. as you double the number of threads you halve the time), while with real data sets the speed up isn't quite as good, but this makes complete sense in the context of how the code works.

    It seems that for whatever reason (maybe some static or global variables deep in the STL implementation?) when there are loops moving through hundreds of thousands of iterations in parallel there is some deep level locking, which occurs in Linux (Ubuntu 12.01 and CentOS 6.2) but not in OSX.

    I'm really intrigued as to why I see this difference. Could it be a difference in how the STL is implemented (the OSX version was compiled under GNU GCC 4.7, as were the Linux ones), or is this to do with context switching (as suggested by Arne Babenhauserheide)?

    In summary, my debugging process was as follows:

    • Initial profiling from within R to identify the issue

    • Ensured there were no static variables acting as shared variables

    • Profiled with strace -f and ltrace -f which was really helpful in identifying locking as the culprit

    • Profiled with valgrind to look for any errors

    • Tried a variety of combinations for the schedule type (auto, guided, static, dynamic) and chunk size.

    • Tried binding threads to specific processors

    • Avoided false sharing by creating thread-local buffers for values, and then implementing a single synchronization event at the end of the for-loop

    • Removed all the mallocing and freeing from within the parallel region - didn't help with the issue but did provide a small general speedup

    • Tried on various architectures and OSes - didn't really help in the end, but did show that this was a Linux vs. OSX issue and not a supercomputer vs. desktop one

    • Built a version which implements concurrency using a fork() call - splitting the workload between two processes. This halved the time on both OSX and Linux, which was good

    • Built a data simulator to replicate production data loads

    • gprof profiling

    • gdb time sampling profiling (abort and backtrace)

    • Commented out vector operations

    • Had this not worked, Arne Babenhauserheide's link looked like it might well have some crucial stuff on memory-fragmentation issues with OpenMP
