Firstly, I know this [type of] question is frequently asked, so let me preface this by saying I've read as much as I can, and I still don't know what the deal is.
So after some fairly extensive profiling (thanks to this great post for info on gprof and time sampling with gdb), which involved writing a big wrapper function to generate production-level code for profiling, it became obvious that, for the vast majority of the time, when I aborted the running code with gdb and ran backtrace, the stack was in an STL call, manipulating a vector in some way.
The code passes a few vectors into the parallel section as private variables, which seemed to work fine. However, after pulling out all the vectors and replacing them with arrays (and some other jiggery-pokery to make that work) I saw a significant speed up. With small, artificial data sets the speed up is near perfect (i.e. as you double the number of threads you halve the time), while with real data sets the speed up isn't quite as good, but this makes complete sense in the context of how the code works.
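For illustration, here is a minimal sketch of the shape of that change (the function and variable names are hypothetical, not my actual code): a private std::vector whose operations can reach into the allocator inside the hot loop, versus a fixed per-thread buffer that never does.

```cpp
#include <omp.h>
#include <vector>
#include <cstddef>

// Hypothetical "before": each thread gets a private std::vector, and the
// push_back inside the hot loop can call into the (shared) heap allocator.
void before(const double* data, std::size_t n, double* out) {
    std::vector<double> scratch;
    #pragma omp parallel for private(scratch)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        scratch.clear();
        scratch.push_back(data[i]);   // may allocate on first use per thread
        out[i] = scratch[0] * 2.0;    // placeholder for the real work
    }
}

// Hypothetical "after": a plain, fixed-size per-thread array, so the loop
// body makes no STL or allocator calls at all.
void after(const double* data, std::size_t n, double* out) {
    #pragma omp parallel
    {
        double scratch[16];
        #pragma omp for
        for (long i = 0; i < static_cast<long>(n); ++i) {
            scratch[0] = data[i];
            out[i] = scratch[0] * 2.0;
        }
    }
}
```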
It seems that, for whatever reason (maybe some static or global variables deep in the STL implementation?), when there are loops moving through hundreds of thousands of iterations in parallel there is some deep-level locking, which occurs on Linux (Ubuntu 12.01 and CentOS 6.2) but not on OSX.
I'm really intrigued as to why I see this difference. Could it be a difference in how the STL is implemented (the OSX version was compiled under GNU GCC 4.7, as were the Linux ones), or is it to do with context switching (as suggested by Arne Babenhauserheide)?
In summary, my debugging process was as follows:
- Initial profiling from within R to identify the issue
- Ensured there were no static variables acting as shared variables
- Profiled with strace -f and ltrace -f, which was really helpful in identifying locking as the culprit
- Profiled with valgrind to look for any errors
- Tried a variety of combinations of schedule type (auto, guided, static, dynamic) and chunk size (see the scheduling/affinity sketch after this list)
- Tried binding threads to specific processors (also shown in that sketch)
- Avoided false sharing by creating thread-local buffers for values and then implementing a single synchronization event at the end of the for-loop (see the per-thread buffer sketch after this list)
- Removed all the mallocing and freeing from within the parallel region - didn't help with the issue, but did provide a small general speedup
- Tried on various architectures and OSes - didn't really help in the end, but did show that this was a Linux vs. OSX issue and not a supercomputer vs. desktop one
- Built a version that implements concurrency using a fork() call, splitting the workload between two processes (see the fork() sketch after this list). This halved the time on both OSX and Linux, which was good
- Built a data simulator to replicate production data loads
- gprof profiling
- gdb time sampling profiling (abort and backtrace)
- Commented out vector operations
Had this not worked, Arne Babenhauserheide's link looks like it may well have some crucial stuff on memory fragmentation issues with OpenMP.
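The scheduling/affinity experiments looked roughly like the sketch below (hypothetical names, Linux-specific): varying the schedule kind and chunk size is easiest with schedule(runtime) plus the OMP_SCHEDULE environment variable, and each thread can be pinned to a core with sched_setaffinity (GCC's libgomp also honours GOMP_CPU_AFFINITY).

```cpp
#include <omp.h>
#include <sched.h>

// Hypothetical sketch of the scheduling and thread-binding experiments.
void process(const double* in, double* out, long n) {
    #pragma omp parallel
    {
        // Pin this thread to the core matching its OpenMP thread number
        // (Linux only; sched_setaffinity(0, ...) affects the calling thread).
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);
        sched_setaffinity(0, sizeof(set), &set);

        // Schedule kind and chunk picked at run time,
        // e.g. OMP_SCHEDULE="dynamic,64" or OMP_SCHEDULE="guided".
        #pragma omp for schedule(runtime)
        for (long i = 0; i < n; ++i) {
            out[i] = in[i] * 2.0;   // placeholder work
        }
    }
}
```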
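The per-thread buffer pattern (items on false sharing and on removing malloc/free from the parallel region) is roughly this, again as a hedged sketch rather than my actual code: each thread writes only to its own buffer, reserved once before the loop, and everything is merged in a single step after the parallel region's implicit barrier.

```cpp
#include <omp.h>
#include <vector>

// Hypothetical sketch: thread-local buffers filled in parallel, merged once
// at the end, with no allocation inside the hot loop itself.
std::vector<double> process(const std::vector<double>& input) {
    const int nthreads = omp_get_max_threads();
    std::vector<std::vector<double>> local(nthreads);   // one buffer per thread

    #pragma omp parallel
    {
        std::vector<double>& mine = local[omp_get_thread_num()];
        mine.reserve(input.size() / nthreads + 1);       // size up front

        #pragma omp for schedule(static)
        for (long i = 0; i < static_cast<long>(input.size()); ++i) {
            mine.push_back(input[i] * 2.0);   // touches only this thread's buffer
        }
    }   // implicit barrier: the single synchronization point

    // Merge sequentially after the parallel region.
    std::vector<double> result;
    for (const auto& buf : local)
        result.insert(result.end(), buf.begin(), buf.end());
    return result;
}
```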
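And the fork()-based variant had roughly this shape (hypothetical names, error handling omitted): the output buffer lives in anonymous shared memory so both processes can write their half, the parent waits for the child, and because the two processes don't share a heap there is no allocator locking between them.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>

// Placeholder for the real per-element work.
void compute_half(const double* in, double* out, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) out[i] = in[i] * 2.0;
}

void run_forked(const double* in, std::size_t n) {
    // Shared output buffer visible to both parent and child.
    double* out = static_cast<double*>(
        mmap(nullptr, n * sizeof(double), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0));

    pid_t pid = fork();
    if (pid == 0) {                       // child: second half of the work
        compute_half(in, out, n / 2, n);
        _exit(0);
    }
    compute_half(in, out, 0, n / 2);      // parent: first half
    waitpid(pid, nullptr, 0);             // single synchronization point

    // ... use out, then munmap(out, n * sizeof(double));
}
```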