Firstly, I know this [type of] question is frequently asked, so let me preface this by saying I've read as much as I can, and I still don't know what the deal is.
It's hard to know for sure what is happening without significant profiling, but the performance curve seems indicative of false sharing: the threads use different objects, but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time.
There's a great article on the topic at Dr. Dobb's:
http://www.drdobbs.com/go-parallel/article/217500206?pgno=1
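To make the effect concrete, here is a minimal sketch (not taken from the question's code) that contrasts two counters sharing a cache line with two counters padded onto separate lines. It assumes a 64-byte cache line and a C++11 compiler; the struct and function names are made up for illustration. On most multicore machines the padded version runs noticeably faster.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Shared {
    std::atomic<long> a{0};   // a and b sit next to each other,
    std::atomic<long> b{0};   // so they typically share one cache line
};

struct Padded {
    alignas(64) std::atomic<long> a{0};   // force each counter onto
    alignas(64) std::atomic<long> b{0};   // its own 64-byte cache line
};

// Two threads hammer on two *different* counters; only the layout differs.
template <typename T>
long long time_increments(T& s) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) s.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) s.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
}

int main() {
    Shared shared;
    Padded padded;
    std::printf("same cache line: %lld ms\n", time_increments(shared));
    std::printf("separate lines:  %lld ms\n", time_increments(padded));
}
```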
In particular, the fact that the routines do a lot of malloc/free could lead to this: with the default allocator, blocks handed out to different threads can end up adjacent in memory and therefore on the same cache line.
One solution is to use a pool-based memory allocator rather than the default allocator, so that each thread tends to allocate memory from a different physical address range.
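As a rough illustration of the idea (not a production allocator, and not from the question's code), here is a toy bump allocator that gives each thread its own block, so allocations made by different threads land in different address ranges and are rounded up to cache-line size. The function name and block size are invented for the example; in practice you'd more likely reach for an allocator with per-thread caches such as tcmalloc, jemalloc, or Hoard.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hands out cache-line-sized chunks from a per-thread 1 MiB block, so
// objects used by different threads never share a cache line.
// This toy version never frees; a real pool would recycle its blocks.
void* thread_local_alloc(std::size_t n) {
    constexpr std::size_t kBlockSize = 1 << 20;   // 1 MiB per thread (arbitrary)
    constexpr std::size_t kCacheLine = 64;        // assumed cache-line size

    thread_local char* block = static_cast<char*>(std::malloc(kBlockSize));
    thread_local std::size_t offset = 0;

    // round the request up to a whole number of cache lines
    n = (n + kCacheLine - 1) & ~(kCacheLine - 1);
    if (block == nullptr || offset + n > kBlockSize)
        throw std::bad_alloc();

    void* p = block + offset;
    offset += n;
    return p;
}

int main() {
    // each thread that calls thread_local_alloc() gets pointers
    // inside its own block, far from other threads' data
    int* x = static_cast<int*>(thread_local_alloc(sizeof(int)));
    *x = 42;
}
```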