When allocating and deallocating randomly sized memory chunks with 4 or more threads using openmp\'s parallel for construct, the program seems to start
When linking the test program with google's tcmalloc library, the executable doesn't only run ~10% faster, but shows greatly reduced or insignificant memory fragmentation as well:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13441 byron 20 0 379m 334m 1220 R 187 8.4 0:02.63 ompmemtestgoogle
13441 byron 20 0 1085m 1.0g 1220 R 194 26.2 0:08.52 ompmemtestgoogle
13441 byron 20 0 1111m 1.0g 1220 R 195 26.9 0:14.42 ompmemtestgoogle
13441 byron 20 0 1131m 1.1g 1220 R 195 27.4 0:20.30 ompmemtestgoogle
13441 byron 20 0 1137m 1.1g 1220 R 195 27.6 0:26.19 ompmemtestgoogle
13441 byron 20 0 1137m 1.1g 1220 R 195 27.6 0:32.05 ompmemtestgoogle
13441 byron 20 0 1149m 1.1g 1220 R 191 27.9 0:37.81 ompmemtestgoogle
13441 byron 20 0 1149m 1.1g 1220 R 194 27.9 0:43.66 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 188 28.2 0:49.32 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 194 28.2 0:55.15 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 191 28.2 1:00.90 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 191 28.2 1:06.64 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1356 R 192 28.2 1:12.42 ompmemtestgoogle
From the data I have, the answer appears to be:
Multithreaded access to the heap can emphasize fragmentation if the employed heap library does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.
The tcmalloc library shows no significant memory fragmentation running the same program that previously caused ~400MB to be lost in fragmentation.
The best idea I have to offer here is some sort of locking artifact within the heap.
The test program will allocate randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory which is in a heap block on the 'left', it might actually be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the 'left' side to check for free memory as it is currently being changed. Hence it might end up using a new heap block unnecessarily from the 'right'.
This process could look like a heap-block-shifting, where the the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.
Lets restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual core system which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held short enough not to block the other thread who wants to allocate memory.
Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, nor am I anything more than novice in the field of memory allocators - all I wrote is just how it appears to me which makes it pure speculation.
Another interesting read might be the tcmalloc documentation, which states common problems with heaps and multi-threaded access, some of which may have played their role in the test program too.
Its worth noting that it will never return memory to the system (see Caveats paragraph in tcmalloc documentation)