Is it normal that the gcc atomic builtins are so slow?

Submitted by 拈花ヽ惹草 on 2019-12-04 03:45:55

The answer is that GCC optimizes your non-atomic increments away. When it sees a loop like:

for (int i=0; i<N; i++) x++;

it replaces it with:

x += N;

This can be seen in the generated assembly, which contains:

call    clock_gettime
leal    -32(%ebp), %edx
addl    $1000000, -40(%ebp)     <- increment by 1000000
adcl    $0, -36(%ebp)
movl    %edx, 4(%esp)
movl    $2, (%esp)
call    clock_gettime

So you are not measuring what you think you are.

You can make your variable volatile to prevent this optimization. On my computer, after doing this, non-atomic access is about 8 times as fast as atomic access. When using a 32-bit variable instead of 64-bit (I'm compiling as 32-bit), the difference drops to about a factor of 3.

I'm guessing that gcc is optimizing your non-atomic increment operation to something like

val += numIterations;

You say that 10^6 increments are taking 431 nanoseconds, which works out to 0.000431 ns per loop iteration. On a 4 GHz processor, a clock cycle is 0.25 ns, so it's pretty obvious the loop is being optimized away. This explains the big performance difference you're seeing.

Edit: You measured an atomic operation as taking 14 ns -- assuming a 4 GHz processor again, that works out to 56 cycles, which is pretty decent!

The slowness of any synchronization mechanism cannot be measured by a single thread. Single-process synchronization objects like POSIX mutexes or Windows critical sections only really cost time when they are contended.

You would have to introduce several threads, each doing other work that mirrors your real application, to get a realistic idea of how long the synchronized operations take.
