Is it normal that the gcc atomic builtins are so slow?

Submitted by 拈花ヽ惹草 on 2019-12-04 03:45:55

The answer is that GCC optimizes your non-atomic increments away. When it sees a loop like:

for (int i=0; i<N; i++) x++;

it replaces it with:

x += N;

This can be seen in the generated assembly, which contains:

call    clock_gettime
leal    -32(%ebp), %edx
addl    $1000000, -40(%ebp)     <- increment by 1000000
adcl    $0, -36(%ebp)
movl    %edx, 4(%esp)
movl    $2, (%esp)
call    clock_gettime

So you are not measuring what you think you are.

You can make your variable volatile to prevent this optimization. On my computer, after doing this, non-atomic access is about 8 times as fast as atomic access. When using a 32-bit variable instead of 64-bit (I'm compiling as 32-bit), the difference drops to about a factor of 3.

I'm guessing that gcc is optimizing your non-atomic increment operation to something like

val += numIterations;

You say that 10^6 increments are taking 431 nanoseconds, which works out to 0.000431 ns per loop iteration. On a 4 GHz processor, a clock cycle is 0.25 ns, so it's pretty obvious the loop is being optimized away. This explains the big performance difference you're seeing.

Edit: You measured an atomic operation as taking 14 ns -- assuming a 4 GHz processor again, that works out to 56 cycles, which is pretty decent!

The slowness of any synchronization mechanism cannot be measured by a single thread. Single-process synchronization objects like POSIX mutexes or Windows critical sections only really cost time when they are contended.

You would have to introduce several threads, each doing other work that mirrors your real application, to get a realistic idea of how long the synchronized operations take.
