Why are the atomics much slower than the lock in this uncontended case?

Anonymous (unverified), submitted 2019-12-03 08:59:04

Question:

I wrote something using atomics rather than locks and, perplexed at it being so much slower in my case, I wrote the following mini test:

#include <pthread.h>
#include <vector>

struct test
{
    test(size_t size) : index_(0), size_(size), vec2_(size)
    {
        vec_.reserve(size_);
        pthread_mutexattr_init(&attrs_);
        pthread_mutexattr_setpshared(&attrs_, PTHREAD_PROCESS_PRIVATE);
        pthread_mutexattr_settype(&attrs_, PTHREAD_MUTEX_ADAPTIVE_NP);

        pthread_mutex_init(&lock_, &attrs_);
    }

    void lockedPush(int i);
    void atomicPush(int* i);

    size_t              index_;
    size_t              size_;
    std::vector<int>    vec_;
    std::vector<int>    vec2_;
    pthread_mutexattr_t attrs_;
    pthread_mutex_t     lock_;
};

void test::lockedPush(int i)
{
    pthread_mutex_lock(&lock_);
    vec_.push_back(i);
    pthread_mutex_unlock(&lock_);
}

void test::atomicPush(int* i)
{
    int ii       = (int) (i - &vec2_.front());
    size_t index = __sync_fetch_and_add(&index_, 1);
    vec2_[index & (size_ - 1)] = ii;
}

int main(int argc, char** argv)
{
    const size_t N = 1048576;
    test t(N);

//     for (int i = 0; i < N; ++i)
//         t.lockedPush(i);

    for (int i = 0; i < N; ++i)
        t.atomicPush(&i);
}

If I uncomment the lockedPush loop (and comment out the atomicPush loop) and run the test with time(1), I get output like so:

real    0m0.027s
user    0m0.022s
sys     0m0.005s

and if I run the loop calling the atomic version (the seemingly unnecessary pointer argument is there because I want the function to look as much as possible like what my bigger code does), I get output like so:

real    0m0.046s
user    0m0.043s
sys     0m0.003s

I'm not sure why this is happening as I would have expected the atomic to be faster than the lock in this case...

When I compile with -O3, I see the lock and atomic timings as follows:

lock:
    real    0m0.024s
    user    0m0.022s
    sys     0m0.001s

atomic:
    real    0m0.013s
    user    0m0.011s
    sys     0m0.002s

In my larger app, though, the lock still performs better in single-threaded testing.

Answer 1:

An uncontended mutex is extremely fast to lock and unlock. With an atomic variable, you're always paying a certain memory synchronisation penalty (especially since you're not even using relaxed ordering).
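
To illustrate that last point, here is a minimal sketch with C++11 std::atomic, assuming a C++11 compiler; the global counter and the function names bumpSeqCst/bumpRelaxed are hypothetical stand-ins for the index_ member in the question:

#include <atomic>
#include <cstddef>

std::atomic<size_t> index_(0);

size_t bumpSeqCst()
{
    // Default, sequentially consistent fetch-add: a full barrier,
    // equivalent in strength to __sync_fetch_and_add.
    return index_.fetch_add(1);
}

size_t bumpRelaxed()
{
    // Relaxed fetch-add: the increment itself is still atomic, but
    // the compiler and CPU remain free to reorder surrounding loads
    // and stores around it.
    return index_.fetch_add(1, std::memory_order_relaxed);
}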

Your test case is simply too naive to be useful. You have to test a heavily contended data access scenario.
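
A minimal sketch of such a contended test, assuming pthreads; the thread count and iteration count here are arbitrary and should be tuned to your machine:

#include <pthread.h>
#include <cstdio>

const int kThreads        = 4;        // assumption: tune to your core count
const int kItersPerThread = 1000000;  // assumption: arbitrary workload size

size_t counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* hammer(void*)
{
    // All threads hammer the same counter, so every increment contends.
    // Swap in a pthread_mutex_lock/unlock pair here to compare the
    // mutex-based variant under the same load.
    for (int i = 0; i < kItersPerThread; ++i)
        __sync_fetch_and_add(&counter, 1);
    return 0;
}

int main()
{
    pthread_t threads[kThreads];
    for (int t = 0; t < kThreads; ++t)
        pthread_create(&threads[t], 0, hammer, 0);
    for (int t = 0; t < kThreads; ++t)
        pthread_join(threads[t], 0);
    printf("counter = %zu\n", counter);
    return 0;
}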

Generally, atomics are slow (they get in the way of clever internal reordering, pipelining, and caching), but they allow for lock-free code which ensures that the entire program can make some progress. By contrast, if you get swapped out while holding a lock, everyone has to wait.



Answer 2:

Just to add to the first answer: when you do a __sync_fetch_and_add, you actually enforce a specific ordering of memory operations. From the documentation:

A full memory barrier is created when this function is invoked

A memory barrier is an instruction that causes

a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction

Chances are that, even though your operation is atomic, you are losing compiler optimizations by forcing an ordering on the surrounding instructions.
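
One way to see this is to relax the barrier. GCC's newer __atomic builtins (available since GCC 4.7) take an explicit memory-order argument; a sketch of the question's increment in both styles, with hypothetical function names:

#include <cstddef>

size_t index_ = 0;

size_t pushIndexFullBarrier()
{
    // Atomic increment plus a full memory barrier: no memory
    // operation may be reordered across this call.
    return __sync_fetch_and_add(&index_, 1);
}

size_t pushIndexRelaxed()
{
    // The same atomic increment via __atomic_fetch_add with
    // __ATOMIC_RELAXED: no ordering is imposed on surrounding
    // loads and stores, so the compiler may optimize around it.
    return __atomic_fetch_add(&index_, 1, __ATOMIC_RELAXED);
}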


