C++11 on modern Intel: am I crazy or are non-atomic aligned 64-bit load/store actually atomic?


Question


Can I base a mission-critical application on the results of this test, that 100 threads reading a pointer set a billion times by a main thread never see a tear?

Any other potential problems doing this besides tearing?

Here's a stand-alone demo that compiles with g++ -g tear.cxx -o tear -pthread.

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

using namespace std;

void* pvTearTest;
atomic<int> iTears( 0 );

void TearTest( void ) {

  while (1) {
      void* pv = (void*) pvTearTest;

      intptr_t i = (intptr_t) pv;

      if ( ( i >> 32 ) != ( i & 0xFFFFFFFF ) ) {
          printf( "tear: pv = %p\n", pv );
          iTears++;
      }
      if ( ( i >> 32 ) == 999999999 )
          break;

  }
}



int main( int argc, char** argv ) {

  printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );

  vector<thread> athr;

  // Create lots of threads and have them do the test simultaneously.

  for ( int i = 0; i < 100; i++ )
      athr.emplace_back( TearTest );

  for ( int i = 0; i < 1000000000; i++ )
      pvTearTest = (void*) (intptr_t)
                   ( ( i % (1L<<32) ) * 0x100000001 );

  for ( auto& thr: athr )
      thr.join();

  if ( iTears )
      printf( "%d tears\n", iTears.load() );
  else
      printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}

The actual application is a malloc()'ed and sometimes realloc()'d array (the size is a power of two; realloc doubles the storage) that many child threads will absolutely be hammering in a mission-critical but also highly performance-critical way.

From time to time a thread will need to add a new entry to the array, and will do so by setting the next array entry to point to something, then incrementing an atomic<int> iCount. Finally it will add data to some other data structures that would cause other threads to attempt to dereference that cell.
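Sketched out, that publication pattern would look something like the following (the names here are hypothetical, and the release/acquire pairing shown is one way to get the ordering guarantee I worry about below):

#include <atomic>

struct Entry;                      // whatever the cells point to

Entry**          g_apEntries;      // the array being appended to (allocation omitted)
std::atomic<int> iCount( 0 );      // published element count

// Writer (assumed single): fill the cell first, then publish by bumping the count.
void Append( Entry* pNew ) {
    int i = iCount.load( std::memory_order_relaxed );
    g_apEntries[i] = pNew;                             // plain store into the new cell
    iCount.fetch_add( 1, std::memory_order_release );  // the cell store can't sink below this
}

// Reader: an acquire load of the count makes the writer's cell store visible.
Entry* Get( int i ) {
    if ( i < iCount.load( std::memory_order_acquire ) )
        return g_apEntries[i];                         // happens-after the release increment
    return nullptr;
}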

It all seems fine (except I'm not positive the increment of the count is assured of happening before the non-atomic updates that follow it)... except for one thing: realloc() will typically change the address of the array, and it then frees the old one, the pointer to which is still visible to other threads.

OK, so instead of realloc(), I malloc() a new array, manually copy the contents, and set the pointer to the new array. I would free the old array, but I realize other threads may still be accessing it: they read the array base; I free the base; a third thread allocates that memory and writes something else there; the first thread then adds the indexed offset to the base and expects a valid pointer. I'm happy to leak those, though. (Given the doubling growth, all the old arrays combined are about the same size as the current array, so the overhead is simply an extra 16 bytes per item, and it's memory that soon is never referenced again.)

So, here's the crux of the question: once I allocate the bigger array, can I write its base address with a non-atomic write, in utter safety? Or despite my billion-access test, do I actually have to make it atomic<> and thus slow all the worker threads that read it?
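For concreteness, the atomic<> version I'm trying to avoid would be something like this minimal sketch (hypothetical names; a single growing thread is assumed, and the old array is deliberately leaked as described above):

#include <atomic>
#include <cstring>

struct Entry;

std::atomic<Entry**> g_apBase;     // current array; readers load this
size_t               g_iCapacity;  // touched only by the single growing thread

// Grow: allocate double the storage, copy, then publish the new base.
// The old array is leaked on purpose: other threads may still hold its address.
void Grow() {
    Entry** apOld = g_apBase.load( std::memory_order_relaxed );
    Entry** apNew = new Entry*[ g_iCapacity * 2 ];
    memcpy( apNew, apOld, g_iCapacity * sizeof apOld[0] );
    g_iCapacity *= 2;
    g_apBase.store( apNew, std::memory_order_release );  // publish; never free apOld
}

// Readers: an acquire load, then index.  On x86 this load is an ordinary mov.
Entry* Read( int i ) {
    return g_apBase.load( std::memory_order_acquire )[i];
}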

(As this is surely environment dependent, we're talking 2012-or-later Intel, g++ 4 to 9, and Red Hat of 2012 or later.)

EDIT: here is a modified test program that matches my planned scenario much more closely, with only a small number of writes. I've also added a count of the reads. Switching from void* to atomic<void*>, I go from 2240 reads/sec to 660 reads/sec (with optimization disabled). The machine language for the read is shown after the source.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

using namespace std;

chrono::time_point<chrono::high_resolution_clock> tp1, tp2;

// void*: 1169.093u 0.027s 2:26.75 796.6% 0+0k 0+0io 0pf+0w
// atomic<void*>: 6656.864u 0.348s 13:56.18 796.1%        0+0k 0+0io 0pf+0w

// Different definitions of the target variable.
atomic<void*> pvTearTest;
//void* pvTearTest;

// Children sum the tears they find, and at end, total checks performed.
atomic<int> iTears( 0 );
atomic<uint64_t> iReads( 0 );

bool bEnd = false; // main thr sets true; children all finish.

void TearTest( void ) {

  uint64_t i;
  for ( i = 0; ! bEnd; i++ ) {

      intptr_t iTearTest = (intptr_t) (void*) pvTearTest;

      // Make sure top 4 and bottom 4 bytes are the same.  If not it's a tear.
      if ( ( iTearTest >> 32 ) != ( iTearTest & 0xFFFFFFFF ) ) {
          printf( "tear: pv = %lx\n", iTearTest );
          iTears++;
      }

      // Output periodically to prove we're seeing changing values.
      if ( ( (i+1) % 50000000 ) == 0 )
          printf( "got: pv = %lx\n", iTearTest );
  }

  iReads += i;
}



int main( int argc, char** argv ) {

  printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );

  vector<thread> athr;

  // Create lots of threads and have them do the test simultaneously.

  for ( int i = 0; i < 100; i++ )
      athr.emplace_back( TearTest );

  tp1 = chrono::high_resolution_clock::now();

#if 0
  // Change target as fast as possible for fixed number of updates.
  for ( int i = 0; i < 1000000000; i++ )
      pvTearTest = (void*) (intptr_t)
                   ( ( i % (1L<<32) ) * 0x100000001 );
#else
  // More like our actual app: change target only periodically, for fixed time.
  for ( int i = 0; i < 100; i++ ) {
      pvTearTest.store( (void*) (intptr_t) ( ( i % (1L<<32) ) * 0x100000001 ),
                        std::memory_order_release );

      this_thread::sleep_for(10ms);
  }
#endif

  bEnd = true;

  for ( auto& thr: athr )
      thr.join();

  tp2 = chrono::high_resolution_clock::now();

  chrono::duration<double> dur = tp2 - tp1;
  printf( "%lu reads in %.4f secs: %.2f reads/usec\n",
          iReads.load(), dur.count(), iReads.load() / dur.count() / 1000000 );

  if ( iTears )
      printf( "%d tears\n", iTears.load() );
  else
      printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}

Dump of assembler code for function TearTest():
   0x0000000000401256 <+0>:     push   %rbp
   0x0000000000401257 <+1>:     mov    %rsp,%rbp
   0x000000000040125a <+4>:     sub    $0x10,%rsp
   0x000000000040125e <+8>:     movq   $0x0,-0x8(%rbp)
   0x0000000000401266 <+16>:    movzbl 0x6e83(%rip),%eax        # 0x4080f0 <bEnd>
   0x000000000040126d <+23>:    test   %al,%al
   0x000000000040126f <+25>:    jne    0x40130c <TearTest()+182>
=> 0x0000000000401275 <+31>:    mov    $0x4080d8,%edi
   0x000000000040127a <+36>:    callq  0x40193a <std::atomic<void*>::operator void*() const>
   0x000000000040127f <+41>:    mov    %rax,-0x10(%rbp)
   0x0000000000401283 <+45>:    mov    -0x10(%rbp),%rax
   0x0000000000401287 <+49>:    sar    $0x20,%rax
   0x000000000040128b <+53>:    mov    -0x10(%rbp),%rdx
   0x000000000040128f <+57>:    mov    %edx,%edx
   0x0000000000401291 <+59>:    cmp    %rdx,%rax
   0x0000000000401294 <+62>:    je     0x4012bb <TearTest()+101>
   0x0000000000401296 <+64>:    mov    -0x10(%rbp),%rax
   0x000000000040129a <+68>:    mov    %rax,%rsi
   0x000000000040129d <+71>:    mov    $0x40401a,%edi
   0x00000000004012a2 <+76>:    mov    $0x0,%eax
   0x00000000004012a7 <+81>:    callq  0x401040 <printf@plt>
   0x00000000004012ac <+86>:    mov    $0x0,%esi
   0x00000000004012b1 <+91>:    mov    $0x4080e0,%edi
   0x00000000004012b6 <+96>:    callq  0x401954 <std::__atomic_base<int>::operator++(int)>
   0x00000000004012bb <+101>:   mov    -0x8(%rbp),%rax
   0x00000000004012bf <+105>:   lea    0x1(%rax),%rcx
   0x00000000004012c3 <+109>:   movabs $0xabcc77118461cefd,%rdx
   0x00000000004012cd <+119>:   mov    %rcx,%rax
   0x00000000004012d0 <+122>:   mul    %rdx
   0x00000000004012d3 <+125>:   mov    %rdx,%rax
   0x00000000004012d6 <+128>:   shr    $0x19,%rax
   0x00000000004012da <+132>:   imul   $0x2faf080,%rax,%rax
   0x00000000004012e1 <+139>:   sub    %rax,%rcx
   0x00000000004012e4 <+142>:   mov    %rcx,%rax
   0x00000000004012e7 <+145>:   test   %rax,%rax
   0x00000000004012ea <+148>:   jne    0x401302 <TearTest()+172>
   0x00000000004012ec <+150>:   mov    -0x10(%rbp),%rax
   0x00000000004012f0 <+154>:   mov    %rax,%rsi
   0x00000000004012f3 <+157>:   mov    $0x40402a,%edi
   0x00000000004012f8 <+162>:   mov    $0x0,%eax
   0x00000000004012fd <+167>:   callq  0x401040 <printf@plt>
   0x0000000000401302 <+172>:   addq   $0x1,-0x8(%rbp)
   0x0000000000401307 <+177>:   jmpq   0x401266 <TearTest()+16>
   0x000000000040130c <+182>:   mov    -0x8(%rbp),%rax
   0x0000000000401310 <+186>:   mov    %rax,%rsi
   0x0000000000401313 <+189>:   mov    $0x4080e8,%edi
   0x0000000000401318 <+194>:   callq  0x401984 <std::__atomic_base<unsigned long>::operator+=(unsigned long)>
   0x000000000040131d <+199>:   nop
   0x000000000040131e <+200>:   leaveq
   0x000000000040131f <+201>:   retq

Answer 1:


Yes, on x86 aligned loads are atomic, BUT this is an architectural detail that you should NOT rely on!

Since you are writing C++ code, you have to abide by the rules of the C++ standard, i.e., you have to use atomics instead of volatile. The fact that volatile has been part of the language since long before the introduction of threads in C++11 should be a strong enough indication that volatile was never designed or intended for multi-threading. It is important to note that in C++ volatile is something fundamentally different from volatile in languages like Java or C# (in those languages volatile is in fact related to the memory model and therefore much more like an atomic in C++).

In C++, volatile is used for what is often referred to as "unusual memory". This is typically memory that can be read or modified outside the current process, for example when using memory-mapped I/O. volatile forces the compiler to execute all operations in exactly the order specified. This prevents some optimizations that would be perfectly legal for atomics, while also allowing some optimizations that are actually illegal for atomics. For example:

volatile int x;
         int y;
volatile int z;

x = 1;
y = 2;
z = 3;
z = 4;

...

int a = x;
int b = x;
int c = y;
int d = z;

In this example, there are two assignments to z and two read operations on x. If x and z were atomics instead of volatile, the compiler would be free to treat the first store as irrelevant and simply remove it. Likewise, it could just reuse the value returned by the first load of x, effectively generating code like int b = a. But since x and z are volatile, these optimizations are not possible. Instead, the compiler has to ensure that all volatile operations are executed in exactly the order specified, i.e., the volatile operations cannot be reordered with respect to each other. However, this does not prevent the compiler from reordering non-volatile operations. For example, the operations on y could freely be moved up or down, something that would not be possible if x and z were atomics. So if you were to try implementing a lock based on a volatile variable, the compiler could simply (and legally) move some code outside your critical section, as the sketch below shows.
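To make that last point concrete, here is a minimal sketch contrasting a (broken) lock built on a volatile flag with one built on std::atomic_flag; the names and the single shared counter are purely illustrative:

#include <atomic>

volatile bool    vlock = false;                // BROKEN: no atomicity, no ordering
std::atomic_flag alock = ATOMIC_FLAG_INIT;

int iShared = 0;

void BrokenIncrement() {
    while ( vlock ) { }   // test...
    vlock = true;         // ...then set: not atomic, so two threads can both get in
    iShared++;            // non-volatile, so it may legally move past the volatile ops
    vlock = false;
}

void CorrectIncrement() {
    while ( alock.test_and_set( std::memory_order_acquire ) ) { }
    iShared++;            // acquire/release keep this inside the critical section
    alock.clear( std::memory_order_release );
}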

Last but not least it should be noted that marking a variable as volatile does not prevent it from participating in a data race. In those rare cases where you have some "unusual memory" (and therefore really require volatile) that is also accessed by multiple threads, you have to use volatile atomics.
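A hypothetical example of such a combination: a memory-mapped status register that several threads also poll would need both qualifiers, volatile because the hardware can change it behind the program's back, atomic because multiple threads access it concurrently:

#include <atomic>
#include <cstdint>

volatile std::atomic<uint32_t> deviceStatus;   // hypothetical memory-mapped register

void WaitForDevice() {
    while ( deviceStatus.load( std::memory_order_acquire ) == 0 ) { }
}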

Since aligned loads are actually atomic on x86, the compiler will translate an atomic.load() call to a simple mov instruction, so an atomic load is no slower than reading a volatile variable. An atomic.store() is actually slower than writing a volatile variable, but for good reason: in contrast to the volatile write, it is by default sequentially consistent. You can relax the memory orders, but you really have to know what you are doing!
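For illustration, the kind of code gcc typically emits on x86-64 for the three cases (the mnemonics in the comments are typical output, not a guarantee):

#include <atomic>

std::atomic<void*> p;

void* LoadIt()                { return p.load(); }                               // mov rax,[p]  - same cost as a plain load
void  StoreSeqCst( void* x )  { p.store( x ); }                                  // xchg [p],rdi - implicitly locked, expensive
void  StoreRelease( void* x ) { p.store( x, std::memory_order_release ); }       // mov [p],rdi  - as cheap as a plain store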

If you want to learn more about the C++ memory model, I can recommend this paper: Memory Models for C/C++ Programmers



Source: https://stackoverflow.com/questions/61339630/c11-on-modern-intel-am-i-crazy-or-are-non-atomic-aligned-64-bit-load-store-ac
