x86 equivalent for LWARX and STWCX

天命终不由人 2020-12-17 22:07

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be

6 Answers
  • 2020-12-17 22:15

    As Michael mentioned, what you're probably looking for is the cmpxchg instruction.

    It's important to point out, though, that the PPC method of accomplishing this is known as Load-Link/Store-Conditional (LL/SC), while the x86 architecture uses Compare-And-Swap (CAS). LL/SC has stronger semantics than CAS: any write to the conditioned address between the load and the store causes the store to fail, even if that write puts back the same value the load observed. CAS only compares values, so it would succeed in this case. This is known as the ABA problem.
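    To make the ABA case concrete, here is a minimal C11 sketch (not from the original answer). The function name is made up for illustration, and the two "threads" are simulated in a single thread so the interleaving is deterministic.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true if a CAS based on a stale snapshot still succeeds after
   the value went A -> B -> A. Hypothetical helper, for illustration only. */
bool cas_misses_aba(void)
{
    atomic_int shared = 1;                      /* value A */
    int snapshot = atomic_load(&shared);        /* "thread 1" reads A */

    atomic_store(&shared, 2);                   /* "thread 2" writes B ... */
    atomic_store(&shared, 1);                   /* ... and then A again */

    /* CAS only compares values, so it cannot tell that anything happened. */
    return atomic_compare_exchange_strong(&shared, &snapshot, 3);
}
```

    An LL/SC pair would have failed the store here, because the reservation is lost on the intervening writes.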

    If you need the stronger semantics on the x86 architecture, you can approximate them with x86's double-width compare-and-swap (DWCAS) instruction: cmpxchg8b, or cmpxchg16b on x86_64. This lets you atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is that one of the two words holds the value of interest, and the other holds an always-incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter wrapping between attempts is so low that it's a reasonable substitute for most purposes.
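    A rough C11 sketch of the value-plus-mutation-count idea follows. It is an illustration under assumptions, not the answer's own code: the value and the counter are 32 bits each and share one 64-bit word (the shape cmpxchg8b handles on 32-bit x86), and all names here are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low 32 bits = value, high 32 bits = mutation count. */
typedef _Atomic uint64_t versioned_t;

static uint64_t make_word(uint32_t value, uint32_t count)
{
    return ((uint64_t)count << 32) | value;
}

uint32_t versioned_value(uint64_t word) { return (uint32_t)word; }
uint32_t versioned_count(uint64_t word) { return (uint32_t)(word >> 32); }

/* Unconditional update: always bumps the count, retrying on contention. */
void versioned_store(versioned_t *v, uint32_t value)
{
    uint64_t old = atomic_load(v);
    while (!atomic_compare_exchange_weak(
               v, &old, make_word(value, versioned_count(old) + 1))) {
        /* 'old' was refreshed by the failed CAS; just retry. */
    }
}

/* Conditional update: succeeds only if neither the value nor the count
   changed since 'snapshot' was taken. */
bool versioned_cas(versioned_t *v, uint64_t snapshot, uint32_t value)
{
    return atomic_compare_exchange_strong(
        v, &snapshot, make_word(value, versioned_count(snapshot) + 1));
}
```

    With this layout an A -> B -> A sequence leaves the value unchanged but the count two higher, so a CAS against the old snapshot fails. The counter can still wrap after 2^32 updates, which is the residual risk mentioned above.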

  • 2020-12-17 22:16

    You're probably looking for the cmpxchg family of instructions.

    You'll need to precede these with the lock prefix to get equivalent behaviour.

    Have a look here for a quick overview of what's available.

    You'll likely end up with something similar to this:

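    mov ecx,dword ptr [esp+4]           ; ecx = destination address
    mov edx,dword ptr [esp+8]           ; edx = new value to store
    mov eax,dword ptr [esp+12]          ; eax = expected (comparand) value
    lock cmpxchg dword ptr [ecx],edx    ; if [ecx] == eax then [ecx] = edx, else eax = [ecx]
    ret 12                              ; pop the three arguments; eax holds the old value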
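    For comparison, roughly the same operation can be written in portable C11. This is only a sketch: the function name and argument order are assumptions chosen to mirror the assembly above, and on x86 compilers typically lower the call to lock cmpxchg.

```c
#include <stdatomic.h>

/* Hypothetical helper mirroring the assembly: if *dest == comparand,
   store exchange; either way return the value *dest held beforehand. */
int compare_exchange(atomic_int *dest, int exchange, int comparand)
{
    int expected = comparand;
    /* On failure, 'expected' is overwritten with the current value of *dest;
       on success it still equals the old value. */
    atomic_compare_exchange_strong(dest, &expected, exchange);
    return expected;
}
```

    The caller detects success by comparing the return value with the comparand it passed in.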
    

    You should read this paper...

    Edit

    In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

  • 2020-12-17 22:16

    I don't know whether LWARX and STWCX invalidate the whole cache line, but CAS and DCAS do. That means that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer), you won't see much improvement when your software is under real stress. The best results I've seen so far were when people consciously sacrificed 64 bytes, planned their structures around it (packing in data that won't be subject to contention), kept everything aligned on 64-byte boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approximately 20 to 100 cycles, making it a bigger real performance issue than just lock avoidance.
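    One way to get the one-pointer-per-cache-line layout in C11 is sketched below. The 64-byte line size comes from the figure quoted above and is an assumption about the target CPU; the type and array names are made up for illustration.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; typical for x86, not guaranteed */

/* Each contended pointer gets a cache line to itself: the alignment
   forces the struct size up to a full line. */
typedef struct {
    alignas(CACHE_LINE) _Atomic(void *) ptr;
} padded_slot;

/* Adjacent array elements now sit on different cache lines. */
padded_slot slots[4];
```

    The unused bytes in each slot are where uncontended data can be packed, as described above.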

    Also, you'd have to plan a different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing", one request "leaks" and then releases all its memory in bulk at the end) or detailed allocation management, so that a structure under contention never receives memory released by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive, but it's either that or paying the price for GC.

  • 2020-12-17 22:25

    What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).

    The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.
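    As a portable counterpart to the InterlockedIncrement suggestion, a C11 sketch might look like this. The helper name is hypothetical; on x86, compilers typically implement the fetch-and-add with lock xadd.

```c
#include <stdatomic.h>

/* Hypothetical helper: atomically increment *count and return the new
   value, matching what InterlockedIncrement returns. */
int increment_and_get(atomic_int *count)
{
    /* atomic_fetch_add returns the value held before the addition. */
    return atomic_fetch_add(count, 1) + 1;
}
```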

  • 2020-12-17 22:27

    x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
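    To illustrate the implicitly locked XCHG, here is a small test-and-set lock in C11. It is a sketch with hypothetical names, and it assumes the compiler lowers atomic_exchange to xchg on x86, which is the usual choice but not something the C standard promises.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set lock: 0 = free, 1 = held. */
typedef atomic_int tas_lock;

bool tas_trylock(tas_lock *lock)
{
    /* Swap in 1; we own the lock only if the previous value was 0. */
    return atomic_exchange(lock, 1) == 0;
}

void tas_unlock(tas_lock *lock)
{
    atomic_store(lock, 0);
}
```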

  • 2020-12-17 22:34

    If you are on 64 bits and limit yourself to, say, 1 TB of heap, you can pack the counter into the 24 unused top bits. If your pointers are aligned, the low bits are available too: 3 bits with 8-byte alignment, 5 bits with 32-byte alignment.

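    // Assumes the low 40 bits hold the pointer and the top 24 bits the counter.
    // unpack() and cas() are helpers this answer takes for granted, not library calls.
    int* IncrementAndRetrieve(uintptr_t *ptr)
    {
      uintptr_t val;
      int *unpacked;
      do
      {
        val = *ptr;                 // snapshot the packed word
        unpacked = unpack(val);     // strip the counter bits

        if(unpacked == NULL)
          return NULL;
        // pointer is on the bottom, so bump the counter in the top 24 bits
      } while(!cas(ptr, val, val + ((uintptr_t)1 << 40)));
      return unpacked;
    }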
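    A self-contained C11 sketch of this packing scheme is given below, with the helpers spelled out. Everything beyond the 40/24-bit split is an assumption: the names are hypothetical, uintptr_t must be 64 bits wide, and every pointer stored must fit in 40 bits, which ordinary user-space addresses on many 64-bit systems do not, so a custom heap is needed.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define PTR_BITS 40                                   /* 1 TB of address space */
#define PTR_MASK (((uintptr_t)1 << PTR_BITS) - 1)
#define COUNT_ONE ((uintptr_t)1 << PTR_BITS)          /* +1 in the top 24 bits */

typedef _Atomic uintptr_t packed_ptr;

/* Drop the counter bits and recover the pointer. */
int *unpack(uintptr_t word) { return (int *)(word & PTR_MASK); }

/* Bump the counter packed above the pointer and return the pointer,
   or NULL if the slot is empty. */
int *increment_and_retrieve(packed_ptr *slot)
{
    uintptr_t val = atomic_load(slot);
    int *unpacked;
    do {
        unpacked = unpack(val);
        if (unpacked == NULL)
            return NULL;
        /* A failed CAS reloads 'val', so the loop re-unpacks it. */
    } while (!atomic_compare_exchange_weak(slot, &val, val + COUNT_ONE));
    return unpacked;
}
```

    Because the counter sits in the top bits, overflow simply wraps it without touching the pointer bits.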
    