x86 MESI invalidate cache line latency issue


Question


I have the following processes. I'm trying to make ProcessB very low latency, so I use a tight loop all the time and isolate CPU core 2.

Global variables in shared memory:

int bDOIT ;
typedef struct XYZ_ {
    int field1 ;
    int field2 ;
    .....
    int field20;
}  XYZ;
XYZ glbXYZ ; 

static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
} 

ProcessA (on core 1)

while(1){
    nonblocking_recv(fd,&iret);
    if( errno == EAGAIN)
        continue ; 
    if( iret == 1 )
        bDOIT = 1 ;
    else
        bDOIT = 0 ;
 } // while

ProcessB (on core 2)

while(1){
    escape(&bDOIT) ;
    if( bDOIT ){
        memcpy(&localxyz, &glbXYZ, sizeof glbXYZ); // ignore locking issues
        doSomething(localxyz);
    }
} //while 

ProcessC (on core 3)

while(1){
     usleep(1000) ;
     glbXYZ.field1 = xx ;
     glbXYZ.field2 = xxx ;
     ....
     glbXYZ.field20 = xxxx ;  
} //while

In these simplified pseudo-code processes, when ProcessA sets bDOIT to 1, it invalidates the cache line in core 2; then, once ProcessB sees bDOIT=1, ProcessB does the memcpy(&localxyz, &glbXYZ, sizeof glbXYZ).

Since ProcessC invalidates glbXYZ in core 2 every 1000 usec, I guess this will affect the latency when ProcessB tries to do the memcpy, because by the time ProcessB sees bDOIT become 1, glbXYZ has already been invalidated by ProcessC.

The new value of glbXYZ is still in core 3's L1$ or L2$, so only after ProcessB actually reads bDOIT=1 does core 2 discover that its copy of glbXYZ is invalid and request the new value; ProcessB's latency suffers from waiting for the new value of glbXYZ.

My Question:

If I have a ProcessD (on core 4) which does:

while(1){
    usleep(10);
    memcpy(&nouseXYZ, &glbXYZ, sizeof glbXYZ);
 } //while 

Will this ProcessD cause glbXYZ to be flushed to L3$ earlier, so that when ProcessB on core 2 finds its copy of glbXYZ invalidated and asks for the new value, it gets it sooner? In other words, will ProcessD help ProcessB get glbXYZ earlier, since ProcessD keeps pulling glbXYZ into L3$ all the time?


Answer 1:


Interesting idea, yeah that should probably get the cache line holding your struct into a state in L3 cache where core#2 can get an L3 hit directly, instead of having to wait for a MESI read request while the line is still in M state in the L1d of core#3 (where ProcessC wrote it).

Or if ProcessD is running on the other logical core of the same physical core as ProcessB, data will be fetched into the right L1d. If it spends most of its time asleep (and wakes up infrequently), ProcessB will still usually have the whole CPU to itself, running in single-thread mode without partitioning the ROB and store buffer.
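If you go the logical-sibling route, you'd pin ProcessD explicitly. Below is a minimal sketch using Linux sched_setaffinity; the core number 6 is purely an assumption about the machine's topology (check lscpu or /proc/cpuinfo for the actual HT sibling of core 2).

// Sketch: pin the calling process to the HT sibling of ProcessB's core.
// The CPU number is an assumption about this machine's topology.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {   // pid 0 = calling process
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

// e.g. near the top of ProcessD's main():  pin_to_cpu(6);  // hypothetical sibling of core 2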

Instead of having the dummy-access thread spinning on usleep(10), you could have it wait on a condition variable or a semaphore that ProcessC pokes after writing glbXYZ.

With a counting semaphore (like POSIX C semaphores sem_wait/sem_post), the thread that writes glbXYZ can increment the semaphore, triggering the OS to wake up ProcessD which is blocked in sem_wait(). If for some reason ProcessD misses its turn to wake up, it will do 2 iterations before it blocks again, but that's fine. (Hmm, so actually we don't need a counting semaphore, but I think we do want OS-assisted sleep/wake and this is an easy way to get it, unless we need to avoid the overhead of a system call in ProcessC after writing the struct.) Or ProcessC could send a signal (with kill(); raise() would only signal the caller itself) to trigger wakeup of ProcessD.
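The writer side of that hand-off might look like the following sketch. The semaphore name prefetch_sem and its placement in shared memory are my assumptions (it would have to be initialized with sem_init(&prefetch_sem, /*pshared=*/1, 0) in the shared mapping); ProcessD below would then sem_wait on it.

// Sketch: ProcessC writes the struct, then posts a process-shared semaphore
// so ProcessD wakes up and re-fetches the freshly dirtied line.
#include <semaphore.h>
#include <unistd.h>

typedef struct { int field1, field2, /* ..., */ field20; } XYZ;  // as in the question

extern XYZ glbXYZ;           // in shared memory
extern sem_t prefetch_sem;   // assumed name; in shared memory, pshared = 1

void ProcessC(void) {
    while (1) {
        usleep(1000);
        glbXYZ.field1  = 1;            // placeholder values, as in the question
        /* ... */
        glbXYZ.field20 = 20;
        sem_post(&prefetch_sem);       // wake ProcessD: one syscall per update
    }
}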

With Spectre+Meltdown mitigation, any system call, even an efficient one like Linux futex, is fairly expensive for the thread making it. This cost isn't part of the critical path that you're trying to shorten, though, and it's still much less than the 10 usec sleep you were thinking of between fetches.

void ProcessD(void) {
    while(1){
        sem_wait(something);          // allows one iteration to run per sem_post
        __builtin_prefetch (&glbXYZ, 0, 1);  // PREFETCHT2 into L2 and L3 cache
    }
}

(According to Intel's optimization manual section 7.3.2, PREFETCHT2 on current CPUs is identical to PREFETCHT1 and fetches into L2 cache (and L3 along the way). I didn't check AMD; see also What level of the cache does PREFETCHT2 fetch into?.)

I haven't tested that PREFETCHT2 will actually be useful here on Intel or AMD CPUs. You might want to use a dummy volatile access like *(volatile char*)&glbXYZ; or *(volatile int*)&glbXYZ.field1. Especially if you have ProcessD running on the same physical core as ProcessB.

If PREFETCHT2 works, you could maybe do that in the thread that writes bDOIT (ProcessA), so it could trigger the migration of the line to L3 right before ProcessB will need it.
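That would only be one extra line in ProcessA's loop; here's a sketch of what it could look like (whether the hint actually causes the dirty line to migrate is something to verify with perf counters, and nonblocking_recv is the question's pseudo-call):

// Sketch: ProcessA's loop with a PREFETCHT2-style hint added after setting the flag.
#include <errno.h>

typedef struct { int field1, field2, /* ..., */ field20; } XYZ;  // as in the question
extern int bDOIT;
extern XYZ glbXYZ;
extern void nonblocking_recv(int fd, int *iret);   // placeholder from the question

void ProcessA(int fd) {
    int iret;
    while (1) {
        nonblocking_recv(fd, &iret);
        if (errno == EAGAIN)
            continue;
        if (iret == 1) {
            bDOIT = 1;
            // hint: start pulling glbXYZ's line toward L2/L3 before ProcessB asks for it
            __builtin_prefetch(&glbXYZ, 0, 1);   // rw=0 (read), locality=1 => PREFETCHT2
        } else {
            bDOIT = 0;
        }
    }
}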

If you're finding that the line gets evicted before use, maybe you do want a thread spinning on fetching that cache line.

On future Intel CPUs, there's a cldemote instruction (_cldemote(const void*)) which you could use after writing to trigger migration of the dirty cache line to L3. It runs as a NOP on CPUs that don't support it, but it's only slated for Tremont (Atom) so far. (Along with umonitor/umwait to wake up when another core writes in a monitored range from user-space, which would probably also be super-useful for low latency inter-core stuff.)
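On hardware that has it, using the intrinsic could look something like this sketch (the helper name publish_xyz is made up; _cldemote needs -mcldemote with recent gcc/clang):

// Sketch: after writing the struct, hint the dirty line(s) down toward L3 so a
// reader on another core gets an L3 hit.  CLDEMOTE runs as a NOP on CPUs
// without support for it.
#include <immintrin.h>   // _cldemote; compile with -mcldemote

typedef struct { int field1, field2, /* ..., */ field20; } XYZ;  // 80 bytes

void publish_xyz(XYZ *p) {         // hypothetical helper, not from the answer
    /* ... write p->field1 .. p->field20 here ... */
    _cldemote(p);                  // demote the first 64-byte line
    _cldemote((char *)p + 64);     // and the second line the struct spills into (if 64-byte aligned)
}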


Since ProcessA doesn't write the struct, you should probably make sure bDOIT is in a different cache line than the struct. You might put alignas(64) on the first member of XYZ so the struct starts at the start of a cache line. alignas(64) atomic<int> bDOIT; would make sure it was also at the start of a line, so they can't share a cache line. Or make it an alignas(64) atomic<bool> or atomic_flag.

Also see Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size (footnote 1): normally 128 is what you want to avoid false sharing because of adjacent-line prefetchers, but it's actually not a bad thing if ProcessB triggers the L2 adjacent-line prefetcher on core#2 to speculatively pull glbXYZ into its L2 cache when it's spinning on bDOIT. So you might want to group those into a 128-byte aligned struct if you're on an Intel CPU.
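A sketch of what that grouping could look like (the wrapper struct name is invented, and whether the adjacent-line prefetcher actually helps here is something you'd have to measure):

// Sketch: keep the flag and the payload in one 128-byte-aligned pair of lines,
// so spinning on bDOIT may cause core 2's L2 spatial prefetcher to pull in the
// start of glbXYZ too.  (An 80-byte XYZ still spills past the 128-byte pair.)
#include <stdalign.h>
#include <stdatomic.h>

typedef struct { int field1, field2, /* ..., */ field20; } XYZ;

struct shared_block {                   // hypothetical name
    alignas(128) atomic_int bDOIT;      // line 0 of the 128-byte pair
    char pad[64 - sizeof(atomic_int)];  // pad out the rest of line 0
    XYZ glbXYZ;                         // starts on line 1, the adjacent line
};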

And/or you might even use a software prefetch if bDOIT is false, in ProcessB. A prefetch won't block waiting for the data, but if the read request arrives in the middle of ProcessC writing glbXYZ then it will make that write take longer. So maybe only SW prefetch every 16th or 64th time bDOIT is false?


And don't forget to use _mm_pause() in your spin loop, to avoid a memory-order mis-speculation pipeline nuke when the branch you're spinning on goes the other way. (Normally this is a loop-exit branch in a spin-wait loop, but that's irrelevant. Your branching logic is equivalent to outer infinite loop containing a spin-wait loop and then some work, even though that's not how you've written it.)

Or possibly use lock cmpxchg instead of a pure load to read the old value. Full barriers already block speculative loads after the barrier, so prevent mis-speculation. (You can do this in C11 with atomic_compare_exchange_weak with expected = desired. It takes expected by reference, and updates it if the compare fails.) But hammering on the cache line with lock cmpxchg is probably not helpful to ProcessA being able to commit its store to L1d quickly.

Check the machine_clears.memory_ordering perf counter to see if this is happening without _mm_pause. If it is, then try _mm_pause first, and then maybe try using atomic_compare_exchange_weak as a load. Or atomic_fetch_add(&bDOIT, 0), because lock xadd would be equivalent.
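Here are sketches of both read-with-RMW variants mentioned above, assuming bDOIT has already been made _Atomic; whether either helps at all is exactly what the machine_clears.memory_ordering check should decide.

// Sketch: read bDOIT via a locked RMW (full barrier) instead of a plain load.
// Both keep hammering the line, which may slow down ProcessA's stores, so measure first.
#include <stdatomic.h>

extern _Atomic int bDOIT;    // assumes the flag is already atomic

static int read_bDOIT_cmpxchg(void) {
    int expected = 1;                                // expected == desired
    // lock cmpxchg: on success it rewrites the same value; on failure,
    // expected is updated to the current value.  Either way we end up with the old value.
    atomic_compare_exchange_weak(&bDOIT, &expected, 1);
    return expected;
}

static int read_bDOIT_xadd(void) {
    return atomic_fetch_add(&bDOIT, 0);              // lock xadd with 0: pure read + full barrier
}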


// GNU C11.  The typedef in your question looks like C, redundant in C++, so I assumed C.

#include <immintrin.h>
#include <stdatomic.h>
#include <stdalign.h>

alignas(64) atomic_bool bDOIT;
typedef struct { int a,b,c,d;       // 16 bytes
                 int e,f,g,h;       // another 16
} XYZ;
alignas(64) XYZ glbXYZ;

extern void doSomething(XYZ);

// just one object (of arbitrary type) that might be modified
// maybe cheaper than a "memory" clobber (compile-time memory barrier)
#define MAYBE_MODIFIED(x) asm volatile("": "+g"(x))

// suggested ProcessB
void ProcessB(void) {
    int prefetch_counter = 32;  // local that doesn't escape
    while(1){
        if (atomic_load_explicit(&bDOIT, memory_order_acquire)){
            MAYBE_MODIFIED(glbXYZ);
            XYZ localxyz = glbXYZ;    // or maybe a seqlock_read
  //        MAYBE_MODIFIED(glbXYZ);  // worse code from clang, but still good with gcc, unlike a "memory" clobber which can make gcc store localxyz separately from writing it to the stack as a function arg

  //          asm("":::"memory");   // make sure it finishes reading glbXYZ instead of optimizing away the copy and doing it during doSomething
            // localxyz hasn't escaped the function, so it shouldn't be spilled because of the memory barrier
            // but if it's too big to be passed in RDI+RSI, code-gen is in practice worse
            doSomething(localxyz);
        } else {

            if (0 == --prefetch_counter) {
                // not too often: don't want to slow down writes
                __builtin_prefetch(&glbXYZ, 0, 3);  // PREFETCHT0 into L1d cache
                prefetch_counter = 32;
            }

            _mm_pause();       // avoids memory order mis-speculation on bDOIT
                               // probably worth it for latency and throughput
                               // even though it pauses for ~100 cycles on Skylake and newer, up from ~5 on earlier Intel.
        }

    }
}

This compiles on Godbolt to pretty nice asm. If bDOIT stays true, it's a tight loop with no overhead around the call. clang7.0 even uses SSE loads/stores to copy the struct to the stack as a function arg 16 bytes at a time.


Obviously the code in the question is a mess of data-race undefined behaviour, which you should fix with _Atomic (C11) or std::atomic (C++11), using memory_order_relaxed or mo_release / mo_acquire. You don't have any memory barrier in the function that writes bDOIT, so the compiler could sink that store out of the loop. Making it atomic with memory_order_relaxed has literally zero downside for the quality of the asm.

Presumably you're using a SeqLock or something to protect glbXYZ from tearing. Yes, asm("":::"memory") should make that work by forcing the compiler to assume it's been modified asynchronously. The "g"(glbXYZ) input to the asm statement is useless, though. It's global, so the "memory" barrier already applies to it (because the asm statement could already reference it). If you wanted to tell the compiler that just it could have changed, use asm volatile("" : "+g"(glbXYZ)); without a "memory" clobber.

Or in C (not C++), just make it volatile and do struct assignment, letting the compiler pick how to copy it, without using barriers. In C++, foo x = y; fails for volatile foo y; where foo is an aggregate type like a struct (see volatile struct = struct not possible, why?). This is annoying when you want to use volatile to tell the compiler that data may change asynchronously as part of implementing a SeqLock in C++, but you still want to let the compiler copy it as efficiently as possible in arbitrary order, not one narrow member at a time.
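If you do go the SeqLock route, a minimal single-writer sketch in C11 might look like the following; the names are mine, the payload copy is formally a data race that a production version would handle with volatile or per-word relaxed atomics, and the writer relies on x86 not reordering stores with other stores (a fully portable version needs a stronger store-store guarantee).

// Minimal single-writer SeqLock sketch (assumed names, not from the answer).
#include <stdatomic.h>

typedef struct { int field1, field2, /* ..., */ field20; } XYZ;

static _Atomic unsigned seq;       // even = stable, odd = write in progress
static XYZ shared_payload;         // the data being published

void seqlock_write(const XYZ *src) {               // single writer only
    unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);   // mark "dirty"
    atomic_thread_fence(memory_order_release);     // x86: plain stores stay ordered; see lead-in
    shared_payload = *src;
    atomic_store_explicit(&seq, s + 2, memory_order_release);   // publish, even again
}

XYZ seqlock_read(void) {
    XYZ copy;
    unsigned s0, s1;
    do {
        s0 = atomic_load_explicit(&seq, memory_order_acquire);
        copy = shared_payload;                     // may tear; caught by the recheck below
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);                // retry if a write overlapped the copy
    return copy;
}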


Footnote 1: C++17 specifies std::hardware_destructive_interference_size as an alternative to hard-coding 64 or making your own CLSIZE constant, but gcc and clang don't implement it yet because it becomes part of the ABI if used in an alignas() in a struct, and thus can't actually change depending on actual L1d line size.



Source: https://stackoverflow.com/questions/54209039/x86-mesi-invalidate-cache-line-latency-issue
