CPU cache critical stride test giving unexpected results based on access type

Backend | Open | 3 answers | 1341 views

野趣味 2021-02-02 15:40

Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I'd spend some time to learn more about CPU caching a

3 Answers
  •  青春惊慌失措
    2021-02-02 16:19

    First of all, there's a small clarification to be made - in most cases, a write still requires you to fetch the line into the local cache, since lines are usually 64 bytes and your write might modify only part of one - the merge is done in the cache. Even if you could write a whole line in one go (theoretically possible in some cases), you would still need to wait for the access in order to gain ownership of the line before writing to it - this protocol is called RFO (read for ownership), and it can take quite a while, especially on a multi-socket system or anything with a complicated memory hierarchy.
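    The line-granularity point can be illustrated with a trivial sketch (this assumes 64-byte lines, which is typical on x86 but not guaranteed everywhere; the function name is hypothetical):

    ```c
    #include <stddef.h>

    /* Two offsets hit the same cache line iff they fall in the same
       64-byte window - so even a 1-byte store has to own the whole
       line it lands in before it can merge its data into it. */
    enum { LINE_SIZE = 64 };

    static size_t line_of(size_t offset)
    {
        return offset / LINE_SIZE;
    }
    ```

    For example, offsets 0 and 63 map to the same line, while offset 64 starts the next one.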

    That said, your 4th assumption may still be correct in some cases, since a load does require the data to arrive before the program can advance, while a store can be buffered and written later when possible. However, the load only stalls the program if it's on some critical path (meaning that some other operation waits for its result), a behavior which your test program doesn't exercise. Since most modern CPUs execute out of order, the following independent instructions are free to proceed without waiting for the load to complete. In your program, there's no loop-carried dependency except the simple index advance (which can run ahead easily), so you're basically not bottlenecked on memory latency but rather on memory throughput, which is a totally different thing. By the way, to add such a dependency, you could emulate linked-list traversal, or even simpler - make sure that the array is initialized to zero (and switch the writes to write zeros only), and add the content of each read value to the index on each iteration (in addition to the increment) - this creates a dependency without changing the addresses themselves. Alternatively, do something nasty like this (assuming that the compiler isn't smart enough to optimize it away...):

        if (onlyWriteToCache)
        {
            buffer[index] = (char)(index % 255);
        }
        else
        {
            // force the loads onto the critical path: the index update
            // depends on the loaded value (the net effect on index is zero,
            // but the CPU cannot know that ahead of time)
            buffer[index] = (char)(buffer[index] % 255);
            index += buffer[index];
            index -= buffer[index];
        }
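
    The zero-initialized-buffer variant mentioned above can be sketched like this (names and signature are hypothetical; the buffer really must be all zeros for the stride to stay fixed):

    ```c
    #include <stddef.h>

    /* The buffer is zeroed, so adding the loaded value to the index never
       changes the effective stride - but the CPU cannot prove that, so
       every address computation waits for the previous load to complete.
       The loads form a latency chain instead of a throughput stream. */
    size_t dependent_walk(const unsigned char *buffer, size_t len, size_t stride)
    {
        size_t index = 0, touched = 0;
        while (index < len) {
            unsigned char v = buffer[index]; /* load on the critical path */
            index += stride + v;             /* depends on v (v is 0) */
            touched++;
        }
        return touched;
    }
    ```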
    

    Now, about the results: it seems that write vs read+write behave the same when you're jumping by the critical stride, as expected (since the read doesn't differ much from the RFO that would be issued by the write anyway). However, for the non-critical stride the read+write operation is much slower. It's hard to tell without knowing the exact system, but this could happen because loads (reads) and stores (writes) are performed at different stages in the lifetime of an instruction - meaning that between the load and the store that follows, you may already have evicted the line and need to fetch it a second time. I'm not too sure about that, but if you want to check, maybe you could add an sfence assembly instruction between the iterations (although that would slow you down significantly).
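    A diagnostic sketch of that fence idea (x86-specific via the SSE intrinsic, and deliberately slow - the function name is hypothetical, and whether sfence alone is the right fence here is itself part of the experiment):

    ```c
    #include <xmmintrin.h> /* _mm_sfence - x86/SSE only */
    #include <stddef.h>

    /* Drain the store buffer after every store; if the store->reload
       distance is what hurts, forcing the drain changes the picture. */
    void fenced_read_write(unsigned char *buffer, size_t len, size_t stride)
    {
        for (size_t index = 0; index < len; index += stride) {
            buffer[index] = (unsigned char)(buffer[index] % 255);
            _mm_sfence(); /* stores complete before the next iteration */
        }
    }
    ```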

    One last note - when you are bandwidth limited, writing can slow you down quite a bit due to another requirement: when you write to memory, you fetch a line into the cache and modify it. Modified lines need to be written back to memory (in reality through a whole set of lower-level caches along the way), which consumes resources and can clog up your machine. Try a read-only loop and see how it goes.
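    A read-only version of the loop, per that last suggestion (a sketch - returning the sum keeps the compiler from eliding the loads):

    ```c
    #include <stddef.h>

    /* No line ever becomes dirty here, so there is no write-back traffic;
       only the fetch side of the memory system is exercised. */
    unsigned long read_only_sum(const unsigned char *buffer, size_t len, size_t stride)
    {
        unsigned long sum = 0;
        for (size_t index = 0; index < len; index += stride)
            sum += buffer[index];
        return sum;
    }
    ```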
