Measuring Cache Latencies

感情败类 2020-11-28 20:16

So I am trying to measure the latencies of the L1, L2, and L3 caches using C. I know their sizes and I feel I understand conceptually how to do it, but I am running into problems.

5 Answers
  •  北海茫月
    2020-11-28 20:40

    Ok, several issues with your code:

    1. As you mentioned, your measurements are taking a long time. In fact, they're very likely to take far longer than the single access itself, so they're not measuring anything useful. To mitigate that, access multiple elements and amortize (divide the overall time by the number of accesses). Note that to measure latency, you want these accesses to be serialized; otherwise they can be performed in parallel and you'll only measure the throughput of unrelated accesses. To achieve that you could just add a false dependency between accesses.

      For example, initialize the array to zeros, and do:

      clock_gettime(CLOCK_REALTIME, &startAccess);   // start clock
      for (int i = 0; i < NUM_ACCESSES; ++i) {
          int tmp = arrayAccess[index];              // serialized load: its result feeds the next index
          index = (index + i + tmp) & 1023;          // tmp is 0 for a zeroed array, but the CPU must still wait for it
      }
      clock_gettime(CLOCK_REALTIME, &endAccess);     // end clock
      

      ... and of course remember to divide the time by NUM_ACCESSES.
      Now, I've made the index intentionally complicated so that you avoid a fixed stride which might trigger a prefetcher (a bit of an overkill; you're not likely to notice an impact, but for the sake of demonstration...). You could probably settle for a simple index += 32, which would give you strides of 128 bytes (two cache lines) and avoid the "benefit" of most simple adjacent-line / simple stream prefetchers. I've also replaced the % 1000 with & 1023 since & is faster, but the size needs to be a power of 2 for it to work the same way - so just increase ACCESS_SIZE to 1024 and it should work.
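      Putting this item together, here's a minimal self-contained sketch of the serialized, amortized measurement (the ACCESS_SIZE and NUM_ACCESSES values are my own picks, not from the question):

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      #define ACCESS_SIZE  1024       /* power of 2, so & (ACCESS_SIZE - 1) wraps correctly */
      #define NUM_ACCESSES 1000000

      int main(void) {
          /* calloc zero-initializes, so tmp is always 0 and the index
             pattern stays deterministic - but the load is still a true
             data dependency the CPU must resolve before the next access */
          int *arrayAccess = calloc(ACCESS_SIZE, sizeof(int));
          struct timespec startAccess, endAccess;
          int index = 0, tmp = 0;

          clock_gettime(CLOCK_REALTIME, &startAccess);        // start clock
          for (int i = 0; i < NUM_ACCESSES; ++i) {
              tmp = arrayAccess[index];                       // serialized access
              index = (index + i + tmp) & (ACCESS_SIZE - 1);  // false dependency on tmp
          }
          clock_gettime(CLOCK_REALTIME, &endAccess);          // end clock

          double ns = (endAccess.tv_sec - startAccess.tv_sec) * 1e9
                    + (endAccess.tv_nsec - startAccess.tv_nsec);
          printf("avg %f ns/access (tmp=%d)\n", ns / NUM_ACCESSES, tmp);  // printing tmp keeps the loads live
          free(arrayAccess);
          return 0;
      }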

    2. Invalidating the L1 by loading something else is good, but the sizes look funny. You didn't specify your system, but 256000 seems pretty big for an L1; an L2 is usually 256k on many common modern x86 CPUs, for example. Also note that 256k is not 256000, but rather 256*1024 = 262144. The same goes for the second size: 1M is not 1024000, it's 1024*1024 = 1048576. I'm assuming that one is meant to be your L2 size (it looks more like an L3, but 1M is probably too small for that).
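      For reference, the power-of-two arithmetic looks like this (these particular sizes are common x86 values, an assumption - not a statement about your machine):

      #define L1_CACHE_SIZE (32 * 1024)          /* 32 KiB  = 32768 bytes   */
      #define L2_CACHE_SIZE (256 * 1024)         /* 256 KiB = 262144 bytes  */
      #define L3_CACHE_SIZE (8 * 1024 * 1024)    /* 8 MiB   = 8388608 bytes */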

    3. Your invalidating arrays are of type int, so each element is longer than a single byte (most likely 4 bytes, depending on the system). You're actually invalidating L1_CACHE_SIZE * sizeof(int) worth of bytes (and the same goes for the L2 invalidation loop).
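      So if you want an eviction buffer to span exactly one cache's worth of bytes, size it in elements (a sketch; dummyL1/dummyL2 are made-up names):

      int dummyL1[L1_CACHE_SIZE / sizeof(int)];   /* N elements * sizeof(int) bytes = L1_CACHE_SIZE bytes */
      int dummyL2[L2_CACHE_SIZE / sizeof(int)];   /* likewise, spans exactly L2_CACHE_SIZE bytes */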

    Update:

    1. memset receives the size in bytes, but your sizes are divided by sizeof(int), so only part of each buffer actually gets written.
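      That is, pass the byte count, not the element count (a sketch using the assumed names from above; it presumes dummyL1 is a real array, not a pointer):

      memset(dummyL1, 0, sizeof(dummyL1));   /* sizeof on the array itself gives bytes */
      memset(dummyL2, 0, L2_CACHE_SIZE);     /* or spell out the byte count directly */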

    2. Your invalidation reads are never used and may be optimized out. Try to accumulate the reads into some value and print it at the end to avoid this possibility.
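      For example (sink is a made-up name; the final printf gives the reads a visible side effect):

      long long sink = 0;
      for (size_t i = 0; i < L2_CACHE_SIZE / sizeof(int); ++i)
          sink += dummyL2[i];                /* eviction read the compiler can't drop... */
      printf("sink = %lld\n", sink);         /* ...because its result is printed */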

    3. The memset at the beginning accesses the data as well, therefore your first loop is accessing data from the L3 (since the other 2 memsets were still effective in evicting it from L1+L2, although only partially due to the size error).

    4. The strides may be too small, so you get two accesses to the same cache line (an L1 hit). Make sure they're spread enough by advancing 32 elements (x4 bytes) - that's 2 cache lines, so you also won't get any adjacent-cache-line prefetch benefits.
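      With 4-byte ints, that fixed skip is just (illustrating the arithmetic only):

      index += 32;   /* 32 elements * 4 bytes = 128 bytes = two 64-byte cache lines */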

    5. Since NUM_ACCESSES is larger than ACCESS_SIZE, you're essentially repeating the same elements and would probably get L1 hits for them (so the average time shifts toward the L1 access latency). Instead, try using the L1 size so you access the entire L1 (except for the skips) exactly once. For example, like this:

      index = 0;
      while (index < L1_CACHE_SIZE) {
          int tmp = arrayAccess[index];                      // access value from L2
          index = (index + tmp + ((index & 4) ? 28 : 36));   // on average a 32-element skip, with changing strides
          count++;                                           // divide the overall time by this
      }
      

    Don't forget to increase arrayAccess to the L1 size.
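    With the count from that loop, the average falls out of the same timespec pair (a sketch, assuming the clock_gettime calls wrap the while loop as before):

      double ns = (endAccess.tv_sec - startAccess.tv_sec) * 1e9
                + (endAccess.tv_nsec - startAccess.tv_nsec);
      printf("L1 Cache Access %f\n", ns / count);   /* average ns per access */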

    Now, with the changes above (more or less), I get something like this:

    L1 Cache Access 7.812500
    L2 Cache Access 15.625000
    L3 Cache Access 23.437500
    

    Which still seems a bit long, but that's possibly because it includes an additional dependency on arithmetic operations.
