Measuring Cache Latencies

感情败类 2020-11-28 20:16

I am trying to measure the latencies of the L1, L2, and L3 caches using C. I know their sizes and I think I understand conceptually how to do it, but I am running into problems.

5 answers
  •  夕颜 (original poster)
    2020-11-28 20:48

    A widely used, classic test for cache latency is iterating over a linked list. Each load needs the result of the previous one, so the loads cannot overlap, which is why the method works even on modern superscalar/superpipelined CPUs and on out-of-order cores like ARM Cortex-A9+ and Intel Core 2/ix. This method is used by the open-source lmbench suite in its lat_mem_rd test (see the man page) and by the CPU-Z latency measurement tool: http://cpuid.com/medias/files/softwares/misc/latency.zip (native Windows binary)

    The source of the lat_mem_rd test from lmbench is available here: https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_mem_rd.c

    The main test is:

    #define ONE p = (char **)*p;
    #define FIVE    ONE ONE ONE ONE ONE
    #define TEN FIVE FIVE
    #define FIFTY   TEN TEN TEN TEN TEN
    #define HUNDRED FIFTY FIFTY
    
    void
    benchmark_loads(iter_t iterations, void *cookie)
    {
        struct mem_state* state = (struct mem_state*)cookie;
        register char **p = (char**)state->p[0];
        register size_t i;
        register size_t count = state->len / (state->line * 100) + 1;
    
        while (iterations-- > 0) {
            for (i = 0; i < count; ++i) {
                HUNDRED;
            }
        }
    
        use_pointer((void *)p);
        state->p[0] = (char*)p;
    }
    

    So, after expanding the macros, we perform a long chain of dependent loads like:

     p = (char**) *p;  // (in intel syntax) == mov eax, [eax]
     p = (char**) *p;
     p = (char**) *p;
     ....   // 100 times total
     p = (char**) *p;
    

    over memory filled with pointers, each one pointing one stride forward.

    As the man page (http://www.bitmover.com/lmbench/lat_mem_rd.8.html) says:

    The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point forward one stride. Traversing the array is done by

     p = (char **)*p;
    

    in a for loop (the over head of the for loop is not significant; the loop is an unrolled loop 1000 loads long). The loop stops after doing a million loads. The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the cache will have an effect, and the loads will be much faster. This becomes much more apparent when the data is plotted.

    A more detailed description, with examples on POWER systems, is available on IBM's wiki: "Untangling memory access measurements - lat_mem_rd" by Jenifer Hopper (2013):

    The lat_mem_rd test (http://www.bitmover.com/lmbench/lat_mem_rd.8.html) takes two arguments, an array size in MB and a stride size. The benchmark uses two loops to traverse through the array, using the stride as the increment by creating a ring of pointers that point forward one stride. The test measures memory read latency in nanoseconds for the range of memory sizes. The output consists of two columns: the first is the array size in MB (the floating point value) and the second is the load latency over all the points of the array. When the results are graphed, you can clearly see the relative latencies of the entire memory hierarchy, including the faster latency of each cache level, and the main memory latency.

    PS: There is a paper from Intel (thanks to Eldar Abusalimov) with examples of running lat_mem_rd: "Measuring Cache and Memory Latency and CPU to Memory Bandwidth - For use with Intel Architecture" by Joshua Ruggiero, December 2008. The correct URL is http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-cache-latency-bandwidth-paper.pdf (the link originally posted here was ftp://download.intel.com/design/intarch/PAPERS/321074.pdf).
