Simplest tool to measure a C program's cache hits/misses and CPU time on Linux?

渐次进展 2020-11-29 16:09

I'm writing a small program in C, and I want to measure its performance.

I want to see how much time it spends running on the processor and how many cache hits and misses it incurs.

4 answers
  • 2020-11-29 16:51

    Linux perf_event_open system call with type = PERF_TYPE_HW_CACHE

    perf is likely what OP wants as shown at https://stackoverflow.com/a/10114325/895245 but just for completeness, I'm going to show how to do this from inside a C program if you control the source code.

    This method can allow for more precise measurements of a specific region of interest within the program. It can also get separate cache hit/miss counts for each different cache level. This syscall likely shares the same backend as perf.

    This example is basically the same as the one in "Quick way to count number of instructions executed in a C program", but with PERF_TYPE_HW_CACHE. By running:

    man perf_event_open
    

    you can see that in this example we are counting only:

    • L1 data cache (PERF_COUNT_HW_CACHE_L1D)
    • reads (PERF_COUNT_HW_CACHE_OP_READ), not writes or prefetches
    • misses (PERF_COUNT_HW_CACHE_RESULT_MISS), not hits
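    The three choices listed above correspond to the bit layout that man perf_event_open documents for PERF_TYPE_HW_CACHE events: the cache id in the low byte, the operation shifted left by 8, and the result shifted left by 16. As a minimal sketch, the helper below packs those fields; hw_cache_config and the two wrapper names are made up for illustration and are not part of any kernel or libc API.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel API): pack cache id, operation and
 * result into the config value described in perf_event_open(2). */
uint64_t
hw_cache_config(uint64_t cache_id, uint64_t op_id, uint64_t result_id)
{
    return cache_id | (op_id << 8) | (result_id << 16);
}

/* L1 data cache read misses: the event this answer counts. */
uint64_t
l1d_read_miss_config(void)
{
    return hw_cache_config(PERF_COUNT_HW_CACHE_L1D,
                           PERF_COUNT_HW_CACHE_OP_READ,
                           PERF_COUNT_HW_CACHE_RESULT_MISS);
}

/* L1 data cache read accesses: same cache and operation, but selecting
 * all accesses instead of only the misses. */
uint64_t
l1d_read_access_config(void)
{
    return hw_cache_config(PERF_COUNT_HW_CACHE_L1D,
                           PERF_COUNT_HW_CACHE_OP_READ,
                           PERF_COUNT_HW_CACHE_RESULT_ACCESS);
}
```

    Assigning l1d_read_miss_config() to pe.config gives the same value as the expression used in the program in this answer. Opening a second counter with l1d_read_access_config() and subtracting the miss count from it would give an approximate hit count, assuming the CPU supports both events.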

    perf_event_open.c

    #define _GNU_SOURCE
    #include <asm/unistd.h>
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    
    #include <inttypes.h>
    
    static long
    perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                    int cpu, int group_fd, unsigned long flags)
    {
        int ret;
    
        ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                        group_fd, flags);
        return ret;
    }
    
    int
    main(int argc, char **argv)
    {
        struct perf_event_attr pe;
        long long count;
        int fd;
        char *chars;
        /* volatile so the compiler cannot optimize the read loop away. */
        volatile char c;
    
        uint64_t n;
        if (argc > 1) {
            n = strtoull(argv[1], NULL, 0);
        } else {
            n = 10000;
        }
    
        chars = malloc(n * sizeof(char));
        if (chars == NULL) {
            perror("malloc");
            exit(EXIT_FAILURE);
        }
    
        memset(&pe, 0, sizeof(struct perf_event_attr));
        pe.type = PERF_TYPE_HW_CACHE;
        pe.size = sizeof(struct perf_event_attr);
        pe.config = PERF_COUNT_HW_CACHE_L1D |
                    PERF_COUNT_HW_CACHE_OP_READ << 8 |
                    PERF_COUNT_HW_CACHE_RESULT_MISS << 16;
        pe.disabled = 1;
        pe.exclude_kernel = 1;
        // Don't count hypervisor events.
        pe.exclude_hv = 1;
    
        fd = perf_event_open(&pe, 0, -1, -1, 0);
        if (fd == -1) {
            fprintf(stderr, "Error opening leader %llx\n", pe.config);
            exit(EXIT_FAILURE);
        }
    
        /* Write the memory to ensure misses later. */
        for (size_t i = 0; i < n; i++) {
            chars[i] = 1;
        }
    
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    
        /* Read from memory. */
        for (size_t i = 0; i < n; i++) {
            c = chars[i];
        }
    
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
            perror("read");
            exit(EXIT_FAILURE);
        }
    
        printf("%lld\n", count);
    
        close(fd);
        free(chars);
    }
    

    With this, I get results increasing linearly like:

    ./main.out 100000
    # 1565
    ./main.out 1000000
    # 15632
    ./main.out 10000000
    # 156641
    

    From this we can estimate a cache line size of 100000/1565 ≈ 63.9 bytes, which is very close to the actual value of 64 reported by getconf LEVEL1_DCACHE_LINESIZE on my computer, so I guess it is working.
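    That estimate is simply bytes read divided by misses counted. A small sketch applying it to the three runs quoted above, assuming a sequential scan in which each L1D read miss brings in exactly one new cache line (estimate_line_size and print_estimates are illustrative names, not from the original program):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative helper: for a sequential scan, bytes / misses approximates
 * the cache line size in bytes. */
double
estimate_line_size(uint64_t bytes_read, uint64_t misses)
{
    if (misses == 0) {
        return 0.0; /* No misses counted: nothing to estimate. */
    }
    return (double)bytes_read / (double)misses;
}

/* Print the estimate for each of the three measurements quoted above. */
void
print_estimates(void)
{
    static const uint64_t runs[][2] = {
        {100000, 1565}, {1000000, 15632}, {10000000, 156641},
    };

    for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++) {
        printf("%.1f\n", estimate_line_size(runs[i][0], runs[i][1]));
    }
}
```

    All three runs come out just under 64, which is what one would expect if hardware prefetching and other activity on the core add only a small amount of noise.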

  • 2020-11-29 17:02

    Use perf:

    perf stat ./yourapp
    

    See the kernel wiki perf tutorial for details. This uses the hardware performance counters of your CPU, so the overhead is very small.

    Example from the wiki:

    perf stat -B dd if=/dev/zero of=/dev/null count=1000000
    
    Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':
    
            5,099 cache-misses             #      0.005 M/sec (scaled from 66.58%)
          235,384 cache-references         #      0.246 M/sec (scaled from 66.56%)
        9,281,660 branch-misses            #      3.858 %     (scaled from 33.50%)
      240,609,766 branches                 #    251.559 M/sec (scaled from 33.66%)
    1,403,561,257 instructions             #      0.679 IPC   (scaled from 50.23%)
    2,066,201,729 cycles                   #   2160.227 M/sec (scaled from 66.67%)
              217 page-faults              #      0.000 M/sec
                3 CPU-migrations           #      0.000 M/sec
               83 context-switches         #      0.000 M/sec
       956.474238 task-clock-msecs         #      0.999 CPUs
    
       0.957617512  seconds time elapsed
    

    There is no need to load a kernel module manually; on a modern Debian system (with the linux-base package) it should just work. With the perf record -a / perf report combo you can also do full-system profiling. Any application or library that has debugging symbols will show up with details in the report.
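    The derived columns in the sample perf stat output above are plain ratios of the raw counters, and the cache miss ratio the question asks about can be computed the same way. A minimal sketch (the function names are illustrative and not part of perf):

```c
#include <stdint.h>

/* Fraction a / b, or 0 when b is zero. */
double
ratio(uint64_t a, uint64_t b)
{
    return b == 0 ? 0.0 : (double)a / (double)b;
}

/* Share of cache references that missed. */
double
cache_miss_ratio(uint64_t cache_misses, uint64_t cache_references)
{
    return ratio(cache_misses, cache_references);
}

/* Instructions per cycle, as shown in the "IPC" column. */
double
instructions_per_cycle(uint64_t instructions, uint64_t cycles)
{
    return ratio(instructions, cycles);
}
```

    With the sample numbers, cache_miss_ratio(5099, 235384) is about 0.0217, so roughly 2.2% of cache references missed, and instructions_per_cycle(1403561257, 2066201729) is about 0.679, matching the IPC figure perf printed. The "scaled from" notes mean the counters were multiplexed, so these values are estimates.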

    For visualization, flame graphs seem to work well. (Update 2020: the hotspot UI has flame graphs integrated.)

  • 2020-11-29 17:02

    You can also use

    /usr/bin/time -v YourProgram.exe
    

    It will show you all this information:

    /usr/bin/time -v ls
        Command being timed: "ls"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 60%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4080
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 314
        Voluntary context switches: 1
        Involuntary context switches: 1
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
    

    You can also use the -f flag to format the output to fit your needs.

    Please be sure to call this program using its full path; otherwise the shell's built-in 'time' command will be used, and that's not what you need.
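    If you control the source, similar figures (user and system CPU time, page faults, context switches) can also be read from inside the program with getrusage(2), which can help when only part of the run is of interest. A minimal sketch, where timeval_seconds and cpu_times are illustrative helper names rather than library functions:

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
double
timeval_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Illustrative helper: store the user and system CPU time consumed so far
 * by the calling process. Returns 0 on success, -1 if getrusage() fails. */
int
cpu_times(double *user_s, double *system_s)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return -1;
    }
    *user_s = timeval_seconds(ru.ru_utime);
    *system_s = timeval_seconds(ru.ru_stime);
    return 0;
}
```

    Calling cpu_times() before and after a region of interest and subtracting the two readings gives the CPU time spent in that region. The same struct rusage also carries ru_minflt, ru_majflt, ru_nvcsw and ru_nivcsw for page faults and context switches.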

    Hope this helps!

  • 2020-11-29 17:13

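    The best tool for you is called valgrind. It is capable of memory profiling, call-graph building and much more; for cache behaviour specifically, its Cachegrind tool (valgrind --tool=cachegrind) simulates the cache hierarchy.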

    sudo apt-get install valgrind
    valgrind ./yourapp
    

    However, to obtain the time your program took to execute, you can use the time(1) Linux utility.

    time ./yourapp
    