How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?

孤街浪徒 2020-12-20 02:25

How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?

I'm interested in all of the following cases:

  • full sy

1 answer
  • 2020-12-20 02:46

    m5 tool

    A good approximation is to run, ideally from a shell script that is the /init program:

    m5 resetstats
    run-benchmark
    m5 dumpstats
    

    Then on host:

    grep -E '^system.cpu.numCycles ' m5out/stats.txt
    

    Gives something like:

    system.cpu.numCycles                      33942872680                       # number of cpu cycles simulated
    

    Note that if you restore from an m5 checkpoint with a different CPU, e.g.:

    --restore-with-cpu=HPI --caches
    

    then you need to grep for a different identifier:

    grep -E '^system.switch_cpus.numCycles ' m5out/stats.txt
    

    resetstats zeroes out the cumulative stats, and dumpstats dumps what has been collected during the benchmark.
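
    If dumpstats runs more than once, stats.txt contains one block per dump, so a plain grep returns several lines. The following is a minimal sketch of a helper that labels each match with its dump number. The function name cycles_per_dump is hypothetical, and it only matches the two stat names shown above, so adjust the pattern if your CPUs are named differently:

```shell
#!/bin/sh
# Hypothetical helper: print numCycles for every stats dump in a gem5 stats.txt.
# Assumes each dump starts with a "Begin Simulation Statistics" banner line.
cycles_per_dump() {
    awk '
        /Begin Simulation Statistics/ { dump++ }
        # Matches system.cpu.numCycles and system.switch_cpus.numCycles only
        $1 ~ /^system\.(cpu|switch_cpus)\.numCycles$/ {
            printf "dump %d: %s = %s\n", dump, $1, $2
        }
    ' "$1"
}
```

    Usage on the host would then be: cycles_per_dump m5out/stats.txt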

    This is not perfect, since some time passes between the exec syscall for m5 resetstats finishing and the benchmark starting (and again between the benchmark ending and m5 dumpstats taking effect), but if the benchmark runs long enough, this shouldn't matter.
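
    The reset/run/dump sequence can be wrapped in a small guest-side function so that every benchmark is measured the same way. This is only a sketch: the name measure and the M5 override variable are hypothetical conveniences, not part of gem5. Setting M5=echo allows a dry run outside the simulator:

```shell
#!/bin/sh
# Hypothetical wrapper: run a command between m5 resetstats and m5 dumpstats.
# M5 can be overridden (e.g. M5=echo) to dry-run on a machine without the m5 tool.
M5="${M5:-m5}"

measure() {
    "$M5" resetstats      # zero the cumulative stats
    "$@"                  # run the benchmark command given as arguments
    status=$?
    "$M5" dumpstats       # write the stats collected since the reset
    return "$status"      # propagate the benchmark's exit status
}
```

    Inside the guest this would be called as: measure run-benchmark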

    http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics:

    #!/bin/sh
    # Wait for system to calm down
    sleep 10
    # Take a checkpoint in 100000 ns
    m5 checkpoint 100000
    # Reset the stats
    m5 resetstats
    run-benchmark
    # Exit the simulation
    m5 exit
    

    m5 exit also works, since gem5 dumps stats when it finishes.

    Instrumentation instructions

    Sometimes it seems unavoidable to modify the benchmark source code a bit by adding these instructions, in order to:

    • skip initialization and go directly to steady state
    • evaluate individual main loop runs

    You can of course deduce those instructions from the gem5 m5 tool source code, but here are some easy-to-reuse one-line copy-pastes for arm and aarch64, e.g. for aarch64:

    /* resetstats */
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1");
    /* dumpstats */
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1");
    

    The m5 tool uses the same mechanism under the hood, but by adding the instructions directly into the source we avoid the exec syscall, so the measurement is more precise and representative (at the cost of more manual work).

    To ensure that the compiler does not reorder the assembly around your ROI, however, you might want to use the techniques mentioned at: Enforcing statement order in C++

    Address monitoring

    Another technique that can be used is to monitor addresses of interest instead of adding magic instructions to the source.

    E.g., if you know that a benchmark starts with PC == 0x400, it should be possible to do something when that address is hit.

    To find the addresses of interest, you would have to use, for example, readelf, gdb, or tracing, and, if running full system on top of Linux, ensure that ASLR is turned off.
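
    As a rough sketch of the address lookup, the helper below extracts a symbol's address from nm-style output (readelf -s would need a different field layout). The name sym_addr is hypothetical, and the printed value only equals the runtime PC under the assumption of a statically linked, non-PIE executable with ASLR disabled:

```shell
#!/bin/sh
# Hypothetical helper: print the address of a symbol from `nm`-style input on stdin.
# Expects lines of the form "<hex address> <type> <name>".
sym_addr() {
    awk -v sym="$1" '$3 == sym { print "0x" $1; exit }'
}

# Example use on the host (benchmark path is a placeholder):
#   nm ./benchmark | sym_addr main
# On a Linux guest, ASLR can be disabled with:
#   echo 0 > /proc/sys/kernel/randomize_va_space
```

    Undefined symbols appear in nm output without an address, so they have fewer than three fields and are never matched.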

    This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.
