How do I determine the number of x86 machine instructions executed in a C program?

前端 未结 4 1752
感情败类
感情败类 2020-11-27 22:47

I\'m currently working on a homework problem that asks me to find out the number of machine code instructions that are executed when running a short program I wrote in C.

4条回答
  •  孤城傲影
    2020-11-27 23:24

    Terminology: what you're asking for is dynamic instruction count. e.g. counting an instruction inside a loop every time it's executed. This is usually roughly correlated with performance, but instructions-per-cycle can vary wildly.

    • How many CPU cycles are needed for each assembly instruction?
    • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

    Something people also look at is static instruction count (or more usually just code-size, because that's what really matters for instruction-cache footprint, and disk-load times). For variable-length instruction sets like x86, those are correlated but not the same thing. On a RISC with fixed-length instructions, like MIPS or AArch64, it's closer but you still have padding for alignment of the start of functions, for example. That's a totally separate metric. gcc -Os optimizes for code-size while trying not to sacrifice to much speed.


    If you're on Linux, use gcc -O2 foo.c to compile your code. -O2 doesn't enable auto-vectorization for gcc. (It does for clang). It's probably a good baseline level of optimization that will get rid of stuff in your C code that doesn't actually need to happen, to avoid silly differences between using more or fewer tmp variables to break up a big expression. Maybe use -Og if you want minimal optimization, or -O0 if you want really dumb braindead code that compiles each statement separately and never keeps anything in registers between statements. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?).

    Yes, it matters a huge amount how you compile. gcc -O3 -march=native -ffast-math might use a lot fewer instructions, if it auto-vectorizes a loop.

    To stop your code from optimizing away, take an input from a command-line arg, or read it from a volatile variable. Like volatile int size_volatile = 1234; int size = size_volatile;. And return or print a result, because if the program has no side-effects then the most efficient implementation is to just exit immediately.


    Then run perf stat ./a.out. That will use hardware performance counters to give you total instructions executed on behalf of your process, including inside the kernel. (Along with other counters, like CPU core clock cycles, and some software counters like page-faults and time in microseconds.)

    To count only user-space instructions, use perf stat -e instructions:u ./a.out. That will still be a very big number even for a simple "hello world" program, like 180k, because that includes dynamic-linker startup and all the code that runs inside library functions. And CRT startup code that calls your main, and that makes an exit system call with main's return value, if you return instead of calling exit(3).

    You might statically link your C program to reduce that startup overhead, by compiling with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie

    perf counting instructions:u seems to be pretty accurate on my Skylake CPU. A statically-linked x86-64 binary that contains only 2 instructions, mov eax, 231 / syscall, is counted as 3 instructions. Probably there's one extra instruction being counted in the transition between kernel and user mode, but that's pretty minor.

    $ perf stat -e instructions:u ./exit    # hand-written in asm to check for perf overhead
    
     Performance counter stats for './exit':
    
                     3      instructions:u                                              
    
           0.000651529 seconds time elapsed
    

    A statically-linked binary that calls puts twice counts 33,202 instructions:u, compiled with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie hello.c. Seems reasonable for glibc init functions, including stdio, and CRT startup stuff before calling main. (main itself only has 8 instructions, which I checked with objdump -drwC -Mintel a.out | less).


    Other resources:

    • Number of executed Instructions different for Hello World program Nasm Assembly and C

      @MichaelPetch's answer shows how to use an alternate libc (MUSL) that doesn't need startup code to run for its printf to work. So you can compile a C program and set its main as the ELF entry point (and call _exit() instead of returning).

    • How can I profile C++ code running on Linux? There are tons of profiling tools for finding hotspots, and expensive functions (including the time spent in functions they call, i.e. stack backtrace profiling). Mostly this isn't about counting instructions, though.


    Binary instrumentation tools:

    These are the heavy duty tools for counting instructions, including counting only specific kinds of instructions.

    • Intel Pin - A Dynamic Binary Instrumentation Tool
    • Intel® Software Development Emulator (SDE) This is based on PIN, and is handy for things like testing AVX512 code on a dev machine that doesn't support AVX512. (It dynamically recompiles so most instructions run natively, but unsupported instructions call an emulation routine.)

      For example, sde64 -mix -- ./my_program will print an instruction-mix for your program, with total counts for each different instruction, and breakdowns by categories. See libsvm compiled with AVX vs no AVX for an example of the kind of output.

      It also gives you a table of total dynamic instruction counts per-function, as well as per-thread and global. SDE mix output doesn't work well on PIE executable, though: it thinks the dynamic linker is the executable (because it is), so compile with gcc -O2 -no-pie -fno-pie prog.c -o prog. It still doesn't see the puts calls or main itself in the profile output for a hello world test program, though, and I don't know why not.

    • Calculating “FLOP” using Intel® Software Development Emulator (Intel® SDE) An example of using SDE to count certain kinds of instructions, like vfmadd231pd.

      Intel CPUs have HW perf counters for events like fp_arith_inst_retired.256b_packed_double, so you can use those to count FLOPs instead. They actually count FMA as 2 events. So if you have an Intel CPU that can run your code natively, you can do that instead with perf stat -e -e fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.scalar_double. (And/or the events for single-precision.)

      But there aren't events for most other specific kinds of instructions, only FP math.

    This is all Intel stuff; IDK what AMD has, or any stuff for ISAs other than x86. These are just the tools I've heard of; I'm sure there are lots of things I'm leaving out.

提交回复
热议问题