How to profile the number of additions, multiplications etc. with VTune

Submitted by 懵懂的女人 on 2020-01-06 04:26:12

Question


I am able to profile my C++ library's instruction counts with Vtune using the 'INST_RETIRED.ANY' event.

What analysis types or events can be used to profile the number of integer/floating point additions, multiplications, divisions etc.?


Answer 1:


(tl;dr): I don't think you can do everything you want with perf counters. See the end of this answer for a possible way using binary instrumentation.

Also note that imul is not an expensive operation, and FP mul is barely more expensive than add. e.g. on Skylake, mulps, addps, and fma all have the same performance (throughput, latency, uops, and choice of execution ports). On pre-Skylake (e.g. Haswell), add had lower latency but only half the throughput of mul, since it ran on a single dedicated add unit.


It's not so much what VTUNE can do, as what the hardware performance counters can count. e.g. this table of perf-counter events from Linux oprofile came up when I searched for Sandybridge perf counters. Also this more-complete listing for Linux perf. If the hardware can count it, I assume VTUNE can show it to you, once you find the right name.

Test these counters on simple code with known behaviour, to make sure they work the way you expect when you already know what the code is doing.
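As a rough sketch of what "simple code with known behaviour" could look like: a dot product does exactly one multiply and one add per element at the source level, so you know what a counter should report. The names `dot` and `expected_ops` are made up for illustration, not part of any tool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Source-level FP operation counts we expect for one call to dot()
// on n elements. (Hypothetical helper, just for comparing against counters.)
struct KnownOps {
    std::size_t fp_mul;
    std::size_t fp_add;
};

KnownOps expected_ops(std::size_t n) {
    return KnownOps{n, n};
}

// Test kernel: one multiply and one add per element, no divides.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];  // 1 mul + 1 add at the source level
    }
    return sum;
}
```

Call `dot` on large vectors from a small driver program, profile it, and compare the event counts against `expected_ops(n)`. Keep in mind that the compiler may change the instruction mix: with FMA available it may contract the mul and add into one instruction, and if it vectorizes the loop, each packed instruction does several source-level operations. Check the generated asm before trusting the comparison.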

I only looked through what Sandybridge supports. I assume Haswell/Skylake have these events, too, and probably more. You didn't say what CPU you have, so I'm not going to check all of them.

Pre-SnB CPUs don't have nearly as wide a selection of perf counters, IIRC. Intel improved perf counters a lot in SnB, along with other big changes to the core. The changes were big enough that it's generally considered a new microarchitecture family, separate from the P6 family (PPro-Nehalem).


I don't think you can distinguish integer add from integer mul, or FP add from FP mul. You can count FP activity, though: FP_COMP_OPS_EXE "Counts number of floating point events", with masks for x87 and {packed,scalar}{single,double}.

There's also SIMD_FP_256, which counts only 256b vector FP ops.

There's a counter for FP-assist events (when an FP operation needs to fall back to microcode to handle a denormal or something).

I'm not sure this is right, but the perf listing says there's a PARTIAL_RAT_STALLS with Umask-02 : 0x80: [MUL_SINGLE_UOP]: Number of Multiply packed/scalar single precision uops allocated. It's odd that there's not a similar double-precision counter. Or maybe mulss is somehow special in partial-register behaviour, since PARTIAL_RAT_STALLS has another sub-event to count partial-register merging uops.


divide (div / divps) is slow enough to be worth having a special counter, though: SnB's arith.fpu_div counter = "Number of times that the divider is actived, includes INT, SIMD and FP." There's also a counter for number of cycles the divider is active, rather than the number of times it was activated.


How to count instructions:

Intel's Pin is a dynamic binary instrumentation framework for the IA-32 and x86-64 instruction-set architectures that enables the creation of dynamic program analysis tools

I don't have VTUNE, but there may be ways to use Pin tools from within VTUNE. It will make your code run slower, potentially a lot slower. I think it works by JIT-compiling from normal machine code to instrumented machine code, where the instrumentation is extra instructions to increment counters. It might have other modes of operation, more like single-stepping the original code and counting stuff along the way.
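However you get per-instruction counts out of an instrumentation tool, you still have to bucket them into "add", "mul" and "div". Here is a minimal sketch of that step, assuming the tool gives you a histogram of upper-case x86 mnemonic to executed count. The histogram format and the names `classify` and `tally` are my assumptions, not part of Pin or VTune.

```cpp
#include <cstdint>
#include <map>
#include <string>

enum class OpClass { Add, Mul, Div, Other };

// True if m is stem + one of the SSE/AVX FP suffixes, e.g. "MULPD".
static bool is_fp_form(const std::string& m, const std::string& stem) {
    if (m.size() != stem.size() + 2 || m.compare(0, stem.size(), stem) != 0) {
        return false;
    }
    const std::string suffix = m.substr(stem.size());
    return suffix == "SS" || suffix == "SD" || suffix == "PS" || suffix == "PD";
}

// Classify an upper-case mnemonic. Deliberately incomplete: only basic
// integer, x87 and scalar/packed SSE/AVX forms are recognised.
OpClass classify(std::string m) {
    if (m.size() > 1 && m[0] == 'V') m.erase(0, 1);  // VADDPS -> ADDPS
    if (m == "ADD" || m == "FADD" || m == "FADDP" || is_fp_form(m, "ADD"))
        return OpClass::Add;
    if (m == "MUL" || m == "IMUL" || m == "FMUL" || m == "FMULP" || is_fp_form(m, "MUL"))
        return OpClass::Mul;
    if (m == "DIV" || m == "IDIV" || m == "FDIV" || m == "FDIVP" || is_fp_form(m, "DIV"))
        return OpClass::Div;
    return OpClass::Other;
}

struct OpTotals {
    std::uint64_t add = 0, mul = 0, div = 0, other = 0;
};

// Sum executed-instruction counts per arithmetic class.
OpTotals tally(const std::map<std::string, std::uint64_t>& histogram) {
    OpTotals t;
    for (const auto& entry : histogram) {
        switch (classify(entry.first)) {
            case OpClass::Add:   t.add   += entry.second; break;
            case OpClass::Mul:   t.mul   += entry.second; break;
            case OpClass::Div:   t.div   += entry.second; break;
            case OpClass::Other: t.other += entry.second; break;
        }
    }
    return t;
}
```

This counts instructions, not arithmetic operations: a packed instruction like addps is one instruction but several additions, FMA instructions do a multiply and an add each and land in "other" here, and additions done via lea or inc are not counted as adds. Extend the table to match whatever your code actually compiles to.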



Source: https://stackoverflow.com/questions/36650210/how-to-profile-the-number-of-additions-mutltiplications-etc-with-vtune
