Performance difference between two seemingly equivalent assembly codes

倖福魔咒の submitted on 2019-12-23 19:49:41
Question: tl;dr: I have two functionally equivalent pieces of C code that I compile with Clang (the fact that it's C code doesn't matter much; only the assembly is interesting, I think). IACA tells me that one should be faster, but I don't understand why, and my benchmarks show the same performance for the two versions. I have the following C code (ignore #include "iacaMarks.h", IACA_START, IACA_END for now):
ref.c:
    #include "iacaMarks.h"
    #include <x86intrin.h>
    #define AND(a,b) _mm_and_si128(a,b)
    #define OR

What is IACA and how do I use it?

我的梦境 submitted on 2019-12-16 19:58:25
Question: I've found this interesting and powerful tool called IACA (the Intel Architecture Code Analyzer), but I have trouble understanding it. What can I do with it, what are its limitations, and how can I:
    Use it to analyze code in C or C++?
    Use it to analyze code in x86 assembler?
Answer 1:
    2019-04: Reached EOL. Suggested alternative: LLVM-MCA
    2017-11: Version 3.0 released (latest as of 2019-05-18)
    2017-03: Version 2.3 released
What it is: IACA (the Intel Architecture Code Analyzer) is a (2019: end

Micro fusion and addressing modes

北慕城南 submitted on 2019-11-25 23:56:42
Question: I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA). The following instruction using [base+index] addressing
    addps xmm1, xmmword ptr [rsi+rax*1]
does not micro-fuse according to IACA. However, if I use [base+offset] like this
    addps xmm1, xmmword ptr [rsi]
IACA reports that it does fuse. Section 2-11 of the Intel optimization reference manual gives the following as an example "of micro-fused micro-ops that can be handled by all decoders": FADD DOUBLE