Apparently the MSVC++ 2017 toolset v141 (x64 Release configuration) doesn't expose the FYL2X
x86_64 assembly instruction via a C/C++ intrinsic; it relies on C++ log() instead.
Here is the assembly code using FYL2X:
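_DATA SEGMENT
_DATA ENDS
_TEXT SEGMENT
PUBLIC SRLog2MulD
; XMM0L=toLog
; XMM1L=toMul
SRLog2MulD PROC
  movq qword ptr [rsp+16], xmm1  ; spill toMul into the caller-provided shadow space
  movq qword ptr [rsp+8], xmm0   ; spill toLog into the caller-provided shadow space
  fld qword ptr [rsp+16]         ; ST(0) = toMul
  fld qword ptr [rsp+8]          ; ST(0) = toLog, ST(1) = toMul
  fyl2x                          ; ST(1) = ST(1) * log2(ST(0)), then pop
  fstp qword ptr [rsp+8]         ; store the result and leave the x87 stack empty
  movq xmm0, qword ptr [rsp+8]   ; return the result in XMM0
  ret
SRLog2MulD ENDP
_TEXT ENDS
END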
The calling convention follows https://docs.microsoft.com/en-us/cpp/build/overview-of-x64-calling-conventions , which states, for example:
"The x87 register stack is unused. It may be used by the callee, but must be considered volatile across function calls."
The prototype in C++ is:
extern "C" double __fastcall SRLog2MulD(const double toLog, const double toMul);
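For cross-checking the assembly routine, here is a minimal portable sketch of the same computation. FYL2X computes ST(1) * log2(ST(0)), which with the load order above is toMul * log2(toLog). The name RefLog2MulD is hypothetical and not part of the original code.

```cpp
#include <cmath>

// Hypothetical portable reference for the FYL2X-based routine:
// returns toMul * log2(toLog), the value FYL2X leaves on the x87 stack.
double RefLog2MulD(const double toLog, const double toMul) {
    return toMul * std::log2(toLog);
}
```

Comparing SRLog2MulD(x, y) against RefLog2MulD(x, y) for a few inputs is a quick way to confirm that the operands were pushed in the right order.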
The performance is about 1.8 times slower than std::log2()
and more than 3 times slower than std::log():
Log2: 94803174.389 Ops/sec calculated 2513272986.435
FPU Log2: 52008300.525 Ops/sec calculated 2513272986.435
Ln: 169392473.892 Ops/sec calculated 1742068084.525
The benchmarking code is as follows:
void BenchmarkFpuLog2() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i++) {
    sum += SRPlat::SRLog2MulD(double(i), 1);
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  // Elapsed time in seconds: microsecond count scaled by 1e-6.
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  // Printing the sum keeps the optimizer from discarding the loop.
  printf("FPU Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}
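The baseline benchmarks for std::log2() and std::log() are not shown here. A minimal sketch of how the same loop could be shared between variants follows, assuming each variant is wrapped in a function with the same (toLog, toMul) signature. BenchmarkLog and StdLog2MulD are hypothetical names, and std::chrono::steady_clock is swapped in for high_resolution_clock because it is guaranteed to be monotonic.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: times fn(double(i), 1.0) for i in [1, count],
// prints throughput and the checksum, and returns the checksum.
template <typename Fn>
double BenchmarkLog(const char* label, int64_t count, Fn fn) {
    double sum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int64_t i = 1; i <= count; i++) {
        sum += fn(double(i), 1.0);
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    // duration<double> converts the elapsed time to seconds.
    const double nSec = std::chrono::duration<double>(elapsed).count();
    std::printf("%s: %.3f Ops/sec calculated %.3f\n", label, count / nSec, sum);
    return sum;
}

// Hypothetical portable stand-in with the same signature as SRLog2MulD.
inline double StdLog2MulD(double toLog, double toMul) {
    return toMul * std::log2(toLog);
}
```

A call such as BenchmarkLog("Log2", cnLogs, StdLog2MulD) would then produce a line in the same format as the results above, and passing SRPlat::SRLog2MulD instead would time the FYL2X version with identical loop overhead.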