Why does TZCNT work for my Sandy Bridge processor?

Submitted by 99封情书 on 2020-06-27 08:17:19

Question


I'm running a Core i7 3930k, which is of the Sandy Bridge microarchitecture. When executing the following code (compiled under MSVC19, VS2015), the results surprised me (see in comments):

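#include <bitset>
#include <cstdint>
#include <iostream>
#include <immintrin.h> // _tzcnt_u64
#include <intrin.h>    // __cpuidex (MSVC)
using namespace std;

int wmain(int argc, wchar_t* argv[])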
{
    uint64_t r = 0b1110'0000'0000'0000ULL;
    uint64_t tzcnt = _tzcnt_u64(r);
    cout << tzcnt << endl; // prints 13

    int info[4]{};
    __cpuidex(info, 7, 0);
    int ebx = info[1];
    cout << bitset<32>(ebx) << endl; // prints 32 zeros (including the bmi1 bit)

    return 0;
}

Disassembly shows that the tzcnt instruction is indeed emitted from the intrinsic:

    uint64_t r = 0b1110'0000'0000'0000ULL;
00007FF64B44877F 48 C7 45 08 00 E0 00 00 mov         qword ptr [r],0E000h  
    uint64_t tzcnt = _tzcnt_u64(r);
00007FF64B448787 F3 48 0F BC 45 08    tzcnt       rax,qword ptr [r]  
00007FF64B44878D 48 89 45 28          mov         qword ptr [tzcnt],rax  

How come I'm not getting a #UD invalid opcode exception, and the instruction functions correctly, even though the CPU reports that it does not support the instruction?

Could this be some weird microcode revision that contains an implementation for the instruction but doesn't report support for it (and others included in bmi1)?

I haven't checked the rest of the bmi1 instructions, but I'm wondering how common a phenomenon this is.


Answer 1:


The reason that Sandy Bridge (and earlier) processors seem to support lzcnt and tzcnt is that both instructions have a backward compatible encoding.

lzcnt eax,eax  = rep bsr eax,eax
tzcnt eax,eax  = rep bsf eax,eax

On older processors the rep prefix is simply ignored, so the instruction executes as plain bsr or bsf.
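
To make the encoding overlap concrete, here is a small sketch using the instruction bytes from the question's disassembly. The helper name strip_rep_prefix is hypothetical and only models the effect of a CPU ignoring the F3 (rep) prefix; it is not how any decoder is actually implemented.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: drops a leading F3 (rep) prefix byte, mimicking a
// pre-BMI1 CPU on which the prefix has no effect for bsf/bsr.
std::vector<std::uint8_t> strip_rep_prefix(std::vector<std::uint8_t> bytes) {
    if (!bytes.empty() && bytes.front() == 0xF3) {
        bytes.erase(bytes.begin());
    }
    return bytes;
}

// tzcnt rax, qword ptr [r] exactly as shown in the question's disassembly.
const std::vector<std::uint8_t> tzcnt_bytes{0xF3, 0x48, 0x0F, 0xBC, 0x45, 0x08};

// REX.W + 0F BC /r with the same ModRM and displacement: bsf rax, qword ptr [rbp+8].
const std::vector<std::uint8_t> bsf_bytes{0x48, 0x0F, 0xBC, 0x45, 0x08};
```

Stripping the prefix from tzcnt_bytes leaves exactly bsf_bytes, which is the instruction a Sandy Bridge core ends up executing.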

So much for the good news.
The bad news is that the two versions have different semantics.

lzcnt eax,zero       => eax = 32, CF=1, ZF=0
bsr   eax,zero       => eax = undefined, ZF=1
lzcnt eax,0xFFFFFFFF => eax = 0, CF=0, ZF=1    // dest = number of leading (most significant) zero bits
bsr   eax,0xFFFFFFFF => eax = 31, ZF=0         // dest = bit index of highest set bit


tzcnt eax,zero       => eax = 32, CF=1, ZF=0
bsf   eax,zero       => eax = undefined, ZF=1
tzcnt eax,0xFFFFFFFF => eax = 0, CF=0, ZF=1    // dest = number of trailing (least significant) zero bits
bsf   eax,0xFFFFFFFF => eax = 0, ZF=0          // dest = bit index of lowest set bit
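
The cases listed above can be written down as plain C++ models of the four 32-bit instructions. This is only a sketch: the struct and function names are made up for illustration, and an "undefined" destination is modelled with a dest_defined flag rather than whatever a real CPU happens to leave in the register.

```cpp
#include <cassert>
#include <cstdint>

struct BitScan {            // result of bsf / bsr
    bool dest_defined;      // false models "eax = undefined"
    std::uint32_t dest;
    bool zf;
};

struct BitCount {           // result of tzcnt / lzcnt
    std::uint32_t dest;
    bool cf;
    bool zf;
};

// bsf: bit index of the lowest set bit; ZF=1 when the source is zero.
BitScan model_bsf(std::uint32_t src) {
    if (src == 0) return {false, 0, true};
    std::uint32_t i = 0;
    while (((src >> i) & 1u) == 0) ++i;
    return {true, i, false};
}

// bsr: bit index of the highest set bit; ZF=1 when the source is zero.
BitScan model_bsr(std::uint32_t src) {
    if (src == 0) return {false, 0, true};
    std::uint32_t i = 31;
    while (((src >> i) & 1u) == 0) --i;
    return {true, i, false};
}

// tzcnt: number of trailing zeros; CF=1 for a zero source, ZF=1 for a zero result.
BitCount model_tzcnt(std::uint32_t src) {
    if (src == 0) return {32, true, false};
    BitScan s = model_bsf(src);
    assert(s.dest_defined);
    return {s.dest, false, s.dest == 0};
}

// lzcnt: number of leading zeros, i.e. 31 - bsr for a non-zero source.
BitCount model_lzcnt(std::uint32_t src) {
    if (src == 0) return {32, true, false};
    BitScan s = model_bsr(src);
    assert(s.dest_defined);
    return {31 - s.dest, false, (31 - s.dest) == 0};
}
```

With the question's value 0xE000, model_bsf and model_tzcnt both give 13, which matches what the Sandy Bridge machine printed.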

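At least bsf and tzcnt produce the same destination value when the source is non-zero, which is why the tzcnt in the question printed the correct result. bsr and lzcnt do not agree on that: for a non-zero 32-bit source, lzcnt returns 31 minus what bsr returns.
Also, on some processors (AMD in particular) lzcnt and tzcnt execute much faster than bsr/bsf.
It totally sucks that bsf and tzcnt cannot agree on the flag usage. This needless inconsistency means that I cannot use tzcnt as a drop-in replacement for bsf unless I can be sure its source is non-zero.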
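
One way to stay safe is to handle the zero case explicitly and to test the BMI1 feature bit before relying on real tzcnt behaviour. The following is a minimal sketch, not a definitive implementation: has_bmi1 and trailing_zeros_u64 are hypothetical names, and the loop stands in for the intrinsic so that the example compiles on any platform. BMI1 is reported in CPUID leaf 7 (sub-leaf 0), EBX bit 3, which is the bit the question found cleared.

```cpp
#include <cstdint>

// True if the EBX value returned by CPUID leaf 7, sub-leaf 0 has the BMI1
// bit (bit 3) set. Pass in info[1] from __cpuidex(info, 7, 0) as in the question.
constexpr bool has_bmi1(std::uint32_t leaf7_ebx) {
    return ((leaf7_ebx >> 3) & 1u) != 0;
}

// Trailing-zero count that follows the tzcnt convention for zero (returns 64).
// Because zero is handled up front, the non-zero path would give the same
// answer whether F3 0F BC executes as tzcnt or as bsf; on MSVC the loop
// could be replaced by _tzcnt_u64(x) for that reason.
inline std::uint64_t trailing_zeros_u64(std::uint64_t x) {
    if (x == 0) return 64;
    std::uint64_t n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}
```

Note that lzcnt has its own feature bit, separate from BMI1 (CPUID leaf 0x80000001, ECX bit 5), so it needs a separate check.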



Source: https://stackoverflow.com/questions/43880227/why-does-tzcnt-work-for-my-sandy-bridge-processor
