Why does MSVC emit a useless MOVSX before performing this Bit Test?

北城以北 提交于 2019-11-28 09:08:56
Hans Passant

You surely recognize that this optimization was produced by very specific code in the optimizer that looked for the pattern. Just the generation of the bit-mask gives it away. Yes, nice trick.

There are two basic codegen cases here. First one is the more universal one, where (charmax - charmin <= 64) but charmax >= 64. The optimizer needs to generate different code from what you saw, it needs to subtract charmin. That version does not have the MOVSX instruction. You can see it by replacing *s == ' ' by *s == 'A'.

Then there's the special case that you tested, all character codes to test happen to be < 64. The Microsoft programmer did deal with this in his code, he made sure not to generate a silly SUB EAX,0 instruction. But overlooked that generating the MOVSX wasn't necessary. Surely missed by only checking for optimal code in the general case. And a general function call in the code, so easy to overlook, note how the instruction changes to MOVZX when you compile with /J. Otherwise easily deemed necessary, there is no BT instruction that takes an 8-bit register as the 2nd operand so the AL register load isn't enough by itself.

There could be a hypothetical post-optimizer optimizer that optimizes the optimized code generated by the optimizer. And decided to keep MOVSX to improve superscalar execution. I seriously doubt it exists.

As Hans Passant already mentioned your code is a special case of the optimization. I did not look into the compiler/optimizer sources so I can only say what I expect to happen. However, here is my explanation for this.

Your code in C / C++:

while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}

is equivalent to:

while (*s == 32 || *s == 44 || *s == 13 || *s == 12) {
    ++s;
}

or:

while (*s == 12 || *s == 13 || *s == 32 || *s == 44) {
   ++s;
}

The optimizer detects that there is an "if" with multiple (>=3 times) comparisons of the same character. Now, the optimizer generates as many 64bit patterns as needed (up to 4 patterns for an 8 bit char, 64*4 => all 256 possible values). Each of this bit patterns has an offset (= test-value to which the first bit in the pattern corresponds to) that needs to be subtracted from the value in test. In your case only one 64bit-pattern is needed for the values ranging from 12 to 44. So your code gets converted into some generic code like this:

#define ranged_bittest(pattern, value, min_val, max_val) \
  (val >= min_val) && (val <= max_val) && BT_with_offset(pattern, val, min_val)

while ( !ranged_bittest(<BITPATTERN>, *s, 12, 44) ) {
    ++s;
}

Now some other optimization takes ranged_bittest(<BITPATTERN>, *s, 12, 44); as it detects a bittest with start offset 12 and a pattern that can be safely shiftet left by 12 bits (as the upper 12 bits of pattern are zero). ranged_bittest(<BITPATTERN>, *s, 12, 44) => ranged_bittest(<BITPATTERN> << 12, *s, 0, 44)

This evolves to:

while (!(val >= 0 && val <= 44 && BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0))) {
    ++s;
}

The val >= 0 && gets optimized away (as it is assumed to be always true) and the "while" is translated to some "do"-loop with breaks; (which is better for branch prediction in most cases) resulting in:

do {
  if (val > 44) break;
  if (BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;

  ++s;
} while (true);

For any reason the optimization stops here (e.g. because there is a limit on how often further optimizations are applied to a code fragment for performance reasons).

Now in assembly the line

if (BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;`

is translated to something like:

MOV <reg64_A>, const_pattern
MOV <reg64_B>, value
SUB <reg64_B>, const_offset
BT <reg64_A>, <reg64_B>

The assembly optimizer reduces:

MOV <reg64_B>, value
SUB <reg64_B>, 0

to just

MOVSX <reg64_B>, value

in your special case this is:

MOVSX rax, al

The assembly optimizer (somehow) does not have enough information to completely eliminate the sign extension. Maybe the assembly optimizer is quite "dumb" (can't do any logical expression optimizations and stuff, just simple replacements) or the full optimization level isn't activated yet.

It can't therefore remove the movsx rax,al, as it doesn't have any meta information about the possible values of rax or al at this point of code.

I don't KNOW, if this is true, but this my best guess for this case...

What struck me most when I first saw this code is how poorly in fact it is optimized.
Yes it's a neat trick to use a 64 bit register for a lookup table but ...

  • Why still use INC as opposed to the better ADD,1 ?
  • Why CMP ... JA ... MOVSX ... when we know that the BT instruction uses its source operand MOD 64 if its destination operand is a register (So there are 58 bits that need not be considered)?
  • Why not benefit from the fact that conditional branches are predicted to go backwards?
  • Why not reduce the number of branches?

I think a true assembler programmer would have written it more like this :

  mov     rcx, 0FFFFEFFEFFFFDBFFh  ;~0000100100002400h
  sub     rsi, 1
  npad    1
$LL82@myFunc:
  add     rsi, 1
  movzx   eax, BYTE PTR [rsi]   ;mov al,[rsi]
  test    al, 11000000b
  setz    bl
  test    bl, bl
  bt      rcx, rax
  ja      SHORT $LL82@myFunc

ja jumps if (CF or ZF) = 0

  • ZF=1 for AL=[64,255] and don't care about CF
  • ZF=0 for AL=[0,63] and CF=0 for AL={10,13,32,44}

For all the ASCII's in the range [64,255] will the test al, 11000000b instruction give a non-zero result (ZF=0). Because the combo setz bl test bl, bl is then used to flip the ZF to 1, there's no longer any chance for the ja instruction to continue the loop.
On the contrary, for all the ASCII's in the range [0,63] will the ZF end up being 0, thus allowing ja to fully interpret the CF gotten from the bt rcx, rax instruction.

Perhaps we are expecting to much from our optimizing compilers?

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!