Strange JIT pessimization of a loop idiom

Posted by 你说的曾经没有我的故事 on 2019-11-28 09:03:58

Why the JITC tries to group everything together is rather unclear to me. AFAIK there are (or were?) architectures on which grouping two loads leads to better performance (some early Pentium, I think).

As the JITC knows the hot spots, it can unroll much more aggressively than an ahead-of-time compiler, and in this case it does so 16 times. I can't see any clear advantage here, other than making the loop overhead relatively cheaper. I also doubt that any architecture profits from grouping 16 loads together.

The code computes 16 temporary values, one per iteration of

int j = i & array.length-1;
int entry = array[i];
int tmp = entry + j;
result ^= tmp;

Each computation is pretty trivial: one AND, one LOAD, and one ADD. The values should all be mapped to registers, but there aren't enough registers for them, so some values have to be stored to memory and loaded again later.

This happens for 7 of the 16 temporary values and increases the cost significantly.
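To make the register-pressure argument concrete, here is a minimal source-level sketch of the transformation. It is an illustration only: the class and method names are hypothetical, the unroll factor is 4 rather than 16 to keep it short, and the assumption that the JIT output has this "loads first, arithmetic later" shape is taken from the description above, not from inspecting the assembly.

```java
// Hypothetical illustration; names are not taken from the original benchmark.
final class UnrollSketch {

    // The loop as written in source: one AND, one LOAD, one ADD per iteration.
    static int rolled(int[] array) {
        int result = 0;
        for (int i = 0; i < array.length; i++) {
            int j = i & (array.length - 1);
            int entry = array[i];
            result ^= entry + j;
        }
        return result;
    }

    // Hand-unrolled by 4 with the loads grouped up front. Each entryN has to
    // stay live until it is consumed, so the number of simultaneously live
    // values grows with the unroll factor (assumed shape of the JIT output).
    static int unrolledBy4(int[] array) {
        int result = 0;
        int mask = array.length - 1;
        int i = 0;
        for (; i + 3 < array.length; i += 4) {
            int entry0 = array[i];
            int entry1 = array[i + 1];
            int entry2 = array[i + 2];
            int entry3 = array[i + 3];
            result ^= entry0 + (i & mask);
            result ^= entry1 + ((i + 1) & mask);
            result ^= entry2 + ((i + 2) & mask);
            result ^= entry3 + ((i + 3) & mask);
        }
        // Remainder loop for lengths that are not a multiple of 4.
        for (; i < array.length; i++) {
            result ^= array[i] + (i & mask);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] array = new int[1024];
        for (int i = 0; i < array.length; i++) {
            array[i] = i * 31 + 7;
        }
        // Both variants compute the same value; only the schedule differs.
        System.out.println(rolled(array) == unrolledBy4(array));
    }
}
```

With 4 grouped loads there are still plenty of registers; scale the same pattern to 16 and the temporaries, the indices, the mask, the array base and the accumulator no longer all fit, which is where the extra stores and loads would come from.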

Update

I tried to verify this with -XX:LoopUnrollLimit, though I'm not sure how conclusive that is:

LoopUnrollLimit  Benchmark               Mean  Mean error  Units

              8  ..normalIndex          0.902       0.004  ns/op
              8  ..normalWithExitPoint  0.913       0.005  ns/op
              8  ..maskedIndex          0.918       0.006  ns/op
              8  ..maskedWithExitPoint  0.996       0.008  ns/op

             16  ..normalIndex          0.769       0.003  ns/op
             16  ..normalWithExitPoint  0.930       0.004  ns/op
             16  ..maskedIndex          0.937       0.004  ns/op
             16  ..maskedWithExitPoint  1.012       0.003  ns/op

             32  ..normalIndex          0.814       0.003  ns/op
             32  ..normalWithExitPoint  0.816       0.005  ns/op
             32  ..maskedIndex          0.838       0.003  ns/op
             32  ..maskedWithExitPoint  0.978       0.002  ns/op

              -  ..normalIndex          0.830       0.002  ns/op
              -  ..normalWithExitPoint  0.683       0.002  ns/op
              -  ..maskedIndex          0.791       0.005  ns/op
              -  ..maskedWithExitPoint  0.908       0.003  ns/op

A limit of 16 makes normalIndex the fastest variant, which suggests that I was right about the "overallocation penalty". But according to Marko, the generated assembly also changes in other respects with the unroll limit, so things are more complicated.
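When comparing runs like these, it can help to have the benchmark process record which unroll limit it was actually started with. The following is a small sketch using only the standard library; the class and helper names are hypothetical, and it only sees flags passed on the command line as -XX:LoopUnrollLimit=<n>, not the JVM's built-in default. Taking the last occurrence assumes that a later flag overrides an earlier one.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper; not part of the original benchmark.
final class UnrollLimitCheck {

    private static final String PREFIX = "-XX:LoopUnrollLimit=";

    // Returns the value of the last well-formed -XX:LoopUnrollLimit=<n>
    // argument, or an empty OptionalInt if none was passed.
    static OptionalInt findUnrollLimit(List<String> jvmArgs) {
        OptionalInt found = OptionalInt.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                try {
                    int value = Integer.parseInt(arg.substring(PREFIX.length()));
                    found = OptionalInt.of(value);
                } catch (NumberFormatException e) {
                    // Malformed value: ignore it and keep any earlier match.
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with (excluding main's args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        OptionalInt limit = findUnrollLimit(jvmArgs);
        if (limit.isPresent()) {
            System.out.println("LoopUnrollLimit=" + limit.getAsInt());
        } else {
            System.out.println("LoopUnrollLimit not set on the command line");
        }
    }
}
```

Started as `java -XX:LoopUnrollLimit=16 UnrollLimitCheck.java`, this would report the limit of 16; note that the flag value is given with `=`.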
