How to optimise double dereferencing?

爷,独闯天下 submitted on 2019-12-24 00:34:02

Question


Very specific optimisation task. I have 3 arrays:

  • const char* inputTape
  • const int* inputOffset, organised in groups of four
  • char* outputTape

from which I must assemble the output tape, according to the following 5 operations:

int selectorOffset = inputOffset[4*i];
char selectorValue = inputTape[selectorOffset];
int outputOffset = inputOffset[4*i+1+selectorValue];
char outputValue = inputTape[outputOffset];
outputTape[i] = outputValue; // store byte

and then advance the counter i.

All iterations are identical and independent, so they could all be done in parallel. The format of inputOffset may be changed, as long as the same input still produces the same output.
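To illustrate what a change of the inputOffset format could look like, here is a minimal sketch in portable C. It assumes that each group of four holds one selector offset followed by three candidate output offsets, so that selectorValue is always 0, 1 or 2 (my reading of the layout, not something verified for every input). The planar layout and the names `assemble_interleaved`, `to_planar` and `assemble_planar` are hypothetical, invented for this example.

```c
#include <stddef.h>

/* Reference: the interleaved layout, four ints per output byte. */
void assemble_interleaved(const char *inputTape, const int *inputOffset,
                          char *outputTape, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        int selectorOffset = inputOffset[4 * i];
        unsigned char selectorValue = (unsigned char)inputTape[selectorOffset];
        int outputOffset = inputOffset[4 * i + 1 + selectorValue];
        outputTape[i] = inputTape[outputOffset];
    }
}

/* Hypothetical planar layout: plane 0 holds every selector offset,
   planes 1..3 hold the candidate output offsets for selector 0..2.
   `planes` must have room for 4 * count ints. */
void to_planar(const int *inputOffset, int *planes, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        for (size_t j = 0; j < 4; ++j)
            planes[j * count + i] = inputOffset[4 * i + j];
}

/* Same result as assemble_interleaved, ASSUMING selector values are 0..2.
   Every offset load is now contiguous; only the tape lookups are indexed. */
void assemble_planar(const char *inputTape, const int *planes,
                     char *outputTape, size_t count)
{
    const int *sel = planes;
    const int *c0 = planes + count;
    const int *c1 = planes + 2 * count;
    const int *c2 = planes + 3 * count;
    for (size_t i = 0; i < count; ++i) {
        unsigned char s = (unsigned char)inputTape[sel[i]];
        int off = (s == 0) ? c0[i] : (s == 1) ? c1[i] : c2[i];
        outputTape[i] = inputTape[off];
    }
}
```

With this layout the offsets for consecutive iterations sit next to each other in memory, which may let a compiler or hand-written SIMD code use plain vector loads plus a select for the offsets. The two byte lookups into inputTape stay data-dependent either way, so whether this helps at all would have to be measured.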

OpenCL on a GPU does not help with this algorithm (it runs at the same speed as the CPU, or even slower).

In assembly, the best I got is 5 mov, 1 lea and 1 dec instructions. Upd: thanks to Peter Cordes for the little hint.

loop_start:
mov         eax,dword ptr [rdx-10h]             ; selector offset
movzx       r10d,byte ptr [rax+r8]          ; selector value
mov         eax,dword ptr [rdx+r10*4-0Ch]       ; output offset
movzx       r10d,byte ptr [r8+rax]          ; output value
mov         byte ptr [r9+rcx-1],r10b            ; store to outputTape
lea         rdx, [rdx-10h]                  ; pointer to inputOffset for current 
dec         ecx                             ; loop counter, sets zero flag if (ecx == 0)
jne         loop_start                      ; continue looping while non zero iterations left: ( ecx != 0 )

How could I optimise this for SSE/AVX? I am stumped...
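For reference, one direction that might be worth measuring is AVX2 gathers, handling 8 output bytes per iteration. The sketch below is not a tuned implementation and rests on several assumptions: GCC or Clang on x86, all offsets non-negative, selector values small enough to stay inside the group, and at least 3 readable padding bytes after the last byte of inputTape that is ever addressed (each gather loads 4 bytes and the upper 3 are masked off). The function names are hypothetical.

```c
#include <stddef.h>

/* Scalar path: the five operations from the question, for i in [first, count). */
static void assemble_scalar(const char *inputTape, const int *inputOffset,
                            char *outputTape, size_t first, size_t count)
{
    for (size_t i = first; i < count; ++i) {
        int selectorOffset = inputOffset[4 * i];
        unsigned char selectorValue = (unsigned char)inputTape[selectorOffset];
        int outputOffset = inputOffset[4 * i + 1 + selectorValue];
        outputTape[i] = inputTape[outputOffset];
    }
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>

/* Processes full blocks of 8 and returns how many outputs were written.
   ASSUMES inputTape has 3 bytes of readable padding (4-byte gathers). */
__attribute__((target("avx2")))
static size_t assemble_avx2(const char *inputTape, const int *inputOffset,
                            char *outputTape, size_t count)
{
    const __m256i lane4 = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    const __m256i one = _mm256_set1_epi32(1);
    const __m256i byteMask = _mm256_set1_epi32(0xFF);
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        const int *group = inputOffset + 4 * i;
        /* inputOffset[4*i] for 8 consecutive i */
        __m256i selOff = _mm256_i32gather_epi32(group, lane4, 4);
        /* inputTape[selectorOffset], low byte of each 32-bit load */
        __m256i sel = _mm256_and_si256(
            _mm256_i32gather_epi32((const int *)inputTape, selOff, 1), byteMask);
        /* inputOffset[4*i + 1 + selectorValue] */
        __m256i idx = _mm256_add_epi32(_mm256_add_epi32(lane4, one), sel);
        __m256i outOff = _mm256_i32gather_epi32(group, idx, 4);
        /* inputTape[outputOffset] */
        __m256i val = _mm256_and_si256(
            _mm256_i32gather_epi32((const int *)inputTape, outOff, 1), byteMask);
        /* narrow 8 x int32 (each 0..255) to 8 bytes and store them */
        __m128i w = _mm_packus_epi32(_mm256_castsi256_si128(val),
                                     _mm256_extracti128_si256(val, 1));
        _mm_storel_epi64((__m128i *)(outputTape + i), _mm_packus_epi16(w, w));
    }
    return i;
}
#endif

/* Uses the AVX2 path when the CPU reports support, scalar code for the rest. */
void assemble_tape(const char *inputTape, const int *inputOffset,
                   char *outputTape, size_t count)
{
    size_t done = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2"))
        done = assemble_avx2(inputTape, inputOffset, outputTape, count);
#endif
    assemble_scalar(inputTape, inputOffset, outputTape, done, count);
}
```

Each block of 8 outputs costs four gathers, and gather throughput varies a lot between microarchitectures, so this may well be no faster than the scalar loop. The narrowing step relies on the values having been masked to 0..255, so the saturating packs leave them unchanged.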

UPD: better to see it than to hear it..

Source: https://stackoverflow.com/questions/48372852/how-to-optimise-double-dereferencing
