I done computing intensive app using OpenCV
for iOS
. Of course it was slow. But it was something like 200 times slower than my PC prototype. So I w
I provide some feedback to previous posts. This explains some idea I tried to provide about dead code in point 7. This was meant to be slightly wider idea. I need formatting, so no comment form can be used. Such code was in OpenCV:
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
vec[kk] = 0;
}
I wanted to see how it looks on assembly. To make sure I can find it in assembly, I wrapped it like this:
__asm__("#start");
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
vec[kk] = 0;
}
__asm__("#stop");
Now I press "Product -> Generate Output -> Assembly file" and what I get is:
@ InlineAsm Start
#start
@ InlineAsm End
Ltmp1915:
ldr r0, [sp, #84]
movs r1, #0
ldr r0, [r0, #16]
ldr r0, [r0, #28]
cmp r0, #4
mov r0, r4
blo LBB14_71
LBB14_70:
Ltmp1916:
ldr r3, [sp, #84]
movs r2, #0
Ltmp1917:
str r2, [r0], #4
adds r1, #1
Ltmp1918:
Ltmp1919:
ldr r2, [r3, #16]
ldr r2, [r2, #28]
lsrs r2, r2, #2
cmp r2, r1
bgt LBB14_70
LBB14_71:
Ltmp1920:
add.w r0, r4, #8
@ InlineAsm Start
#stop
@ InlineAsm End
A lot of code. I printf-d out value of (int)(descriptors->elem_size/sizeof(vec[0]))
and it was always 64. So I hardcoded it to be 64 and passed again via assembler:
@ InlineAsm Start
#start
@ InlineAsm End
Ltmp1915:
vldr.32 s16, LCPI14_7
mov r0, r4
movs r1, #0
mov.w r2, #256
blx _memset
@ InlineAsm Start
#stop
@ InlineAsm End
As you might see now optimizer got the idea and code became much shorter. It was able to vectorize this. Point is that compiler always does not know what inputs are constants if this is something like webcam camera size or pixel depth but in reality in my contexts they are usually constant and all I care about is speed.
I also tried Accelerate as suggested replacing three lines with:
__asm__("#start");
vDSP_vclr(vec,1,64);
__asm__("#stop");
Assembly now looks:
@ InlineAsm Start
#start
@ InlineAsm End
Ltmp1917:
str r1, [r7, #-140]
Ltmp1459:
Ltmp1918:
movs r1, #1
movs r2, #64
blx _vDSP_vclr
Ltmp1460:
Ltmp1919:
add.w r0, r4, #8
@ InlineAsm Start
#stop
@ InlineAsm End
Unsure if this is faster than bzero though. In my context this part does not time much time and two variants seemed to work at same speed.
One more thing I learned is using GPU. More about it here http://www.sunsetlakesoftware.com/2012/02/12/introducing-gpuimage-framework