I have a pointer to an array of bytes mixed that contains the interleaved bytes of two distinct arrays array1 and array2. Say mi
Premature optimisation is bad, and your compiler is probably better at optimising than you are.
That said, there are things you can do to help out the compiler because you have semantic knowledge of your data that a compiler cannot have:
- read and write as many bytes as you can, up to the native word size - memory operations are expensive, so do manipulations in registers where possible;
- unroll loops - look into "Duff's Device" (a sketch follows below).
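For example, here is a minimal sketch of Duff's Device applied to a plain byte copy (the name copy_unrolled and its arguments are mine, purely for illustration - it is not part of the de-interleaving code below):

void copy_unrolled(unsigned char *dst, const unsigned char *src, int n)
{
    /* Duff's Device: the switch jumps into the middle of the unrolled
       loop body to handle the n % 8 leftover bytes, then the do/while
       runs the remaining full groups of 8. */
    int groups = (n + 7) / 8;

    if (n <= 0)
        return;

    switch (n % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--groups > 0);
    }
}

Whether hand-unrolling like this actually beats what -O3 generates is something to measure, not assume.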
FWIW, I produced two versions of your copy loop, one much the same as yours, the second using what most would consider "optimal" (albeit still simple) C code:
typedef unsigned char  byte;    /* assumed - not standard C types */
typedef unsigned short ushort;

void test1(byte *p, byte *p1, byte *p2, int n)
{
    int i, j;

    for (i = 0, j = 0; i < n / 2; i++, j += 2) {
        p1[i] = p[j];
        p2[i] = p[j + 1];
    }
}
void test2(byte *p, byte *p1, byte *p2, int n)
{
    while (n) {
        *p1++ = *p++;
        *p2++ = *p++;
        n--; n--;
    }
}
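For what it's worth, a small harness along these lines is enough to check that the two versions agree (main, the sample data and the buffer sizes are my own additions, not part of the original test):

#include <stdio.h>
#include <string.h>

typedef unsigned char byte;

void test1(byte *p, byte *p1, byte *p2, int n);   /* defined in the snippet above */
void test2(byte *p, byte *p1, byte *p2, int n);

int main(void)
{
    byte mixed[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };   /* interleaved input */
    byte a1[4], b1[4], a2[4], b2[4];

    test1(mixed, a1, b1, sizeof mixed);
    test2(mixed, a2, b2, sizeof mixed);

    /* both should split the input into {1,3,5,7} and {2,4,6,8} */
    printf("%s\n", (memcmp(a1, a2, sizeof a1) == 0 &&
                    memcmp(b1, b2, sizeof b1) == 0) ? "match" : "mismatch");
    return 0;
}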
With gcc -O3 -S on Intel x86 they both produced almost identical assembly code. Here are the inner loops:
LBB1_2:
        movb    -1(%rdi), %al
        movb    %al, (%rsi)
        movb    (%rdi), %al
        movb    %al, (%rdx)
        incq    %rsi
        addq    $2, %rdi
        incq    %rdx
        decq    %rcx
        jne     LBB1_2
and
LBB2_2:
        movb    -1(%rdi), %al
        movb    %al, (%rsi)
        movb    (%rdi), %al
        movb    %al, (%rdx)
        incq    %rsi
        addq    $2, %rdi
        incq    %rdx
        addl    $-2, %ecx
        jne     LBB2_2
Both have the same number of instructions; the only difference is that the first version counts up to n / 2 while the second counts down to zero.
EDIT: here's a better version:
/* non-portable - assumes little endian */
void test3(byte *p, byte *p1, byte *p2, int n)
{
    ushort *ps = (ushort *)p;

    n /= 2;
    while (n--) {
        ushort w = *ps++;       /* read one byte pair in a single 16-bit load */
        *p1++ = (byte)w;        /* low byte  -> first array  */
        *p2++ = (byte)(w >> 8); /* high byte -> second array */
    }
}
resulting in:
LBB3_2:
        movzwl  (%rdi), %ecx
        movb    %cl, (%rsi)
        movb    %ch, (%rdx) # NOREX
        addq    $2, %rdi
        incq    %rsi
        incq    %rdx
        decq    %rax
        jne     LBB3_2
which is one instruction fewer, because it takes advantage of direct access to the %cl and %ch sub-registers of %ecx.
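If you wanted to push the word-size idea further, something along these lines reads 32 bits at a time. This is my own sketch, not tested or benchmarked like the versions above; it keeps the little-endian assumption, and additionally assumes n is a multiple of 4 and that p is suitably aligned for a 32-bit load (the same strict-aliasing caveat as test3 applies):

#include <stdint.h>

typedef unsigned char byte;

void test4(byte *p, byte *p1, byte *p2, int n)
{
    uint32_t *pw = (uint32_t *)p;

    n /= 4;                                /* each 32-bit word holds two byte pairs */
    while (n--) {
        uint32_t w = *pw++;
        *p1++ = (byte)(w & 0xff);          /* byte 0 -> first array  */
        *p2++ = (byte)((w >> 8) & 0xff);   /* byte 1 -> second array */
        *p1++ = (byte)((w >> 16) & 0xff);  /* byte 2 -> first array  */
        *p2++ = (byte)(w >> 24);           /* byte 3 -> second array */
    }
}

Whether this actually beats the 16-bit version is again something to measure rather than assume.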