It's such a trivial thing, but when I first saw this code (written by a fellow developer) I was shocked, because it is something I would never have thought of (comments added by me):
cglobal x264_sub8x8_dct_sse2, 3,3 ;3,3 means 3 arguments and 3 registers used
.skip_prologue:
    call .8x4
    add r0, 64 ;increment pointers
    add r1, 4*FENC_STRIDE
    add r2, 4*FDEC_STRIDE
.8x4:
    SUB_DCT4 2x4x4W ;this macro does the actual transform
    movhps [r0+32], m0 ;store second half of output data
    movhps [r0+40], m1 ;the rest is done in the macro
    movhps [r0+48], m2
    movhps [r0+56], m3
    ret
It handles an 8x8 block, which is 4 transforms, one 8x4 half at a time. But it doesn't paste the code twice (that would waste code size), it doesn't have a separate 8x4 function that it calls twice, and it doesn't have a loop. Instead, it calls the "function" once, increments the pointers when that call returns, and then "falls" right into the same code to do it again. The second time around, the ret returns to the original caller.
It gets the best of all worlds: no function call overhead beyond the original call (since SUB_DCT4 doesn't modify the pointers r0, r1, and r2), no code duplication, and no loop overhead.
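
If the way the single ret does double duty isn't obvious, here is a toy model of just the control flow, written in Python. It is not x86 and it does no real transform: the four-instruction "program", the `run` helper, and the `"caller"` marker are all made up for illustration, and only r0 is tracked. The point is only to show which return address each ret pops.

```python
# Toy model of the call-then-fall-through control flow.
# Hypothetical mini-interpreter; it does not emulate real x86 or the DCT.

def run(program, labels):
    """Execute (op, arg) tuples and return the r0 value seen by each 'work' step."""
    r0 = 0
    stack = ["caller"]   # return address pushed by whoever called the function
    trace = []
    pc = 0
    while True:
        op, arg = program[pc]
        if op == "call":
            stack.append(pc + 1)   # remember the instruction after the call
            pc = labels[arg]
        elif op == "add":
            r0 += arg
            pc += 1
        elif op == "work":
            trace.append(r0)       # stands in for SUB_DCT4 plus the stores
            pc += 1
        elif op == "ret":
            target = stack.pop()
            if target == "caller":
                return trace       # second ret: back to the original caller
            pc = target            # first ret: back to the pointer increment

PROGRAM = [
    ("call", ".8x4"),  # 0: run the body once, then come back to index 1
    ("add", 64),       # 1: increment the output pointer
    ("work", None),    # 2: .8x4 body
    ("ret", None),     # 3: pops index 1 the first time, "caller" the second
]
LABELS = {".8x4": 2}

trace = run(PROGRAM, LABELS)
print(trace)  # [0, 64]
```

The body runs twice, once at offset 0 and once at offset 64, with one call and two rets executed in total.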