Question
I need a simple ZeroMemory implementation using SSE (SSE2 preferred). Can someone help with that? I searched through SO and the web but did not find a direct answer.
Answer 1:
Is ZeroMemory() or memset() not good enough?
Disclaimer: Some of the following may be SSE3.
- Fill any unaligned leading bytes by looping until the address is a multiple of 16.
- push to save an xmm reg.
- pxor to zero the xmm reg.
- While the remaining length >= 16, movdqa or movntdq to do the write.
- pop to restore the xmm reg.
- Fill any unaligned trailing bytes.
movntdq may appear to be faster because it tells the processor not to bring the data into your cache, but this can cause a performance penalty later if the data is going to be used. It may be more appropriate if you are scrubbing memory before freeing it (as you might do with SecureZeroMemory()).
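
The steps above can be sketched in C with SSE2 intrinsics instead of hand-written assembly. This is a minimal sketch, not the answerer's code: the function name zero_memory_sse2 and the non_temporal flag are made up for illustration. The push/pop steps have no counterpart here because the compiler allocates and preserves the xmm register itself.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Windows API): zero n bytes starting at dst.
 * non_temporal != 0 selects movntdq-style streaming stores. */
static void zero_memory_sse2(void *dst, size_t n, int non_temporal)
{
    unsigned char *p = (unsigned char *)dst;

    /* Fill unaligned leading bytes until p is a multiple of 16. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        n--;
    }

    /* Equivalent of "pxor xmm0, xmm0": an all-zero 128-bit value. */
    const __m128i zero = _mm_setzero_si128();

    /* Write 16 bytes at a time while at least 16 bytes remain. */
    if (non_temporal) {
        for (; n >= 16; p += 16, n -= 16)
            _mm_stream_si128((__m128i *)p, zero);  /* movntdq */
        _mm_sfence();  /* order the streaming stores before later stores */
    } else {
        for (; n >= 16; p += 16, n -= 16)
            _mm_store_si128((__m128i *)p, zero);   /* movdqa */
    }

    /* Fill unaligned trailing bytes. */
    while (n > 0) {
        *p++ = 0;
        n--;
    }
}
```

A call such as zero_memory_sse2(buf, len, 0) uses ordinary cached stores; passing 1 as the last argument switches to the cache-bypassing variant discussed above.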
Answer 2:
If you want to speed up your code, then you must understand exactly how your CPU works and where the bottleneck is.
Here is my speed-optimized routine, just to show how it should be done.
On my PC it is about 5 times faster (clearing a 1 MB memory block) than yours; test it and ask if something isn't clear:
//edx = memory pointer must be 16 bytes aligned
//ecx = memory count must be multiple of 16
xorps xmm0, xmm0 //Clear xmm0
mov eax, ecx //Save ecx to eax
and ecx, 0FFFFFF80h //Round count down to whole 128-byte blocks
jz @ClearRest //Less than 128 bytes to clear
@Aligned128BMove:
movdqa [edx], xmm0 //Clear first 16 bytes of 128 bytes
movdqa [edx + 10h], xmm0 //Clear second 16 bytes of 128 bytes
movdqa [edx + 20h], xmm0 //...
movdqa [edx + 30h], xmm0
movdqa [edx + 40h], xmm0
movdqa [edx + 50h], xmm0
movdqa [edx + 60h], xmm0
movdqa [edx + 70h], xmm0
add edx, 128 //inc mem pointer
sub ecx, 128 //dec counter
jnz @Aligned128BMove
@ClearRest:
and eax, 07Fh //Clear the rest
jz @Exit
@LoopRest:
movdqa [edx], xmm0
add edx, 16
sub eax, 16
jnz @LoopRest
@Exit:
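
For readers who would rather not use inline assembly, the same unrolled loop can be approximated with SSE2 intrinsics in C. This is only a sketch under the same preconditions as the routine above (pointer aligned to 16 bytes, count a multiple of 16); the function name zero_aligned_unrolled is hypothetical, and the compiler is free to schedule the stores differently from the hand-written version.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical name. Preconditions, as in the assembly routine:
 * dst is 16-byte aligned and n is a multiple of 16. */
static void zero_aligned_unrolled(void *dst, size_t n)
{
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();  /* xorps xmm0, xmm0 */
    size_t blocks = n / 128;                   /* whole 128-byte blocks */
    size_t rest = (n % 128) / 16;              /* leftover 16-byte chunks */

    /* Eight aligned 16-byte stores per iteration, as in @Aligned128BMove. */
    while (blocks--) {
        _mm_store_si128(p + 0, zero);
        _mm_store_si128(p + 1, zero);
        _mm_store_si128(p + 2, zero);
        _mm_store_si128(p + 3, zero);
        _mm_store_si128(p + 4, zero);
        _mm_store_si128(p + 5, zero);
        _mm_store_si128(p + 6, zero);
        _mm_store_si128(p + 7, zero);
        p += 8;
    }

    /* Remaining chunks, one store each, as in @LoopRest. */
    while (rest--) {
        _mm_store_si128(p++, zero);
    }
}
```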
Answer 3:
Almost all of the transistors in your CPU are used to somehow make memory access as fast as possible. The CPU is already doing an amazing job at all memory accesses, and the instructions run at a drastically faster rate than possible memory accesses.
Therefore, trying to beat memset is futile in most cases, because memset is already limited by the speed of your memory (as mentioned by others).
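
One way to check this claim on your own machine is a small timing harness. The sketch below is an assumption-laden illustration, not a rigorous benchmark: the names fill_memset, fill_sse2 and seconds_per_fill are invented here, clock() has coarse resolution, and results depend heavily on buffer size, cache state and compiler flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*fill_fn)(void *, size_t);

/* Baseline: the library routine. */
static void fill_memset(void *dst, size_t n)
{
    memset(dst, 0, n);
}

/* Plain SSE2 loop; assumes dst is 16-byte aligned and n is a multiple of 16. */
static void fill_sse2(void *dst, size_t n)
{
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(p + i, zero);
}

/* Average CPU seconds per call over reps calls (reps > 0, n > 0). */
static double seconds_per_fill(fill_fn fill, void *buf, size_t n, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        fill(buf, n);
        /* Read one byte back so the fill cannot be optimized away. */
        volatile unsigned char sink = ((unsigned char *)buf)[n - 1];
        (void)sink;
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling seconds_per_fill(fill_memset, buf, size, reps) and seconds_per_fill(fill_sse2, buf, size, reps) on the same 16-byte-aligned buffer gives two numbers to compare; for buffers much larger than the cache, both are expected to be bounded by memory bandwidth.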
Source: https://stackoverflow.com/questions/12786893/zeromemory-in-sse