For context, the question describes Core2 machines (Dell T5400) running XP64, where memcpy in a 32-bit process achieves on the order of 1.2 GByte/s while memcpy in a 64-bit process performs differently.
Thanks for the positive feedback! I think I can partly explain what's going on here.
Using non-temporal stores for memcpy is definitely the fastest if you're only timing the memcpy call.
On the other hand, if you're benchmarking an application, the movdqa stores have the benefit that they leave the destination memory in cache. Or at least the part of it that fits into cache.
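To make the distinction concrete, here's a minimal sketch of the two store flavors using SSE2 intrinsics (illustrative only, not the actual library code; it assumes 16-byte-aligned pointers and a size that's a multiple of 16). _mm_store_si128 compiles to a movdqa store, _mm_stream_si128 to a movntdq store:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Cached stores: movdqa leaves the destination in cache. */
    void copy_cached(void *dst, const void *src, size_t size)
    {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < size / 16; i++)
            _mm_store_si128(d + i, _mm_load_si128(s + i));
    }

    /* Non-temporal stores: movntdq bypasses the cache on the way out. */
    void copy_nontemporal(void *dst, const void *src, size_t size)
    {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < size / 16; i++)
            _mm_stream_si128(d + i, _mm_load_si128(s + i));
        _mm_sfence();  /* make the streaming stores globally visible */
    }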
So if you're designing a runtime library and you can assume that the application calling memcpy is going to use the destination buffer immediately after the call, then you'll want to provide the movdqa version. It effectively optimizes out the trip from memory back into the CPU that would follow the movntdq version, so all of the instructions following the call run faster.
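As a toy illustration of that caller pattern (the function name is made up, not something from the original post): the loop re-reads the bytes memcpy just wrote, so with movdqa stores those reads are cache hits instead of a second trip to DRAM:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy, then immediately consume the destination buffer. */
    uint32_t copy_and_sum(uint8_t *dst, const uint8_t *src, size_t n)
    {
        uint32_t sum = 0;
        memcpy(dst, src, n);
        for (size_t i = 0; i < n; i++)
            sum += dst[i];  /* hits cache if memcpy used cached stores */
        return sum;
    }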
If the destination buffer is large compared to the processor's cache, though, that optimization doesn't work, and the movntdq version gives you faster application benchmarks.
So an ideal memcpy would have multiple versions under the hood: when the destination buffer is small compared to the processor's cache, use movdqa; when it's large, use movntdq. It sounds like this is what's happening in the 32-bit library.
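In sketch form, using the two helpers from above, the dispatch could look like this (memcpy_dispatch and the threshold are made-up names and numbers; a real library would pick the cutoff from the CPU's actual cache size, e.g. via CPUID):

    #include <stddef.h>

    void copy_cached(void *dst, const void *src, size_t size);       /* movdqa path */
    void copy_nontemporal(void *dst, const void *src, size_t size);  /* movntdq path */

    /* Hypothetical cutoff: a few MB, roughly a Core2-class L2 cache. */
    #define CACHE_SIZE_GUESS (4u * 1024u * 1024u)

    void *memcpy_dispatch(void *dst, const void *src, size_t size)
    {
        if (size < CACHE_SIZE_GUESS / 2)
            copy_cached(dst, src, size);       /* small: leave dst hot in cache */
        else
            copy_nontemporal(dst, src, size);  /* large: don't pollute the cache */
        return dst;
    }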
Of course, none of this has anything to do with the differences between 32-bit and 64-bit.
My conjecture is that the 64-bit library just isn't as mature. The developers haven't gotten around to providing both routines in that version of the library yet.