We have Core2 machines (Dell T5400) with XP64.
We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy
Here's an example of a memcpy routine geared specifically for 64 bit architecture.
void uint8copy(void *dest, void *src, size_t n){
uint64_t * ss = (uint64_t)src;
uint64_t * dd = (uint64_t)dest;
n = n * sizeof(uint8_t)/sizeof(uint64_t);
while(n--)
*dd++ = *ss++;
}//end uint8copy()
The full article is here: http://www.godlikemouse.com/2008/03/04/optimizing-memcpy-routines/