memcpy performance differences between 32 and 64 bit processes

后端 未结 7 1875
梦如初夏
梦如初夏 2020-12-14 23:43

We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy

相关标签:
7条回答
  • 2020-12-15 00:28

    Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.

    My quess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; my guess is that the faster library routine was written using SSE non-temporal stores.

    If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written making a one-way trip to the memory instead of a round trip.

    I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...

    Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.

    0 讨论(0)
提交回复
热议问题