发表新帖

发表新帖

memcpy performance differences between 32 and 64 bit processes

后端未结

关注

 7  1885

We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy

相关标签:

7条回答

無奈伤痛

2020-12-15 00:28

Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.

My quess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; my guess is that the faster library routine was written using SSE non-temporal stores.

If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written making a one-way trip to the memory instead of a round trip.

I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...

Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.

0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2

热议问题