Fully optimized memcpy/memmove for Core 2 or Core i7 architecture?

风流意气都作罢 提交于 2019-11-30 12:43:59

If you specify /ARCH:SSE2 to MSVC it should provide you with a tuned memcpy (at least, mine does).

Failing that, use the SSE aligned load/store intrinsics yourself to copy the memory in large chunks, employing a Duff's Device of word reads where necessary to deal with the head and tail of data to get it to an aligned boundary. You'll need to use the cache management intrinsics as well to get good performance.

Your limiting factor is probably cache misses and southbridge bandwidth, rather than CPU cycles. Given that there's always going to be lots of other traffic on the memory bus, I'm usually happy to get to about 90% of theoretical memory bandwidth throughput in such operations.

When measuring bandwidth did you take into account memcpy was both a read and a write, so 3 GB/s of memory copied is actually 6 GB/s of bandwidth?

Remember, the bandwidth is theoretical maximum - real world use will be much lower. For instance, one page fault and your bandwidth will drop to MB/s.

memcpy/memmove are compiler intrinsics and will usually be inlined to rep movsd (or the appropriate SSE instructions if your compiler can target that). It may be impossible to improve the codegen over this, since modern CPU's will handle rep instructions like this very, very well.

You could write your own. Try using the intel optimising compiler to directly target the architecture?

Intel also produce something called VTune (compiler and language independent) for optimising applications.

Here's an article on optimising a game engine.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!