How to increase performance of memcpy

Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full d

8 answers

粉色の甜心

2020-12-04 07:54

First of all, check that the memory is aligned on a 16-byte boundary; otherwise you pay a penalty. This is the most important thing.
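A minimal sketch of how that check could look in C. The helper names `is_aligned` and `align_up` are made up for illustration. If your toolchain offers an aligned allocator (C11 `aligned_alloc`, or `_aligned_malloc` on MSVC), that is usually simpler than aligning a pointer by hand inside an oversized buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is aligned to `alignment` bytes.
 * `alignment` must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: rounds p up to the next `alignment`-byte boundary.
 * The caller must have over-allocated by at least alignment - 1 bytes. */
static void *align_up(void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uintptr_t addr = (uintptr_t)p;
    addr = (addr + (alignment - 1)) & ~(uintptr_t)(alignment - 1);
    return (void *)addr;
}
```

Checking both the source and the destination with `is_aligned(ptr, 16)` before the copy tells you whether you are on the fast path at all.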

If you don't need a standard-compliant solution, you could check whether things improve with a compiler-specific extension such as memcpy64 (check your compiler's documentation to see whether something like it is available). The fact is that memcpy must be able to handle single-byte copies, but moving 4 or 8 bytes at a time is much faster if you don't have this restriction.

Also, is writing inline assembly an option for you?

青春惊慌失措

2020-12-04 07:54

You can write a better implementation of memcpy using SSE2 registers. The version in VC2010 does this already, so the real question is whether you are handing it aligned memory.

Maybe you can do better than the VC2010 version, but that takes some understanding of how to do it.
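For illustration, here is a rough sketch of what an SSE2-based copy loop can look like with compiler intrinsics instead of assembly. This is not the VC2010 implementation; the function name `copy_aligned_sse2` is made up, and the code assumes both pointers are 16-byte aligned and the buffers do not overlap. On targets without SSE2 it simply falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: copies n bytes, 16 bytes per iteration.
 * Assumptions: dst and src are 16-byte aligned and do not overlap. */
static void copy_aligned_sse2(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0 && ((uintptr_t)src & 15u) == 0);
#if HAVE_SSE2
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; ++i) {
        /* Aligned 128-bit load followed by an aligned 128-bit store. */
        _mm_store_si128(d + i, _mm_load_si128(s + i));
    }
    /* Copy the tail (fewer than 16 bytes) the ordinary way. */
    size_t done = blocks * 16;
    memcpy((char *)dst + done, (const char *)src + done, n - done);
#else
    memcpy(dst, src, n); /* no SSE2 available: plain fallback */
#endif
}
```

Whether this beats the library memcpy depends on the compiler and CPU, so measure it against your real buffer sizes before committing to it.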

PS: You can pass the buffer to the user mode program in an inverted call, to prevent the copy altogether.

后悔当初

2020-12-04 07:55

One thing to be aware of is that your process (and hence the performance of memcpy()) is affected by the OS scheduling of tasks - it's hard to say how much of a factor this is in your timings, but it is difficult to control. The device DMA operation isn't subject to this, since it isn't running on the CPU once it's kicked off. Since your application is an actual real-time application, though, you might want to experiment with Windows' process/thread priority settings if you haven't already. Just keep in mind that you have to be careful about this, because it can have a really negative impact on other processes (and on the user experience on the machine).

Another thing to keep in mind is that the OS memory virtualization might have an impact here - if the memory pages you're copying to aren't actually backed by physical RAM pages, the memcpy() operation will fault to the OS to get that physical backing in place. Your DMA pages are likely to be locked into physical memory (since they have to be for the DMA operation), so the source memory to memcpy() is likely not an issue in this regard. You might consider using the Win32 VirtualAlloc() API to ensure that your destination memory for the memcpy() is committed (I think VirtualAlloc() is the right API for this, but there might be a better one that I'm forgetting - it's been a while since I've had a need to do anything like this).
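One way to take page faults out of the timed copy is to touch the destination once beforehand. A portable sketch, assuming a 4096-byte page size (query the real value from the OS in production code); the name `prefault` is made up. As far as I know, even memory committed with VirtualAlloc() is normally only given physical pages on first access, so a touch pass like this can still matter on Windows.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages. Ask the OS for the real page size in real code. */
#define ASSUMED_PAGE_SIZE 4096u

/* Hypothetical helper: writes one zero byte into every page of buf so the
 * OS backs each page with physical memory before the timed copy starts.
 * This overwrites data, so only use it on a destination buffer that is
 * about to be filled anyway. */
static void prefault(void *buf, size_t size)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;
    for (size_t off = 0; off < size; off += ASSUMED_PAGE_SIZE) {
        p[off] = 0;
    }
    /* buf need not start on a page boundary, so the last byte may sit in
     * a page that the strided loop above did not reach. */
    if (size != 0) {
        p[size - 1] = 0;
    }
}
```

Call `prefault(dst, size)` once after allocating the destination, outside the timed section, and then compare the memcpy() timings with and without it.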

Finally, see if you can use the technique explained by Skizz to avoid the memcpy() altogether - that's your best bet if resources permit.

深忆病人

2020-12-04 07:55

Perhaps you can explain some more about how you're processing the larger memory area?

Would it be possible within your application to simply pass ownership of the buffer, rather than copy it? This would eliminate the problem altogether.

Or are you using memcpy for more than just copying? Perhaps you're using the larger area of memory to build a sequential stream of data from what you've captured? Especially if you're processing one character at a time, you may be able to meet halfway. For example, it may be possible to adapt your processing code to accept a stream represented as ‘an array of buffers’, rather than ‘a continuous memory area’.
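To make the ‘array of buffers’ idea concrete, here is a small sketch. The names `struct chunk`, `byte_fn`, `for_each_byte` and `add_byte` are all invented for this example; your capture code will have its own buffer descriptors.

```c
#include <stddef.h>

/* Hypothetical descriptor for one captured buffer. */
struct chunk {
    const unsigned char *data;
    size_t len;
};

/* Hypothetical per-byte callback type. */
typedef void (*byte_fn)(unsigned char byte, void *ctx);

/* Visits every byte of every chunk in order, as if the chunks formed one
 * contiguous stream, without copying them together first.
 * Returns the total number of bytes visited. */
static size_t for_each_byte(const struct chunk *chunks, size_t n_chunks,
                            byte_fn fn, void *ctx)
{
    size_t total = 0;
    for (size_t c = 0; c < n_chunks; ++c) {
        for (size_t i = 0; i < chunks[c].len; ++i) {
            fn(chunks[c].data[i], ctx);
        }
        total += chunks[c].len;
    }
    return total;
}

/* Example callback: accumulates a simple byte sum into *ctx. */
static void add_byte(unsigned char byte, void *ctx)
{
    *(unsigned long *)ctx += byte;
}
```

The processing code only ever sees one byte at a time, so it does not care that the data lives in several separate capture buffers.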

伪装坚强ぢ

2020-12-04 07:57

One source I would recommend you read is MPlayer's fast_memcpy function. Also consider the expected usage patterns, and note that modern CPUs have special store instructions that let you tell the CPU whether or not you will need to read back the data you're writing. Using the instructions that indicate you won't be reading the data back (so it doesn't need to be cached) can be a huge win for large memcpy operations.
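On x86 those store instructions are exposed as the non-temporal (streaming) store intrinsics. A rough sketch, not MPlayer's code: the name `copy_streaming` is made up, the destination is assumed to be 16-byte aligned, and the buffers must not overlap. Without SSE2 it falls back to plain `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> /* SSE2 intrinsics, including streaming stores */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: copies n bytes using non-temporal stores, which
 * hint to the CPU that the destination will not be read back soon.
 * Assumption: dst is 16-byte aligned; src may be unaligned. */
static void copy_streaming(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);
#if HAVE_SSE2
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; ++i) {
        /* Unaligned load, then a store that bypasses the cache. */
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    }
    /* Make the streamed data globally visible before returning. */
    _mm_sfence();
    size_t done = blocks * 16;
    memcpy((char *)dst + done, (const char *)src + done, n - done);
#else
    memcpy(dst, src, n); /* no SSE2 available: plain fallback */
#endif
}
```

This tends to help only when the copy is much larger than the cache and the destination is not read again right away; for small copies it can easily be slower, so benchmark both.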

感情败类

2020-12-04 08:09

I'm not sure whether this is selected at run time or has to be enabled at compile time, but you should have SSE or similar extensions enabled, since the vector unit can often write 128 bits to memory at a time, compared to 64 bits for the CPU's general-purpose registers.

~~Try this implementation.~~

Yeah, and make sure that both the source and the destination are aligned to 128 bits. If your source and destination are not aligned relative to each other, your memcpy() will have to do some serious magic. :)
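"Aligned relative to each other" here means the two addresses have the same offset within a 16-byte block, so a few leading single-byte copies bring both onto a boundary at once. A small sketch of that test, with made-up helper names:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if dst and src have the same offset
 * within a 16-byte block, i.e. aligning one also aligns the other. */
static int same_relative_alignment(const void *dst, const void *src)
{
    return (((uintptr_t)dst ^ (uintptr_t)src) & 15u) == 0;
}

/* Hypothetical helper: number of leading bytes to copy one at a time
 * before p reaches the next 16-byte boundary (0 if already aligned). */
static size_t bytes_to_boundary(const void *p)
{
    return (size_t)((16u - ((uintptr_t)p & 15u)) & 15u);
}
```

If `same_relative_alignment` returns 0, no amount of leading byte copies will align both pointers, and the copy routine has to use unaligned loads or shuffle data between registers.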
