I am investigating performance hotspots in an application that spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays.
Historically, memmove and memcpy were the same function: they behaved the same way and shared the same implementation. It was then realised that memcpy doesn't need to be (and frequently wasn't) defined to handle overlapping regions in any particular way.
The end result is that memmove is defined to handle overlapping regions correctly even if this impacts performance, while memcpy is free to use the best algorithm available for non-overlapping regions. In practice the implementations are normally almost identical.
The problem you have run into is that there are so many variations of x86 hardware that it is impossible to tell which method of shifting memory around will be fastest. And even if you think you have a result in one circumstance, something as simple as a different 'stride' in the memory layout can cause vastly different cache performance.
You can either benchmark what you're actually doing, or ignore the problem and rely on the benchmarking already done for the C library.
Edit: Oh, and one last thing: shifting lots of memory contents around is VERY slow. I would guess your application would run faster with something like a simple B-tree implementation to handle your integers. (Oh, you are? Okay.)
Edit 2: To summarise my expansion in the comments: the microbenchmark is the issue here; it isn't measuring what you think it is. The tasks given to memcpy and memmove in it differ significantly from each other. If the same task is repeated with both memmove and memcpy, the end result will not depend on which memory-shifting function you use UNLESS the regions overlap.