We recently purchased some new servers and are seeing poor memcpy performance: memcpy is about 3x slower on the servers than on our laptops.
I modified the benchmark to use the nanosecond timer on Linux and found similar variation across different processors, all with similar memory and all running RHEL 6. The numbers are consistent across multiple runs. Sizes are in bytes.
Sandy Bridge E5-2648L v2 @ 1.90GHz, HT enabled, L2/L3 256K/20M, 16 GB ECC
malloc for 1073741824 took 47us
memset for 1073741824 took 643841us
memcpy for 1073741824 took 486591us
Westmere E5645 @ 2.40GHz, HT not enabled, dual 6-core, L2/L3 256K/12M, 12 GB ECC
malloc for 1073741824 took 54us
memset for 1073741824 took 789656us
memcpy for 1073741824 took 339707us
Jasper Forest C5549 @ 2.53GHz, HT enabled, dual quad-core, L2/L3 256K/8M, 12 GB ECC
malloc for 1073741824 took 126us
memset for 1073741824 took 280107us
memcpy for 1073741824 took 272370us
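For reference, the kind of timing used for the numbers above could be sketched roughly as follows. This is not the original benchmark: `now_ns` and `run_benchmark` are hypothetical names, and pre-faulting the destination buffer before the timed memcpy is an assumption of this sketch that may not match how the figures above were measured.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: time malloc, memset and memcpy for `size` bytes.
 * Returns 0 on success, -1 if size is 0, an allocation fails,
 * or the copied data does not match. */
static int run_benchmark(size_t size)
{
    uint64_t t0, t1;
    unsigned char *src, *dst;
    int ok;

    if (size == 0)
        return -1;

    t0 = now_ns();
    src = malloc(size);
    t1 = now_ns();
    dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }
    printf("malloc for %zu took %llu us\n", size,
           (unsigned long long)((t1 - t0) / 1000));

    /* First touch of src, so this timing includes page faults. */
    t0 = now_ns();
    memset(src, 0xA5, size);
    t1 = now_ns();
    printf("memset for %zu took %llu us\n", size,
           (unsigned long long)((t1 - t0) / 1000));

    /* Assumption of this sketch: pre-fault dst (untimed) so that the
     * memcpy timing below measures copying rather than page faults. */
    memset(dst, 0, size);

    t0 = now_ns();
    memcpy(dst, src, size);
    t1 = now_ns();
    printf("memcpy for %zu took %llu us\n", size,
           (unsigned long long)((t1 - t0) / 1000));

    /* Read the result so the compiler cannot discard the copy. */
    ok = (dst[size - 1] == 0xA5);
    free(src);
    free(dst);
    return ok ? 0 : -1;
}
```

Calling `run_benchmark(1073741824)` from `main` would exercise the same 1 GB size. On older glibc versions (before 2.17, which includes RHEL 6) `clock_gettime` lives in librt, so the program has to be linked with `-lrt`.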
Here are the results with inline C code in place of the library calls, compiled with -O3:
Sandy Bridge E5-2648L v2 @ 1.90GHz, HT enabled, 256K/20M, 16 GB
malloc for 1 GB took 46 us
memset for 1 GB took 478722 us
memcpy for 1 GB took 262547 us
Westmere E5645 @ 2.40GHz, HT not enabled, dual 6-core, 256K/12M, 12 GB
malloc for 1 GB took 53 us
memset for 1 GB took 681733 us
memcpy for 1 GB took 258147 us
Jasper Forest C5549 @ 2.53GHz, HT enabled, dual quad-core, 256K/8M, 12 GB
malloc for 1 GB took 67 us
memset for 1 GB took 254544 us
memcpy for 1 GB took 255658 us
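The inline code itself is not shown here, so the following is only a guess at its shape: plain byte-at-a-time loops standing in for the library calls. The names `inline_memset` and `inline_memcpy` are hypothetical.

```c
#include <stddef.h>

/* Byte-at-a-time fill, a stand-in for the library memset. */
static void inline_memset(unsigned char *dst, unsigned char value, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = value;
}

/* Byte-at-a-time copy, a stand-in for the library memcpy.
 * The buffers must not overlap. */
static void inline_memcpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

One caveat when reproducing this with a newer compiler: recent GCC versions can recognize loops like these at -O3 and turn them back into calls to memset/memcpy, which would defeat the comparison. Building with `-fno-tree-loop-distribute-patterns` should prevent that.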
For the heck of it, I also tried making the inline memcpy copy 8 bytes at a time. On these Intel processors it made no noticeable difference, presumably because the cache merges all of the byte operations into the minimum number of memory operations. I suspect the library memcpy (glibc's, on RHEL 6) is trying to be too clever.
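A minimal sketch of what an 8-bytes-at-a-time variant might look like, assuming both buffers are 8-byte aligned (as malloc results are) and do not overlap. The name `inline_memcpy8` is hypothetical, and this is not necessarily the code behind the result described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes using 64-bit loads and stores.
 * Assumes dst and src are 8-byte aligned and do not overlap. */
static void inline_memcpy8(void *dst, const void *src, size_t n)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t words = n / 8;
    unsigned char *db = (unsigned char *)dst + words * 8;
    const unsigned char *sb = (const unsigned char *)src + words * 8;
    size_t tail = n % 8;
    size_t i;

    /* Bulk of the copy: one 64-bit word per iteration. */
    for (i = 0; i < words; i++)
        d[i] = s[i];

    /* Copy any remaining tail bytes one at a time. */
    for (i = 0; i < tail; i++)
        db[i] = sb[i];
}
```

For a 1 GB buffer the tail loop never runs, since the size is a multiple of 8; it is only there so the function stays correct for arbitrary sizes.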