If your CPU has the CPUID ERMSB bit set, the REP MOVSB and REP STOSB instructions are executed differently than on older processors.
See the Intel Optimization Reference Manual, section 3.7.6, Enhanced REP MOVSB and REP STOSB operation (ERMSB).
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
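To check for the ERMSB bit at run time, a minimal sketch in C follows. It assumes GCC or Clang with their `<cpuid.h>` header; the function name `has_ermsb` is made up for this illustration. ERMSB is reported in CPUID leaf 7, sub-leaf 0, EBX bit 9.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; an assumption of this sketch */
#endif

/* Hypothetical helper: returns true when CPUID.(EAX=07H,ECX=0):EBX bit 9
   (ERMSB) is set, and false on non-x86 targets or when leaf 7 is missing. */
static bool has_ermsb(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid_count returns 0 when the requested leaf is unsupported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;

    return ((ebx >> 9) & 1u) != 0;
#else
    return false;
#endif
}
```

A caller would typically evaluate this once at startup and cache the result, since CPUID is a relatively slow, serializing instruction.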
Both the manual and my tests show that the benefits of REP STOSB appear only on large memory blocks, larger than 128 bytes. On smaller blocks, like 5 bytes, the code that you have shown (mov byte [edi],al;inc edi;dec ecx;jnz Clear) would be much faster, since the startup costs of REP STOSB are very high - about 35 cycles.
In order to get the benefits of REP STOSB on new processors with the CPUID ERMSB bit, the following conditions should be met:
- the destination buffer has to be aligned to a 16-byte boundary;
- if the length is a multiple of 64, it can produce even higher performance;
- the direction flag should be set to "forward" (CLD).
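As a rough illustration of how these conditions and the 128-byte break-even point could be combined, here is a sketch in C with GCC-style inline assembly. The function name `fill_bytes` and the `REP_STOSB_MIN` constant are invented for this example, the threshold is simply the figure quoted in this answer, and the code does not check alignment or the ERMSB bit, which a real implementation would probably do.

```c
#include <stddef.h>

/* Threshold taken from the discussion above; an assumption to tune per CPU. */
#define REP_STOSB_MIN 128

/* Hypothetical fill routine: plain byte loop for short blocks,
   REP STOSB for blocks of REP_STOSB_MIN bytes or more. */
static void *fill_bytes(void *dst, unsigned char value, size_t n)
{
    unsigned char *p = dst;

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (n >= REP_STOSB_MIN) {
        /* REP STOSB stores AL to [RDI/EDI], RCX/ECX times. The usual
           calling conventions keep the direction flag clear (forward)
           in compiled code, so no explicit CLD is issued here. */
        __asm__ volatile("rep stosb"
                         : "+D"(p), "+c"(n)
                         : "a"(value)
                         : "memory");
        return dst;
    }
#endif

    /* Short blocks (or non-x86 targets): avoid the REP startup cost. */
    while (n--)
        *p++ = value;
    return dst;
}
```

For example, `fill_bytes(buf, 0, 5)` takes the byte loop, while `fill_bytes(buf, 0, 4096)` takes the REP STOSB path on x86. Note that an optimizing compiler may itself replace the byte loop with a call to its own memset.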
ERMSB begins to outperform other methods when the length is at least 128 bytes because, as noted above, ERMSB has a high internal startup cost of about 35 cycles.
ERMSB begins to clearly outperform other methods when the length is more than 2048 bytes.
When the destination buffer is 16-byte aligned, REP STOSB using ERMSB can perform better than SIMD approaches. When the destination buffer is misaligned, memset() performance using ERMSB can degrade by about 20% relative to the aligned case on processors based on the Intel microarchitecture code-named Ivy Bridge. In contrast, a SIMD implementation of memset() will experience smaller degradation when the destination is misaligned.
I have an Intel Core i5 6600 processor with 32K L1 cache, 256K L2 cache and 6MB L3 cache, and I could obtain ~100 GB/sec with REP STOSB on 32K blocks.
Here are the results of the REP STOSB memset() implementation:
- 1297920000 data blocks of 16 bytes took 13.6022 seconds to process by memset(); 1455.9909 Megabytes per second
- 648960000 data blocks of 32 bytes took 6.7840 seconds to process by memset(); 2919.3058 Megabytes per second
- 1622400000 data blocks of 64 bytes took 16.9762 seconds to process by memset(); 5833.0883 Megabytes per second
- 817587402 data blocks of 127 bytes took 8.5698 seconds to process by memset(); 11554.8914 Megabytes per second
- 811200000 data blocks of 128 bytes took 8.5197 seconds to process by memset(); 11622.9306 Megabytes per second
- 804911628 data blocks of 129 bytes took 9.1513 seconds to process by memset(); 10820.6427 Megabytes per second
- 407190588 data blocks of 255 bytes took 5.4656 seconds to process by memset(); 18117.7029 Megabytes per second
- 405600000 data blocks of 256 bytes took 5.0314 seconds to process by memset(); 19681.1544 Megabytes per second
- 202800000 data blocks of 512 bytes took 2.7403 seconds to process by memset(); 36135.8273 Megabytes per second
- 101400000 data blocks of 1024 bytes took 1.6704 seconds to process by memset(); 59279.5229 Megabytes per second
- 3168750 data blocks of 32768 bytes took 0.9525 seconds to process by memset(); 103957.8488 Megabytes per second
- 2028000 data blocks of 51200 bytes took 1.5321 seconds to process by memset(); 64633.5697 Megabytes per second
- 413878 data blocks of 250880 bytes took 1.7737 seconds to process by memset(); 55828.1341 Megabytes per second
- 19805 data blocks of 5242880 bytes took 2.6009 seconds to process by memset(); 38073.0694 Megabytes per second
Here are the results of the memset() implementation that uses MOVDQA [RCX],XMM0:
- 1297920000 data blocks of 16 bytes took 3.5795 seconds to process by memset(); 5532.7798 Megabytes per second
- 648960000 data blocks of 32 bytes took 5.5538 seconds to process by memset(); 3565.9727 Megabytes per second
- 1622400000 data blocks of 64 bytes took 15.7489 seconds to process by memset(); 6287.6436 Megabytes per second
- 817587402 data blocks of 127 bytes took 9.6637 seconds to process by memset(); 10246.9173 Megabytes per second
- 811200000 data blocks of 128 bytes took 9.6236 seconds to process by memset(); 10289.6215 Megabytes per second
- 804911628 data blocks of 129 bytes took 9.4852 seconds to process by memset(); 10439.7473 Megabytes per second
- 407190588 data blocks of 255 bytes took 6.6156 seconds to process by memset(); 14968.1754 Megabytes per second
- 405600000 data blocks of 256 bytes took 6.6437 seconds to process by memset(); 14904.9230 Megabytes per second
- 202800000 data blocks of 512 bytes took 5.0695 seconds to process by memset(); 19533.2299 Megabytes per second
- 101400000 data blocks of 1024 bytes took 4.3506 seconds to process by memset(); 22761.0460 Megabytes per second
- 3168750 data blocks of 32768 bytes took 3.7269 seconds to process by memset(); 26569.8145 Megabytes per second
- 2028000 data blocks of 51200 bytes took 4.0538 seconds to process by memset(); 24427.4096 Megabytes per second
- 413878 data blocks of 250880 bytes took 3.9936 seconds to process by memset(); 24795.5548 Megabytes per second
- 19805 data blocks of 5242880 bytes took 4.5892 seconds to process by memset(); 21577.7860 Megabytes per second
As you can see, on blocks of 64 bytes and smaller REP STOSB is slower, but starting from 128-byte blocks REP STOSB begins to outperform the MOVDQA implementation, and the difference becomes very significant for blocks of 512 bytes and longer.
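For anyone who wants to check the throughput figures above, they can be reproduced with the small C helper below. It assumes that the benchmark counts a "Megabyte" as 2^20 bytes, which matches the printed numbers to within the rounding of the timings; the function name `mib_per_second` is made up for this sketch.

```c
/* Hypothetical helper: throughput in units of 2^20 bytes per second,
   computed from the block count, block size and elapsed time. */
static double mib_per_second(double blocks, double block_bytes, double seconds)
{
    return blocks * block_bytes / seconds / (1024.0 * 1024.0);
}
```

For instance, `mib_per_second(811200000.0, 128.0, 8.5197)` gives roughly 11623, in line with the 128-byte REP STOSB result listed above.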