When debugging, I frequently stepped into the handwritten assembly implementation of memcpy and memset. These are usually implemented using streaming instructions if availab
Once upon a time, rep movsb was the optimal solution.
The original IBM PC had an 8088 processor with an 8-bit data bus and no caches. Back then, the fastest program was generally the one with the fewest instruction bytes, because every byte had to be fetched over that narrow bus. Having special instructions helped.
Nowadays, the fastest program is the one that keeps as many of the CPU's execution resources as possible busy in parallel. Strange as it might seem at first, a sequence of many simple instructions can actually run faster than a single do-it-all instruction.
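To make the contrast concrete, here is a rough sketch of the two styles in C. The function names copy_rep_movsb and copy_simple are made up for illustration, and the inline assembly assumes GCC or Clang on x86 (other targets fall back to a plain byte loop). This is not how any particular library implements memcpy, and which version wins depends on the CPU and the copy size, so measure before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the "single do-it-all instruction" style.
   Assumes GCC/Clang extended asm on x86 and that the direction flag
   is clear, as the usual calling conventions guarantee. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Fallback for other targets: plain byte loop. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
}

/* Hypothetical helper: the "many simple instructions" style.
   Each iteration issues four independent 8-byte loads and stores,
   which a superscalar CPU can overlap. */
static void copy_simple(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 32) {
        uint64_t a, b, c, e;
        /* memcpy of a fixed 8 bytes is the portable way to do an
           unaligned load/store; compilers turn it into one move. */
        memcpy(&a, s,      8);
        memcpy(&b, s + 8,  8);
        memcpy(&c, s + 16, 8);
        memcpy(&e, s + 24, 8);
        memcpy(d,      &a, 8);
        memcpy(d + 8,  &b, 8);
        memcpy(d + 16, &c, 8);
        memcpy(d + 24, &e, 8);
        s += 32;
        d += 32;
        n -= 32;
    }

    /* Copy the remaining 0..31 bytes one at a time. */
    while (n--)
        *d++ = *s++;
}
```

Both functions produce the same result as memcpy for non-overlapping buffers. The difference is only in how much independent work the CPU sees at once.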
Intel and AMD keep the old instructions around mainly for backward compatibility.