This question has been bothering me for some time. The possibilities I am considering are
Does a
I've made a small benchmark (VS 2017 Preview, MKL 2017 Update 4) to compare memcpy with the sequential version of cblas_?copy, and found them to be equally fast for both float and double.
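For reference, the comparison looks roughly like the sketch below. This is not my exact benchmark code: ref_scopy is a hypothetical plain-C stand-in that only mirrors the argument order of cblas_scopy so the snippet builds without MKL, and memcpy_copy and time_copies are helper names made up for this illustration. In the real test the stand-in would be replaced by the MKL call.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in with the same argument order as cblas_scopy:
   copies n floats from x (stride incx) to y (stride incy).
   Only non-negative strides are handled in this sketch. */
static void ref_scopy(size_t n, const float *x, size_t incx,
                      float *y, size_t incy)
{
    for (size_t i = 0; i < n; ++i)
        y[i * incy] = x[i * incx];
}

/* Contiguous copy via memcpy; same result as ref_scopy with both strides 1. */
static void memcpy_copy(size_t n, const float *x, float *y)
{
    memcpy(y, x, n * sizeof *x);
}

/* Returns the CPU seconds spent on `reps` contiguous copies of n floats.
   use_memcpy selects which of the two routines is timed. */
static double time_copies(int use_memcpy, size_t n,
                          const float *x, float *y, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) {
        if (use_memcpy)
            memcpy_copy(n, x, y);
        else
            ref_scopy(n, x, 1, y, 1);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copies(1, n, x, y, reps) and time_copies(0, n, x, y, reps) on the same buffers gives the two numbers to compare. An optimizing compiler may drop repeated copies whose result is never read, so the destination should be inspected after the loop for the timings to mean anything.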