Consider this C code:
extern volatile int hardware_reg;
void f(const void *src, size_t len) {
    void *dst = ;
    hardware_reg = 1;
It's probably going to get optimized, either because the compiler inlines the memcpy call and eliminates the first assignment, or because it gets compiled to RISC code or machine code and gets optimized there.
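To make the first case concrete, here is a minimal sketch of the pattern. It is not the original snippet, whose body is incomplete above. The names `flag` and `copy_with_flag` are hypothetical, and the sketch assumes the "first assignment" is a plain, non-volatile store that is overwritten right after the memcpy. Under that assumption the compiler may treat the first store as dead and drop it. A store to a volatile object such as hardware_reg is different: it counts as an observable side effect and has to be kept.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status flag. It is NOT volatile and its address never
   escapes this file, so the compiler can tell that memcpy neither
   reads nor writes it. */
static int flag;

/* Hypothetical helper: set the flag, copy, then set the flag again. */
void copy_with_flag(void *dst, const void *src, size_t len)
{
    flag = 0;               /* dead store: overwritten below before any read */
    memcpy(dst, src, len);  /* may be expanded inline by the compiler */
    flag = 1;               /* the only store that must survive */
}
```

Whether the first store actually disappears depends on the compiler and optimization level. The observable result is the same either way: after the call, the destination holds the copied bytes and `flag` is 1.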