I like the compiler's inlining optimization for memcpy calls. On ARM processors I have found that both GCC and Clang seem to inline copies of up to 64 bytes. Without the inl