I have a function that is doing memcpy, but it\'s taking up an enormous amount of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?
Here is an alternative C version of memcpy that is inlineable and I find it outperforms memcpy for GCC for Arm64 by about 50% in the application I used it for. It is 64-bit platform independent. The tail processing can be removed if the usage instance does not need it for a bit more speed. Copies uint32_t arrays, smaller datatypes not tested but might work. Might be able to adapt for other datatypes. 64-bit copy (two indexes are copied simultaneously). 32-bit should also work but slower. Credits to Neoscrypt project.
static inline void newmemcpy(void *__restrict__ dstp,
void *__restrict__ srcp, uint len)
{
ulong *dst = (ulong *) dstp;
ulong *src = (ulong *) srcp;
uint i, tail;
for(i = 0; i < (len / sizeof(ulong)); i++)
*dst++ = *src++;
/*
Remove below if your application does not need it.
If console application, you can uncomment the printf to test
whether tail processing is being used.
*/
tail = len & (sizeof(ulong) - 1);
if(tail) {
//printf("tailused\n");
uchar *dstb = (uchar *) dstp;
uchar *srcb = (uchar *) srcp;
for(i = len - tail; i < len; i++)
dstb[i] = srcb[i];
}
}