Prefetching Examples?

南旧 2020-11-27 10:44

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantia

5 answers
  •  醉酒成梦
    2020-11-27 11:45

    Prefetching can be tuned to the cache line size, which is 64 bytes on most modern 64-bit processors, so that a single instruction preloads, for example, a uint32_t[16].
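To make the "one prefetch per 64-byte line" idea concrete, here is a minimal sketch. The function name `sum_blocks`, the `block_ids` array, and the one-block prefetch distance are all hypothetical and not taken from the answer. It assumes a 64-byte cache line and a 64-byte aligned `table`, so that each 16-word block is exactly one line. Data-dependent lookups like this are where a software hint is most likely to help; whether it pays off on a given CPU has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache line, so one line holds 16 uint32_t values. */
enum { WORDS_PER_LINE = 16 };

/* Hypothetical example: sum the 16-word blocks of `table` selected by
   `block_ids`. `table` is assumed to be 64-byte aligned, so each block
   occupies exactly one cache line and needs exactly one prefetch. */
uint64_t sum_blocks(const uint32_t *table, const size_t *block_ids, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        /* Hint the next block (read access, high temporal locality)
           while the current block is being processed. */
        if (i + 1 < count)
            __builtin_prefetch(&table[block_ids[i + 1] * WORDS_PER_LINE], 0, 3);

        const uint32_t *block = &table[block_ids[i] * WORDS_PER_LINE];
        for (size_t w = 0; w < WORDS_PER_LINE; w++)
            sum += block[w];
    }
    return sum;
}
```

A prefetch distance of one block is only a starting point; if processing a block takes less time than a memory access, hinting several blocks ahead may work better.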

    For example, on ARMv8 I found through experimentation that casting the memory pointer to a uint32_t 4x4 matrix vector type (uint32x4x4_t, which is 64 bytes in size) halved the number of prefetch instructions required. Before that, I had to advance the index by 8 elements between prefetches, because each one appeared to load only half the data, even though my understanding was that it fetches a full cache line.

    Original code, prefetching a uint32_t[32]:

    int addrindex = &B[0];
    /* one hint every 8 elements (32 bytes): four calls cover uint32_t[32] */
    __builtin_prefetch(&V[addrindex]);
    __builtin_prefetch(&V[addrindex + 8]);
    __builtin_prefetch(&V[addrindex + 16]);
    __builtin_prefetch(&V[addrindex + 24]);
    

    After...

    int addrindex = &B[0];
    /* one hint every 16 elements (64 bytes): two calls cover uint32_t[32] */
    __builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
    __builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);
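
The stride arithmetic behind the two variants can be sketched portably as follows. The helper name `prefetch_words` is hypothetical, and returning the number of hints issued is only there to make the count visible. The sketch assumes a 64-byte cache line and a 64-byte aligned pointer. Since `__builtin_prefetch` takes a `const void *`, the sketch passes a plain `uint32_t` pointer and omits the NEON type cast; whether the cast changes code generation on a particular compiler is not something this sketch demonstrates.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache line, i.e. 16 uint32_t values per line. */
enum { WORDS_PER_LINE = 16 };

/* Hypothetical helper: issue one read prefetch per cache line covering
   `words` consecutive uint32_t values starting at `p` (assumed to be
   64-byte aligned). Returns how many prefetch hints were issued. */
static inline size_t prefetch_words(const uint32_t *p, size_t words)
{
    size_t issued = 0;
    for (size_t w = 0; w < words; w += WORDS_PER_LINE) {
        __builtin_prefetch(p + w, 0, 3);
        issued++;
    }
    return issued;
}
```

For a uint32_t[32] this issues two hints, at element offsets 0 and 16, which matches the "After" version above.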
    

    For some reason, using the int data type for the address index/offset gave better performance. This was tested with GCC 8 on a Cortex-A53. Using an equivalent 64-byte vector type on other architectures might give the same improvement if you find, as I did, that not all of the data is being prefetched. In my application, with a loop of one million iterations, this change alone improved performance by 5%. There were further requirements for the improvement.

    The 128-megabyte "V" memory allocation had to be aligned to 64 bytes.

    uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));
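
A simpler way to get the same 64-byte alignment can be sketched as below. The helper name `alloc_lines` is hypothetical and this is not the code from the linked repository. It relies on C11 `aligned_alloc`, which already returns a pointer with the requested alignment, so no further rounding of the pointer should be needed; the size is rounded up to a multiple of 64 because C11 expects the size to be a multiple of the alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for `words` uint32_t values on a
   64-byte boundary. Returns NULL on failure; release with free(). */
static uint32_t *alloc_lines(size_t words)
{
    size_t bytes = words * sizeof(uint32_t);

    /* Round the size up to a multiple of the alignment (64 bytes). */
    bytes = (bytes + 63) & ~(size_t)63;

    return aligned_alloc(64, bytes);
}
```

A typical call would be `uint32_t *V = alloc_lines(n);` followed by a NULL check before use.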
    

    Also, I had to use plain C operators instead of NEON intrinsics. The intrinsics require regular data type pointers (uint32_t * in my case), and with them the new prefetch method caused a performance regression.

    My real-world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c, in scrypt_core() and its internal function, both of which are easy to read. GCC 8 does the hard work. The overall performance improvement was 25%.
