Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

滥情空心 2020-12-18 13:42

In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:



        
3 answers
  •  一个人的身影
    2020-12-18 14:05

    Here is a variant built upon Z Boson's original answer (before edit), using two 128-bit loads instead of one 256-bit load.

    __m256d b01 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+0]));
    __m256d b23 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+2]));
    __m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);
    __m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);
    __m256d b0000 = _mm256_permute_pd(b0101, 0);
    __m256d b1111 = _mm256_permute_pd(b0101, 0xf);
    __m256d b2222 = _mm256_permute_pd(b2323, 0);
    __m256d b3333 = _mm256_permute_pd(b2323, 0xf);
    

    In my case this is slightly faster than using one 256-bit load, possibly because the first permute can start before the second 128-bit load completes.
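A minimal sketch of the two-load sequence above, wrapped in a function so the lane contents can be checked. The names `broadcast4_two_loads`, `out`, and `AVX_TARGET` are hypothetical and not part of the original answer; the sketch assumes `b` is 16-byte aligned (as `_mm_load_pd` requires) and a CPU with AVX.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// AVX_TARGET is a hypothetical helper macro so the function builds on
// GCC/Clang even without -mavx; other compilers need their own AVX switch.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical wrapper: broadcasts b[0..3] into four vectors and stores
// vector i to out[i], so every out[i][j] ends up equal to b[i].
AVX_TARGET
void broadcast4_two_loads(const double* b, double out[4][4]) {
    // _mm_load_pd requires a 16-byte aligned address.
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    // Two 128-bit loads; the upper halves of b01/b23 are undefined after the cast.
    __m256d b01 = _mm256_castpd128_pd256(_mm_load_pd(b + 0));
    __m256d b23 = _mm256_castpd128_pd256(_mm_load_pd(b + 2));

    // Immediate 0 copies the low 128-bit lane into both halves,
    // so the undefined upper halves are never read.
    __m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);
    __m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);

    // In-lane permutes: 0x0 selects the even element of each lane, 0xf the odd one.
    _mm256_storeu_pd(out[0], _mm256_permute_pd(b0101, 0x0));
    _mm256_storeu_pd(out[1], _mm256_permute_pd(b0101, 0xf));
    _mm256_storeu_pd(out[2], _mm256_permute_pd(b2323, 0x0));
    _mm256_storeu_pd(out[3], _mm256_permute_pd(b2323, 0xf));
}
```

In a real kernel the four vectors would stay in registers rather than being stored; the stores here exist only to make the result observable.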


    Edit: gcc compiles the two loads and the first 2 permutes into

    vmovapd (%rdi),%xmm8
    vmovapd 0x10(%rdi),%xmm4
    vperm2f128 $0x0,%ymm8,%ymm8,%ymm1
    vperm2f128 $0x0,%ymm4,%ymm4,%ymm2
    

    Paul R's suggestion of using _mm256_broadcast_pd() can be written as:

    __m256d b0101 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
    __m256d b2323 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
    

    which compiles into

    vbroadcastf128 (%rdi),%ymm6
    vbroadcastf128 0x10(%rdi),%ymm11
    

    and, in my tests, is faster than the two vmovapd + vperm2f128 pairs.
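As a rough sketch, the `_mm256_broadcast_pd()` loads can be combined with the same in-lane permutes as before to produce all four broadcast vectors. The names `broadcast4_vbroadcastf128`, `out`, and `AVX_TARGET` are hypothetical, and the stores are there only so the result can be inspected.

```cpp
#include <cassert>
#include <immintrin.h>

// AVX_TARGET is a hypothetical helper macro so the function builds on
// GCC/Clang even without -mavx; other compilers need their own AVX switch.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical wrapper: every out[i][j] ends up equal to b[i].
AVX_TARGET
void broadcast4_vbroadcastf128(const double* b, double out[4][4]) {
    assert(b != nullptr);

    // Each load duplicates a 128-bit pair into both halves of a YMM register,
    // replacing the vmovapd + vperm2f128 pair of the previous variant.
    __m256d b0101 = _mm256_broadcast_pd(reinterpret_cast<const __m128d*>(b + 0));
    __m256d b2323 = _mm256_broadcast_pd(reinterpret_cast<const __m128d*>(b + 2));

    // In-lane permutes: 0x0 selects the even element of each lane, 0xf the odd one.
    _mm256_storeu_pd(out[0], _mm256_permute_pd(b0101, 0x0));
    _mm256_storeu_pd(out[1], _mm256_permute_pd(b0101, 0xf));
    _mm256_storeu_pd(out[2], _mm256_permute_pd(b2323, 0x0));
    _mm256_storeu_pd(out[3], _mm256_permute_pd(b2323, 0xf));
}
```

This version issues two loads and four shuffles for the four vectors.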

    In my code, which is bound by the vector execution ports rather than by L1 cache accesses, this is still slightly slower than four _mm256_broadcast_sd() calls, but I imagine that code constrained by L1 bandwidth could benefit greatly from it.
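For reference, a sketch of the four-`_mm256_broadcast_sd()` baseline that the comparison above refers to. The names `broadcast4_sd`, `out`, and `AVX_TARGET` are hypothetical. It issues four loads and no separate permute intrinsics, which matches the trade-off described: more load traffic, less shuffle work.

```cpp
#include <cassert>
#include <immintrin.h>

// AVX_TARGET is a hypothetical helper macro so the function builds on
// GCC/Clang even without -mavx; other compilers need their own AVX switch.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical wrapper: every out[i][j] ends up equal to b[i].
AVX_TARGET
void broadcast4_sd(const double* b, double out[4][4]) {
    assert(b != nullptr);

    // One broadcast load per element; no separate permute intrinsics are used.
    __m256d b0000 = _mm256_broadcast_sd(b + 0);
    __m256d b1111 = _mm256_broadcast_sd(b + 1);
    __m256d b2222 = _mm256_broadcast_sd(b + 2);
    __m256d b3333 = _mm256_broadcast_sd(b + 3);

    // Stores are only here to make the result observable.
    _mm256_storeu_pd(out[0], b0000);
    _mm256_storeu_pd(out[1], b1111);
    _mm256_storeu_pd(out[2], b2222);
    _mm256_storeu_pd(out[3], b3333);
}
```

Which variant wins depends on whether the surrounding loop is limited by loads or by shuffles, so it is worth benchmarking both in the actual kernel.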
