问题

I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast to float and store in an array of structs. The reason this is a bottleneck is it is called either with very large loops, or thousands of times.

Is there a faster way to do this copy & cast operation using SIMD Intrinsics? I have seen this answer on faster memcpy but doesn't address the cast.

The simple C++ loop case looks like this

        int _iNum;
        const unsigned int _uiDefaultOffset; // a constant 

        double * pInputValues1; // array of double values, count = _iNum;
        double * pInputValues2; 

        MyStruct * pOutput;    // array of outputs defined as
        // struct MyStruct 
        // { 
        //    float O1;
        //    float O2;
        //    unsigned int Offset;
        // };

        for (int i = 0; i < _iNum; ++i)
        {
            _pPoints[i].O1 = static_cast<float>(pInputValues1[i]);
            _pPoints[i].O2 = static_cast<float>(pInputValues2[i]);
            _pPoints[i].Offset = _uiDefaultOffset;
        }

Note: The struct format is [Float,Float,Int] (24 bytes) but we could (if it will help performance) add an extra 4bytes padding making it 32 bytes.

回答1:

Loosely inspired by Intel's 4x3 transposition example and based on @PeterCordes solution, here is an AVX1 solution, which should get a throughput of 8 structs within 8 cycles (bottleneck is still p5):

#include <immintrin.h>
#include <stddef.h>

struct f2u { 
  float O1, O2;
  unsigned int Offset;
};
static const unsigned uiDefaultOffset = 123;

void cvt_interleave_avx(f2u *__restrict dst, double *__restrict pA, double *__restrict pB, ptrdiff_t len)
{
    __m256 voffset = _mm256_castsi256_ps(_mm256_set1_epi32(uiDefaultOffset));

    // 8 structs per iteration
    ptrdiff_t i=0;
    for(; i<len-7; i+=8)
    {
        // destination address for next 8 structs as float*:
        float* dst_f = reinterpret_cast<float*>(dst + i);

        // 4*vcvtpd2ps    --->  4*(p1,p5,p23)
        __m128 inA3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i]));
        __m128 inB3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i]));
        __m128 inA7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i+4]));
        __m128 inB7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i+4]));

        // 2*vinsertf128  --->  2*p5
        __m256 A76543210 = _mm256_set_m128(inA7654,inA3210);
        __m256 B76543210 = _mm256_set_m128(inB7654,inB3210);

        // 2*vpermilps    --->  2*p5
        __m256 A56741230 = _mm256_shuffle_ps(A76543210,A76543210,_MM_SHUFFLE(1,2,3,0));
        __m256 B67452301 = _mm256_shuffle_ps(B76543210,B76543210,_MM_SHUFFLE(2,3,0,1));

        // 6*vblendps     ---> 6*p015 (does not need to use p5)
        __m256 outA1__B0A0 = _mm256_blend_ps(A56741230,B67452301,2+16*2);
        __m256 outA1ccB0A0 = _mm256_blend_ps(outA1__B0A0,voffset,4+16*4);

        __m256 outB2A2__B1 = _mm256_blend_ps(B67452301,A56741230,4+16*4);
        __m256 outB2A2ccB1 = _mm256_blend_ps(outB2A2__B1,voffset,2+16*2);

        __m256 outccB3__cc = _mm256_blend_ps(voffset,B67452301,4+16*4);
        __m256 outccB3A3cc = _mm256_blend_ps(outccB3__cc,A56741230,2+16*2);

        // 3* vmovups     ---> 3*(p237,p4)
        _mm_storeu_ps(dst_f+ 0,_mm256_castps256_ps128(outA1ccB0A0));
        _mm_storeu_ps(dst_f+ 4,_mm256_castps256_ps128(outB2A2ccB1));
        _mm_storeu_ps(dst_f+ 8,_mm256_castps256_ps128(outccB3A3cc));
        // 3*vextractf128 ---> 3*(p23,p4)
        _mm_storeu_ps(dst_f+12,_mm256_extractf128_ps(outA1ccB0A0,1));
        _mm_storeu_ps(dst_f+16,_mm256_extractf128_ps(outB2A2ccB1,1));
        _mm_storeu_ps(dst_f+20,_mm256_extractf128_ps(outccB3A3cc,1));
    }

    // scalar cleanup for  if _iNum is not even
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}

Godbolt link, with minimal test-code at the end: https://godbolt.org/z/0kTO2b

For some reason, gcc does not like to generate vcvtpd2ps which directly convert from memory to a register. This ~~might~~ works better with aligned loads (having input and output aligned is likely beneficial anyway). And clang apparently wants to outsmart me with one of the vextractf128 instructions at the end.

回答2:

Here is an attempt with SSE4.1, no AVX (that's trickier to do and so far I'm coming up with even more shuffles), and using the 12byte/point format: (not tested)

void test3(MyStruct * _pPoints, double * pInputValues1, double * pInputValues2) {
        // struct MyStruct 
        // { 
        //    float O1;
        //    float O2;
        //    unsigned int Offset;
        // };
    __m128 offset = _mm_castsi128_ps(_mm_cvtsi32_si128(_uiDefaultOffset));
    int i;
    for (i = 0; i < _iNum - 2; i += 2)
    {
        // read inputs and convert to float
        __m128d inA = _mm_loadu_pd(&pInputValues1[i]);
        __m128d inB = _mm_loadu_pd(&pInputValues2[i]);
        __m128 inAf = _mm_cvtpd_ps(inA);    // 0 0 A1 A0
        __m128 inBf = _mm_cvtpd_ps(inB);    // 0 0 B1 B0
        // shuffle B0 from place 0 to place 1, merge with offset
        __m128 tempA = _mm_shuffle_ps(inBf, offset, _MM_SHUFFLE(1, 0, 0, 0)); // 0 OF B0 B0
        // shuffle A1 from place 1 to place 0, merge with offset
        __m128 tempB = _mm_shuffle_ps(inAf, offset, _MM_SHUFFLE(1, 0, 1, 1)); // 0 OF A1 A1
        // replace B0 at place 0 with A0
        __m128 outA = _mm_blend_ps(tempA, inAf, 1);  // 0 OF B0 A0
        // replace A1 at place 1 with B1
        __m128 outB = _mm_blend_ps(tempB, inBf, 2);  // 0 OF B1 A1
        // store results
        _mm_storeu_ps(&_pPoints[i].O1, outA);
        _mm_storeu_ps(&_pPoints[i + 1].O1, outB);
    }
    // remaining iteration if _iNum is not even
    for (; i < _iNum; i++)
    {
        _pPoints[i].O1 = static_cast<float>(pInputValues1[i]);
        _pPoints[i].O2 = static_cast<float>(pInputValues2[i]);
        _pPoints[i].Offset = _uiDefaultOffset;
    }
}

This uses the ability of shufps to choose from two different sources to do the merging of the dynamic data and the constant offset, the same shuffles also move the float in each group that needs to move. Then blends are used to replace a single float with an other float that was already at the right place. This takes 2 shuffles and 2 blends, there is also a way with 3 shuffles and zero blends, but the shuffles all go to p5 on current Intel processors while the blend can go to a different port. The conversions already use p5 too so it's getting swamped, using the blends should be better. It's still 4 p5 µops per iteration so it takes at least 2 cycles per item processed, which is not great.

The main loop skips the last items so that it doesn't write out of bounds, it does slightly-overlapping 16 byte stores that write 4 bytes beyond the end of the struct. That part gets overwritten with the real result by the next store, but it might be dangerous to do that at the end of the array.

回答3:

This problem is not very similar to memcpy. It's all about optimizing the interleaving with shuffles and/or scalar store of the loop-invariant integer member. That makes SIMD hard.

Do you need to have this storage format with the int interleaved with the float members? Interleaving the floats is bad enough. I assume some later code is going to modify the ints in different structs, otherwise it makes no sense to duplicate it for every element.

Could you work in groups of 4 elements, like struct { float a[4], b[4]; int i[4]; }; so you could load+convert 4x contiguous double into 4x float and do a 128-bit SIMD store? You'd still have some spatial locality when accessing all 3 members of a single output-array "struct".

Anyway, assuming your output format has to be fully interleaved, we don't need to pad it to 16 bytes. x86 CPUs can efficiently handle overlapping 16-byte stores to work with 12-byte structs, like @harold's answer shows. Cache-line splits probably cost less than the extra memory bandwidth needed to store the padding.

Or another strategy would be to use separate stores for the floats vs. the int, so you don't need overlapping. We can probably optimize that to the point where it should bottleneck on 1 store per clock cycle for 1 struct per 2 cycles. (Or slightly lower because IIRC cache-split stores need to replay the store uop, at least on Intel CPUs.) We could also unroll by 4*12 = 3*16 bytes and save 2 integer stores by using SIMD stores that gets overlapped by float data. 48 bytes = xyIx|yIxy|IxyI has four I elements as part of four structs, but they're close enough that we can store all 4 with two _mm_storeu_si128( set1(offset) ) intrinsics. Then store the xy pairs overlapping with that. 16-byte boundaries are marked with |. If cache line splits are a problem, we could do 2x scalar and one SIMD for the last vector which is aligned (if the output array is 16-byte aligned). Or on Intel Haswell and later CPUs, a 32-byte aligned store might be good.

If we're not careful, we can very easily bottleneck on shuffle throughput on Intel CPUs, especially Sandybridge-family (SnB through Skylake/Coffee Lake) where FP shuffles can only run on port 5. This is why I'm considering not shuffling everything together for 1 store per struct.

SIMD double->float conversion costs 2 uops: shuffle + FP-math, because float is half the width and the instruction packs the floats into the bottom of the vector register.

AVX is useful here to convert 4 doubles into a SIMD vector of 4 floats.

Other than that, I agree with @harold that 128-bit vectors are probably a good bet. Even AVX2 doesn't have very good 2-input lane-crossing shuffles, and AVX1 is very limited. So we can use 256-bit -> 128-bit double->float conversion to feed an interleave strategy based on __m128.

vmovhps [mem], xmm doesn't cost a shuffle uop on Intel CPUs, just a pure store, so shuffling together 2 vectors and getting [ B1 A1 B0 A0 ] into a single vector sets us up for two 64-bit stores of the low and high halves without any extra shuffling.

OTOH, @harold's version might still be better. 4 shuffles per 2 structs might be better than 4 stores per 2 structs, since the stores will sometimes need to replay for cache line splits, but shuffles don't. But with the overlapping stores trick, 3.5 or 3 stores per 2 structs looks doable.

Or here's another idea that uses some of the above, but does some blending to save stores

I basically came up with this while editing @harold's code to implement the idea I wrote about in the text above. Using a blend here is a good way to reduce pressure on store and shuffle ports.

Some of those ideas above are still worth exploring, especially doing a wide store of set1(offset) and then overlapping it with 64-bit vmovlps stores. (After unrolling by 3x2 = 6 or 3x4 = 12 output structs, to make it a multiple of the 4 doubles we convert at once.) 12 * 12 = 144 bytes, which is a multiple of 16 but not 32 or 64, so we could at least know where we are relative to a 16-byte boundary at all times, but not to cache lines unless we unroll even more. (Potentially leaving more work that needs cleanup, and bloating code-size.)

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

struct f2u { 
  float O1, O2;
  unsigned int Offset;
};

// names with a leading _ at file scope are reserved for the implementation.
// fixed that portability problem for you.
static const unsigned uiDefaultOffset = 123;


// only requires AVX1
// ideally pA and pB should be 32-byte aligned.
// probably also dst 16-byte aligned is good.
void cvt_interleave_avx(f2u *__restrict dst, double *__restrict pA, double *__restrict pB, ptrdiff_t len)
{
    __m128 voffset = _mm_castsi128_ps(_mm_set1_epi32(uiDefaultOffset));

    // 48 bytes per iteration: 3x16 = 4x12
    ptrdiff_t i;
    for (i = 0; i < len - 3; i += 4)
    {
        // read inputs and convert to float
        __m256d inA = _mm256_loadu_pd(&pA[i]);
        __m256d inB = _mm256_loadu_pd(&pB[i]);
        __m128 inAf = _mm256_cvtpd_ps(inA);    // A3 A2 A1 A0
        __m128 inBf = _mm256_cvtpd_ps(inB);    // B3 B2 B1 B0

        // interleave to get XY pairs
        __m128 lo = _mm_unpacklo_ps(inAf, inBf); // B1 A1 B0 A0
        __m128 hi = _mm_unpackhi_ps(inAf, inBf); // B3 A3 B2 A2

        // blend integer into place
        __m128 out0 = _mm_blend_ps(lo, voffset, 1<<2);  // x OF B0 A0
        __m128 out2 = _mm_blend_ps(hi, voffset, 1<<2);  // x OF B2 A2

        // TODO: _mm_alignr_epi8 to create OF OF B1 A1 spending 1 more shuffle to save a store.

        // store results
        _mm_storeu_ps(&dst[i + 0].O1, out0);  // 16 bytes with blended integer
        _mm_storeh_pi((__m64*)&dst[i + 1].O1, lo);    // 8 bytes from top half of reg, partial overlap
        dst[i + 1].Offset = uiDefaultOffset;

        _mm_storeu_ps(&dst[i + 2].O1, out2);  // 16 bytes with blended integer
        _mm_storeh_pi((__m64*)&dst[i + 3].O1, hi);    // 8 bytes from top half of reg, partial overlap
        dst[i + 3].Offset = uiDefaultOffset;
    }

    // scalar cleanup for  if _iNum is not even
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}

gcc9.1 -O3 -march=skylake on Godbolt compiles the main loop to 19 fused-domain uops for the front-end. (Neither vcvtpd2ps instructions could micro-fuse because GCC didn't do anything clever like addressing pB relative to pA to avoid an indexed addressing mode for one of them. So they're each 3 uops: load + convert + shuffle)

But it does bottleneck on stores in the back-end anyway, even if it takes a full 5 cycles per iteration to issue from the 4-wide front-end.

With 6 stores (per 4 structs) per iteration, that will bottleneck it to at best 1 iteration per 6 cycles, bottlenecked on the store-data port/execution unit. (Until Ice Lake which can do 2 stores per clock.) So this achieves 1 struct per 1.5 cycles in the theoretical best case, same as I was estimating for the overlapping-store idea before.

(We already know that cache-line split stores are going to need to be replayed, costing throughput, so we know this won't quite manage 1.5 cycles per struct even with no cache misses. But it's probably still better than Harold's bottleneck of 4 cycles per 2 structs = 2 cycles per struct. That speed should actually be achievable though, because it bottlenecks on shuffles which don't need to get replayed on cache-line splits.)

I expect throughput on Ryzen will be similar, bottlenecked on store throughput. We're using 128-bit vectors mostly, and Ryzen has better shuffle throughput than Intel. On SnB-family, there are 4 shuffle uops in the loop.

If I could shuffle differently so I could get two contiguous structs as the high half of the pair of vectors, that would open up the possibility of combining the 2 scalar assignments into one _mm_storeu_si128 that I overlap with two _mm_storeh_pi (movhps) 64-bit stores. (Still doing two blends for the other two output structs.) That would get it down to 5 stores total.

But shufps has restrictions on where it takes source data from, so you can't use it to emulate unpcklps or interleave differently.

Probably it would be worth using palignr for the B1 A1 struct, spending an extra shuffle uop to save a store.

I haven't benchmarked this or calculated how often the unaligned stores will cross a cache line boundary (and thus cost throughput).

AVX512

If we had AVX512, we'd have 2-input lane-crossing shuffles that could let us build up vectors of float+int data more efficiently, with fewer shuffle and store instructions per struct. (We could use vpermt2ps with merge-masking into set1(integer) to interleave 2 vectors of conversion results along with integers in the right places.)

来源：https://stackoverflow.com/questions/57013636/fast-interleave-2-double-arrays-into-an-array-of-structs-with-2-float-and-1-int

标签

c++