SIMD optimization of a curve computed from the second derivative


Question


This question is really a curiosity.

I was converting a routine into SIMD instructions (and I am quite new to SIMD programming), and had trouble with the following bit of code:

// args:
uint32_t phase_current;
uint32_t phase_increment;
uint32_t phase_increment_step;

for (int i = 0; i < blockSize; ++i)
{
    USEFUL_FUNC(phase_current);
    phase_increment += phase_increment_step;
    phase_current += phase_increment;
}

The Question: Assuming that USEFUL_FUNC has a SIMD implementation and I am simply trying to compute a correct vector of phase_current for processing, what is the right way of dealing with the phase_current being dependent on its previous value?

In turn, a functional-programming fold-like implementation would be similarly useful, since I'm trying to understand how to hoist out a data dependency more than I'm trying to optimize for the sake of optimizing.

Lastly, if you can recommend some literature, please do. Not sure how to Google for this topic.


Answer 1:


So you're just looking for a way to generate vectors of 4 phase_current values, which you can pass as an arg to an arbitrary function.

TL:DR: set up the initial vectors for increment and step so each vector element strides through the sequence by 4, giving you vectors of phase_current[i+0..i+3] with still only two vector ADD operations (vertical, not horizontal). This serial dependency is one you can factor out with algebra / math.


This is a bit like a prefix sum (which you can do in SIMD with log2(vector_width) shuffle+add steps for vectors of vector_width elements). You can also parallelize prefix sums across multiple threads using a two-step calculation where each thread prefix-sums a region of the array, then the results are combined and each thread offsets its region of the destination array by a constant (the running total at the first element of its region). See the linked question for the multi-threaded version, too.
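
For reference, here is a minimal sketch (my own helper, not part of this answer's final approach) of that in-register prefix sum for one __m128i of 4 x int32, using SSE2 byte shifts plus adds:

#include <emmintrin.h>   // SSE2

static inline __m128i prefix_sum_epi32(__m128i x)    // x = [a, b, c, d]
{
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));       // [a, a+b, b+c, c+d]
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));       // [a, a+b, a+b+c, a+b+c+d]
    return x;
}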


But you have the huge simplification that phase_increment_step (the 2nd derivative of the value you want) is constant. I'm assuming that USEFUL_FUNC(phase_current) takes its arg by value, not by non-const reference, so the only modification to phase_current is the += in the loop, and that USEFUL_FUNC can't somehow mutate the increment or increment_step.

One option to implement this is just to run the scalar algorithm independently in 4 separate elements of SIMD vectors, offset by 1 iteration each time. With integer adds, especially on Intel CPUs where vector-integer add latency is only 1 cycle, running 4 iterations of the running-total is cheap, and we could just do that between calls to USEFUL_FUNC. That would be a way to generate vector inputs to USEFUL_FUNC doing exactly as much work as scalar code (assuming SIMD integer add is as cheap as scalar integer add, which is mostly true if we're limited by the data dependency to 2 adds per clock).
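
Roughly what that could look like (a sketch with my own naming; it assumes blockSize is a multiple of 4, and the lane initializers come from the same tables worked out further down):

#include <stdint.h>
#include <emmintrin.h>

void USEFUL_FUNC(__m128i);

void four_lanes_lockstep(uint32_t p0, uint32_t inc0, uint32_t S, unsigned blockSize)
{
    // lane j holds the scalar sequence offset by j iterations
    __m128i phase = _mm_setr_epi32(p0,
                                   p0 + 1*inc0 + 1*S,
                                   p0 + 2*inc0 + 3*S,
                                   p0 + 3*inc0 + 6*S);
    // each lane's current increment (lane j has already done j scalar iterations)
    __m128i pincr = _mm_setr_epi32(inc0 + 0*S, inc0 + 1*S, inc0 + 2*S, inc0 + 3*S);
    const __m128i step = _mm_set1_epi32(S);

    for (unsigned i = 0; i < blockSize; i += 4) {
        USEFUL_FUNC(phase);
        for (int j = 0; j < 4; ++j) {                // advance every lane by 4 scalar steps
            pincr = _mm_add_epi32(pincr, step);
            phase = _mm_add_epi32(phase, pincr);
        }
    }
}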

The above method is somewhat more general and could be useful for variations on this problem where there's a true serial dependency that we can't eliminate cheaply with simple math.


If we're clever, we can do even better than a prefix sum or brute-force running 4 sequences one step at a time. Ideally we can derive a closed-form way to step by 4 in the sequence of values (or whatever the SIMD vector width is, times whatever unroll factor you want for multiple accumulators for USEFUL_FUNC).

Summing a sequence of step, step*2, step*3, ... will give us a constant times Gauss's closed-form formula for the sum of integers up to n: sum(1..n) = n*(n+1)/2. This sequence goes 0, 1, 3, 6, 10, 15, 21, 28, ... (https://oeis.org/A000217). (I've factored out the initial phase_increment).

The trick is going by 4 in this sequence. (n+4)*(n+5)/2 - n*(n+1)/2 simplifies down to 4*n + 10. Taking the derivative of that again, we get 4. But to go 4 steps in the 2nd integral, we have 4*4 = 16. So we can maintain a vector phase_increment which we increment with a SIMD add with a vector of 16*phase_increment_step.
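
Spelling out that algebra:

$$\frac{(n+4)(n+5)}{2}-\frac{n(n+1)}{2}=\frac{(n^2+9n+20)-(n^2+n)}{2}=\frac{8n+20}{2}=4n+10$$

and taking the stride-4 difference of that first difference gives (4(n+4)+10) - (4n+10) = 16, which is the constant 16*phase_increment_step vector used below.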

I'm not totally sure I have the step-counting reasoning right (the extra factor of 4 to give 16 is a bit surprising). Working out the right formulas and taking first and second differences in the sequence of vectors makes it very clear how this works out:

 // design notes, working through the first couple vectors
 // to prove this works correctly.

S = increment_step (constant)
inc0 = increment initial value
p0 = phase_current initial value

// first 8 step-increases:
[ 0*S,  1*S,   2*S,  3*S ]
[ 4*S,  5*S,   6*S,  7*S ]

// first vector of 4 values:
[ p0,  p0+(inc0+S),  p0+(inc0+S)+(inc0+2*S),  p0+(inc0+S)+(inc0+2*S)+(inc0+3*S) ]
[ p0,  p0+inc0+S,  p0+2*inc0+3*S,  p0+3*inc0+6*S ]  // simplified

// next 4 values:
[ p0+4*inc0+10*S,  p0+5*inc0+15*S,  p0+6*inc0+21*S,  p0+7*inc0+28*S ]

Using this and the earlier 4*n + 10 formula:

// first 4 vectors of phase_current
[ p0,              p0+1*inc0+ 1*S,  p0+2*inc0+3*S,   p0+ 3*inc0+ 6*S ]
[ p0+4*inc0+10*S,  p0+5*inc0+15*S,  p0+6*inc0+21*S,  p0+ 7*inc0+28*S ]
[ p0+8*inc0+36*S,  p0+9*inc0+45*S,  p0+10*inc0+55*S, p0+11*inc0+66*S ]
[ p0+12*inc0+78*S,  p0+13*inc0+91*S,  p0+14*inc0+105*S, p0+15*inc0+120*S ]

 first 3 vectors of phase_increment (subtract consecutive phase_current vectors):
[ 4*inc0+10*S,     4*inc0 + 14*S,   4*inc0 + 18*S,   4*inc0 + 22*S  ]
[ 4*inc0+26*S,     4*inc0 + 30*S,   4*inc0 + 34*S,   4*inc0 + 38*S  ]
[ 4*inc0+42*S,     4*inc0 + 46*S,   4*inc0 + 50*S,   4*inc0 + 54*S  ]

 first 2 vectors of phase_increment_step:
[        16*S,              16*S,            16*S,            16*S  ]
[        16*S,              16*S,            16*S,            16*S  ]
Yes, as expected, a constant vector works for phase_increment_step

So we can write code like this with Intel's SSE/AVX intrinsics:

#include <stdint.h>
#include <immintrin.h>

void USEFUL_FUNC(__m128i);

// TODO: more efficient generation of initial vector values
void double_integral(uint32_t phase_start, uint32_t phase_increment_start, uint32_t phase_increment_step, unsigned blockSize)
{

    __m128i pstep1 = _mm_set1_epi32(phase_increment_step);

    // each vector element steps by 4
    uint32_t inc0=phase_increment_start, S=phase_increment_step;
    __m128i pincr  = _mm_setr_epi32(4*inc0 + 10*S,  4*inc0 + 14*S,   4*inc0 + 18*S,   4*inc0 + 22*S);

    __m128i phase = _mm_setr_epi32(phase_start,  phase_start+1*inc0+ 1*S,  phase_start+2*inc0+3*S,   phase_start + 3*inc0+ 6*S );
     // alternative: start from _mm_set1_epi32(phase_start) and add per-lane offsets,
     // e.g. via a shuffle-based prefix-sum initializer, or SSE4.1 pmullo by a vector constant?

    __m128i pstep_stride = _mm_slli_epi32(pstep1, 4);  // stride by pstep * 16
    for (unsigned i = 0; i < blockSize; ++i)  {      // blockSize counted in 4-element vectors
        USEFUL_FUNC(phase);
        phase = _mm_add_epi32(phase, pincr);         // apply the current increment vector first...
        pincr = _mm_add_epi32(pincr, pstep_stride);  // ...then step it by 16*S for the next vector
    }
}
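
Following up on the TODO above: one possible way (my own sketch, not from the original answer) to build the first phase vector without writing every lane out by hand is SSE4.1 pmulld (_mm_mullo_epi32) against small constant multiplier vectors; the same trick works for the initial pincr vector:

// assumes SSE4.1 for _mm_mullo_epi32; immintrin.h is already included above
static inline __m128i init_phase_vec(uint32_t p0, uint32_t inc0, uint32_t S)
{
    const __m128i lane = _mm_setr_epi32(0, 1, 2, 3);   // n: lane index
    const __m128i tri  = _mm_setr_epi32(0, 1, 3, 6);   // n*(n+1)/2: triangular numbers
    __m128i v = _mm_set1_epi32(p0);
    v = _mm_add_epi32(v, _mm_mullo_epi32(lane, _mm_set1_epi32(inc0)));
    v = _mm_add_epi32(v, _mm_mullo_epi32(tri,  _mm_set1_epi32(S)));
    return v;   // [ p0, p0+inc0+S, p0+2*inc0+3*S, p0+3*inc0+6*S ]
}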

Further reading: for more about SIMD in general, but mostly x86 SSE/AVX, see https://stackoverflow.com/tags/sse/info, especially slides from SIMD at Insomniac Games (GDC 2015) which have some good stuff about how to think about SIMD in general, and how to lay out your data so you can use it.




Answer 2:


The only thing I could think of is a horizontal add. Imagine you have a __m128i vector with the contents {pc, 0, pi, pis}. Then a first HADD (pairing the vector with itself) will turn it into {pc + 0, pi + pis, ...}, and a second HADD will collapse that to pc + pi + pis.
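
To make that concrete, a minimal illustration of those two HADD steps (the helper name is mine; the intrinsic is SSSE3 _mm_hadd_epi32):

#include <stdint.h>
#include <tmmintrin.h>   // SSSE3 _mm_hadd_epi32

static inline uint32_t hadd_collapse(uint32_t pc, uint32_t pi, uint32_t pis)
{
    __m128i v  = _mm_setr_epi32((int)pc, 0, (int)pi, (int)pis);
    __m128i h1 = _mm_hadd_epi32(v, v);     // { pc+0, pi+pis, pc+0, pi+pis }
    __m128i h2 = _mm_hadd_epi32(h1, h1);   // every lane = pc + pi + pis
    return (uint32_t)_mm_cvtsi128_si32(h2);
}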

HADD operates on two __m128i at once, so some speed-up is possible.

But interleaving instructions so that the pipeline stays full won't be a trivial exercise. Link to HADD: https://msdn.microsoft.com/en-us/library/bb531452(v=vs.120).aspx

Let me add a link to a very useful discussion of HADD for floats; a lot of the code and conclusions can be applied directly to integer HADD: Fastest way to do horizontal float vector sum on x86



Source: https://stackoverflow.com/questions/47983660/simd-optimization-of-a-curve-computed-from-the-second-derivative
