How to use AVX/SIMD with nested loops and += format?

问题

I am writing a page rank program. I am writing a method for updating the rankings. I have successful got it working with nested for loops and also a threaded version. However I would like to instead use SIMD/AVX.

This is the code I would like to change into a SIMD/AVX implementation.

#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}

For this code P[] is of size npages and matrix_cap[] is of size npages * npages. P[] is the ranks of the pages and temp[] is used to store the next iterations page ranks so as to be able to check convergence.

I don't know how to interpret += with AVX and how I would get my data which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row major order) into a format of which could be used with SIMD/AVX operations.

As far as AVX this is what I have so far though it's very very incorrect and was just a stab at what I would roughly like to do.

ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
    for (size_t j = 0; j < mod; j += 4) {
        __m256d p = _mm256_loadu_pd(P + j);
        __m256d m = _mm256_loadu_pd(matrix_hat + i + j);
        __m256d pm = _mm256_mul_pd(p, m);
        _mm256_storeu_pd(&res + j, pm);
        for (size_t k = 0; k < 4; k++) {
            sum += res[j + k];
        }
    }
    for (size_t i = mod; i < npages; i++) {
        for (size_t j = 0; j < npages; j++) {
            sum += P[j] * matrix_cap[IDX(i,j)];
        }
    }
    temp[i] = sum;
    sum = 0.0;
}

How to can I format my data so I can use AVX/SIMD operations (add,mul) on it to optimise it as it will be called a lot.

回答1:

First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.

You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.

(see the bottom for a suggestion on a different data arrangement for the matrix)

Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.

So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)

It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.

FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).

However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.

So you should do something like

#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < (npages & (~7ULL)); i+=8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j+=4) {
        __m256d Pj = _mm256_loadu_pd(P+j);        // reused 8 times after loading
        //temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores.
    // taking advantage of the power of vhaddpd to be doing 4 useful hsums with each instructions.
    temp[i+0] = hsum_pd256(s0);   // See the horizontal-sum link earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    //...
    temp[i+7] = hsum_pd256(s7);

    // if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.

You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.

How to can I format my data so I can use AVX/SIMD operations (add,mul) on it to optimise it as it will be called a lot.

I missed this part of the question earlier. First of all, obviously float will and give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 less memory / cache footprint might give more speedup than that if cache hit rate increases.

Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.

P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vec of 8 is for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (still only take a load-port uop), and plenty cheap for this on SnB/IvB where they also take a shuffle-port uop.

回答2:

Consider using OpenMP4.x #pragma omp simd reduction for innermost loop. Take in mind that omp reductions are not applicable to C++ arrays, therefore you have to use temporary reduction variable like shown below.

#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0; // was: // temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}

For x86 platforms, OpenMP4.x is currently supported by fresh GCC (4.9+) and Intel Compilers. Some LLVM and PGI compilers may also support it.

P.S. Auto-vectorization ("auto" means vectorization by compiler without any pragmas, i.e. without explicit gudiance from developers) may sometimes work for some compiler variants (although it's very unlikely due to array element as reduction variable). However it is strictly speaking incorrect to auto-vectorize this code. You have to use explicit SIMD pragma to "resolve" reduction dependency and (as a good side-effect) disambiguate pointers (in case arrays are accessed via pointer).

来源：https://stackoverflow.com/questions/37524711/how-to-use-avx-simd-with-nested-loops-and-format

标签

multithreading

simd

avx

pagerank