Question
This question already has an answer here: How to sum __m256 horizontally?
I have two arrays of floats and I would like to calculate their dot product, using SSE and AVX, with the lowest latency possible. I am aware there is a 256-bit dot-product intrinsic for floats, but I have read on SO that it is slower than the technique below (https://stackoverflow.com/a/4121295/997112).
I have done most of the work: the vector temp_sum contains all the partial sums; I just need to sum the eight 32-bit floats contained within temp_sum at the end.
#include "xmmintrin.h"
#include "immintrin.h"
int main(){
const int num_elements_in_array = 16;
__declspec(align(32)) float x[num_elements_in_array];
__declspec(align(32)) float y[num_elements_in_array];
x[0] = 2; x[1] = 2; x[2] = 2; x[3] = 2;
x[4] = 2; x[5] = 2; x[6] = 2; x[7] = 2;
x[8] = 2; x[9] = 2; x[10] = 2; x[11] = 2;
x[12] = 2; x[13] = 2; x[14] = 2; x[15] = 2;
y[0] = 3; y[1] = 3; y[2] = 3; y[3] = 3;
y[4] = 3; y[5] = 3; y[6] = 3; y[7] = 3;
y[8] = 3; y[9] = 3; y[10] = 3; y[11] = 3;
y[12] = 3; y[13] = 3; y[14] = 3; y[15] = 3;
__m256 a;
__m256 b;
__m256 temp_products;
__m256 temp_sum = _mm256_setzero_ps();
unsigned short j = 0;
const int sse_data_size = 32;
int num_values_to_process = sse_data_size/sizeof(float);
while(j < num_elements_in_array){
a = _mm256_load_ps(x+j);
b = _mm256_load_ps(y+j);
temp_products = _mm256_mul_ps(b, a);
temp_sum = _mm256_add_ps(temp_sum, temp_products);
j = j + num_values_to_process;
}
//Need to "process" temp_sum as a final value here
}
I am worried that the 256-bit intrinsics I require are not available in AVX1.
Answer 1:
I would suggest using 128-bit AVX instructions whenever possible. It avoids one cross-domain shuffle (2 cycles of latency on Intel Sandy/Ivy Bridge) and improves efficiency on CPUs which run AVX instructions on 128-bit execution units (currently AMD Bulldozer, Piledriver, Steamroller, and Jaguar):
static inline float _mm256_reduce_add_ps(__m256 x) {
    /* ( x3+x7, x2+x6, x1+x5, x0+x4 ) */
    const __m128 x128 = _mm_add_ps(_mm256_extractf128_ps(x, 1), _mm256_castps256_ps128(x));
    /* ( -, -, x1+x3+x5+x7, x0+x2+x4+x6 ) */
    const __m128 x64 = _mm_add_ps(x128, _mm_movehl_ps(x128, x128));
    /* ( -, -, -, x0+x1+x2+x3+x4+x5+x6+x7 ) */
    const __m128 x32 = _mm_add_ss(x64, _mm_shuffle_ps(x64, x64, 0x55));
    /* Conversion to float is a no-op on x86-64 */
    return _mm_cvtss_f32(x32);
}
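For illustration only, here is a minimal sketch (not part of the original answer) of how this reduction could finish the asker's loop. The function name dot_product_avx is hypothetical, and it assumes n is a multiple of 8 and that x and y are 32-byte aligned (required by _mm256_load_ps):

static float dot_product_avx(const float *x, const float *y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        /* multiply 8 pairs and accumulate into the running vector sum */
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_load_ps(x + i),
                                               _mm256_load_ps(y + i)));
    }
    /* collapse the 8 partial sums into a single scalar */
    return _mm256_reduce_add_ps(acc);
}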
Answer 2:
You can emulate a full horizontal add with AVX (i.e. a proper 256-bit version of _mm256_hadd_ps) like this:
#define _mm256_full_hadd_ps(v0, v1) \
_mm256_hadd_ps(_mm256_permute2f128_ps(v0, v1, 0x20), \
_mm256_permute2f128_ps(v0, v1, 0x31))
If you're just working with one input vector then you may be able to simplify this a little.
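As a rough, untested illustration (assuming the asker's temp_sum vector from the question), applying the macro three times leaves the full sum broadcast in every element, which can then be extracted:

__m256 t = _mm256_full_hadd_ps(temp_sum, temp_sum); /* adjacent pairs      */
t = _mm256_full_hadd_ps(t, t);                      /* sums of four        */
t = _mm256_full_hadd_ps(t, t);                      /* total in every lane */
float dot = _mm_cvtss_f32(_mm256_castps256_ps128(t));

Note that this uses more shuffles than the 128-bit reduction in Answer 1, so it is mainly useful when you need the hadd semantics across two input vectors.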
Source: https://stackoverflow.com/questions/23189488/horizontal-sum-of-32-bit-floats-in-256-bit-avx-vector