vector-processing

Fastest way to do horizontal vector sum with AVX instructions [duplicate]

半世苍凉 posted on 2019-12-17 10:58:56
Question: This question already has answers here: Get sum of values stored in __m256d with SSE/AVX (2 answers). Closed 11 months ago. I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following: v_sum = _mm_hadd_ps(v_sum, v_sum); v_sum = _mm_hadd_ps(v_sum, v_sum); Unfortunately, even though AVX features a _mm256_hadd_pd instruction, its result differs from that of the SSE version. I
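For context on the excerpt above: _mm256_hadd_pd adds adjacent pairs within each 128-bit lane separately, so applying it twice does not reduce across lanes the way the SSE idiom does. A minimal sketch of one common alternative follows: add the upper 128-bit half to the lower half, then add the two remaining elements. The function name hsum4_pd and the pointer-based interface are assumptions made for illustration, and the target attribute assumes GCC or Clang; this is not taken from the linked answers.

```c
#include <immintrin.h>

/* Hypothetical helper: sums four doubles loaded from p (no alignment needed).
   The target attribute (GCC/Clang only) enables AVX for this one function,
   so the file compiles without -mavx; the CPU must still support AVX. */
__attribute__((target("avx")))
double hsum4_pd(const double *p)
{
    __m256d v    = _mm256_loadu_pd(p);
    __m128d lo   = _mm256_castpd256_pd128(v);    /* elements 0 and 1 */
    __m128d hi   = _mm256_extractf128_pd(v, 1);  /* elements 2 and 3 */
    __m128d pair = _mm_add_pd(lo, hi);           /* {e0 + e2, e1 + e3} */
    __m128d high = _mm_unpackhi_pd(pair, pair);  /* copy the upper sum down */
    return _mm_cvtsd_f64(_mm_add_sd(pair, high));
}
```

Calling hsum4_pd on an array holding 1.0, 2.0, 3.0 and 4.5 should return 10.5.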

What compilers besides gcc can vectorize code?

这一生的挚爱 posted on 2019-12-12 12:13:47
Question: GCC can vectorize loops automatically when certain options are specified and the right conditions are met. Are there other widely available compilers that can do the same?
Answer 1: ICC
Answer 2: LLVM can also do it, as can Vector Pascal, and one that is not free: VectorC. These are just some I remember.
Answer 3: Also PGI's compilers.
Answer 4: The Mono project, the open-source implementation of Microsoft's .NET Framework, has added objects that use SIMD instructions. While not a compiler, the Mono CLR is the

How to find the horizontal maximum in a 256-bit AVX vector

只愿长相守 posted on 2019-11-29 09:19:26
I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain. At some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved wrong on this last statement. So the ideal solution will: 1) use only AVX instructions. 2) minimize
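One possible approach to the horizontal maximum described above, as a minimal sketch: swap the two 128-bit halves and take the element-wise maximum, then swap the elements within each lane and take the maximum again, so every element ends up holding the overall maximum. The name hmax4_pd and the pointer-based interface are assumptions for illustration, and the target attribute assumes GCC or Clang. The final cast to __m128d exists only to satisfy the type system and generates no instruction.

```c
#include <immintrin.h>

/* Hypothetical helper: returns the largest of four doubles loaded from p.
   The target attribute (GCC/Clang only) enables AVX for this one function;
   the CPU must still support AVX at run time. */
__attribute__((target("avx")))
double hmax4_pd(const double *p)
{
    __m256d v    = _mm256_loadu_pd(p);
    __m256d swap = _mm256_permute2f128_pd(v, v, 1); /* swap 128-bit halves */
    __m256d m1   = _mm256_max_pd(v, swap);          /* {max(e0,e2), max(e1,e3)} in both halves */
    __m256d m2   = _mm256_permute_pd(m1, 5);        /* swap the two elements in each lane */
    __m256d m    = _mm256_max_pd(m1, m2);           /* every element is now the maximum */
    return _mm_cvtsd_f64(_mm256_castpd256_pd128(m));
}
```

This uses two shuffles and two max operations; whether that counts as minimal for a given microarchitecture is not something this sketch tries to settle.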

How to vectorize with gcc?

流过昼夜 posted on 2019-11-28 04:25:03
The v4 series of the gcc compiler can automatically vectorize loops using the SIMD units on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done? The original page offers details on getting gcc to automatically vectorize loops, including a few examples: http://gcc.gnu.org/projects/tree-ssa/vectorization.html While the examples are great, it turns out the syntax for calling those options with the latest GCC seems to have changed a bit, see now: https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info In summary, the following
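To make the discussion concrete, here is a small loop of the kind GCC's auto-vectorizer can typically handle. This is an illustrative sketch and not the text of the truncated answer: the function name saxpy and the compile command in the comment are assumptions. The flags -O3 and -fopt-info-vec are documented GCC options, but which loops get vectorized depends on the GCC version and the target.

```c
#include <stddef.h>

/* Hypothetical example: y[i] += a * x[i] for i in [0, n).
   The restrict qualifiers promise that x and y do not overlap, which
   removes an aliasing concern that could otherwise hinder vectorization.

   One way to ask GCC to report vectorized loops (assumed invocation):
       gcc -O3 -fopt-info-vec -c saxpy.c                                  */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

With -fopt-info-vec-missed instead, GCC reports loops it could not vectorize, which can help when a loop does not appear in the first report.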

How to find the horizontal maximum in a 256-bit AVX vector

非 Y 不嫁゛ posted on 2019-11-27 06:46:26
Question: I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain. At some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved

How to vectorize with gcc?

社会主义新天地 posted on 2019-11-27 00:23:38
Question: The v4 series of the gcc compiler can automatically vectorize loops using the SIMD units on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?
Answer 1: The original page offers details on getting gcc to automatically vectorize loops, including a few examples: http://gcc.gnu.org/projects/tree-ssa/vectorization.html While the examples are great, it turns out the syntax for calling those options with the latest GCC seems to have changed a bit, see now: https:/