sse

C++ use SSE instructions for comparing huge vectors of ints

余生颓废 posted on 2019-11-29 12:59:07
Question: I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them with this function: int getDiff(int indx1, int indx2) { int result = 0; int pplus, pminus, tmp; for (int k = 0; k < 128; k += 2) { pplus = nodeL[indx2][k] - nodeL[indx1][k]; pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1]; tmp = max(pplus, pminus); if (tmp > result) { result = tmp; } } return result; } As you can see, the function loops through the two row vectors, does some ...
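The loop in this question can be sketched with SSE2 as follows. This is a minimal illustration, not the answer from the original thread: it assumes each row is available as a contiguous `const int*` of length 128, that the differences do not overflow, and the names `getDiffScalar`, `getDiffSSE2`, and `max_epi32_sse2` are hypothetical. Because SSE2 lacks a packed signed 32-bit max (SSE4.1 adds `_mm_max_epi32`), the max is built from a compare and a blend.

```cpp
#include <algorithm>
#include <cstddef>
#include <emmintrin.h>  // SSE2

// Scalar reference, following the loop in the question
// (even k: b[k] - a[k]; odd k: a[k] - b[k]).
int getDiffScalar(const int* a, const int* b, std::size_t n) {
    int result = 0;
    for (std::size_t k = 0; k < n; k += 2) {
        int pplus  = b[k] - a[k];
        int pminus = a[k + 1] - b[k + 1];
        result = std::max(result, std::max(pplus, pminus));
    }
    return result;
}

// Packed signed 32-bit max built from compare + blend (SSE2 only).
static inline __m128i max_epi32_sse2(__m128i x, __m128i y) {
    __m128i gt = _mm_cmpgt_epi32(x, y);  // all-ones in lanes where x > y
    return _mm_or_si128(_mm_and_si128(gt, x), _mm_andnot_si128(gt, y));
}

// n must be a multiple of 4 (it is 128 in the question).
int getDiffSSE2(const int* a, const int* b, std::size_t n) {
    // Lanes 1 and 3 hold odd k, where the sign of the difference flips.
    const __m128i oddMask = _mm_set_epi32(-1, 0, -1, 0);
    __m128i best = _mm_setzero_si128();  // the result starts at 0
    for (std::size_t k = 0; k < n; k += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + k));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + k));
        __m128i d  = _mm_sub_epi32(vb, va);  // b[k] - a[k] in every lane
        // Two's-complement negation of the odd lanes only: (d ^ m) - m
        d = _mm_sub_epi32(_mm_xor_si128(d, oddMask), oddMask);
        best = max_epi32_sse2(best, d);
    }
    // Horizontal max across the four lanes.
    best = max_epi32_sse2(best, _mm_shuffle_epi32(best, _MM_SHUFFLE(1, 0, 3, 2)));
    best = max_epi32_sse2(best, _mm_shuffle_epi32(best, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(best);
}
```

With rows stored in a `vector<vector<int>>`, the pointers would come from `row.data()`; whether this beats the compiler's own auto-vectorized loop has to be measured.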

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

夙愿已清 posted on 2019-11-29 12:56:58
I am somewhat confused by the MOVSD assembly instruction. I wrote some numerical code that performs matrix multiplication, using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics when compiling. But when I check the assembler output, I see that: 1) the 128-bit XMM vector registers are used; 2) the SSE2 instruction MOVSD is invoked. I understand that MOVSD essentially operates on a single double-precision floating-point value. It uses only the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things: 1) I never ...
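A small sketch may help separate the two behaviours mentioned here. On x86-64, SSE2 is part of the baseline and `double` arguments are passed in XMM registers, so plain scalar code typically compiles to scalar SSE2 instructions with no intrinsics involved. The intrinsics below mirror the two forms of MOVSD: a load from memory zeroes the upper 64 bits, while a register-to-register move keeps the destination's upper half. The function names are hypothetical, and the exact instructions a compiler emits for them may differ.

```cpp
#include <emmintrin.h>  // SSE2

// Plain scalar code: on x86-64, compilers typically keep `double` values in
// XMM registers and emit scalar SSE2 instructions (movsd, mulsd, addsd) here.
double multiplyAdd(double a, double x, double y) {
    return a * x + y;
}

// Load one double from memory: _mm_load_sd zeroes the upper 64 bits.
void loadScalar(const double* p, double out[2]) {
    _mm_storeu_pd(out, _mm_load_sd(p));
}

// Register-to-register move of the low double: the upper half of `dst` survives.
void moveScalar(const double dst[2], const double src[2], double out[2]) {
    __m128d d = _mm_loadu_pd(dst);
    __m128d s = _mm_loadu_pd(src);
    _mm_storeu_pd(out, _mm_move_sd(d, s));  // low element from s, high from d
}
```

Compiling `multiplyAdd` with `-S` and reading the output is a quick way to see the scalar SSE2 instructions appear without any intrinsics header.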

Where is VPERMB in AVX2?

℡╲_俬逩灬. posted on 2019-11-29 12:41:06
Question: AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to totally arbitrarily broadcast/shuffle/permute from one 256-bit vector of 32-bit values into another, with the permutation selectable at runtime. Functionally, that obsoletes a whole slew of existing old unpack, broadcast, permute, shuffle and shift instructions. Cool beans. So where is VPERMB? I.e., the same ...

How to enable sse3 autovectorization in gcc

大兔子大兔子 posted on 2019-11-29 11:41:06
I have a simple loop that takes the product of n complex numbers. As I perform this loop millions of times, I want it to be as fast as possible. I understand that it's possible to do this quickly using SSE3 and gcc intrinsics, but I am interested in whether it is possible to get gcc to auto-vectorize the code. Here is some sample code: #include <complex.h> complex float f(complex float x[], int n ) { complex float p = 1.0; for (int i = 0; i < n; i++) p *= x[i]; return p; } The assembly you get from gcc -S -O3 -ffast-math is: .file "test.c" .section .text.unlikely,"ax",@progbits .LCOLDB2: .text ...

SSE 4 instructions generated by Visual Studio 2013 Update 2 and Update 3

。_饼干妹妹 posted on 2019-11-29 11:27:24
Question: If I compile this code in VS 2013 Update 2 or Update 3 (what follows comes from Update 3): #include "stdafx.h" #include <iostream> #include <random> struct Buffer { long* data; int count; }; #ifndef max #define max(a,b) (((a) > (b)) ? (a) : (b)) #endif long Code(long* data, int count) { long nMaxY = data[0]; for (int nNode = 0; nNode < count; nNode++) { nMaxY = max(data[nNode], nMaxY); } return(nMaxY); } int _tmain(int argc, _TCHAR* argv[]) { #ifdef __AVX__ static_assert(false, "AVX should be ...

How to properly use prefetch instructions?

蹲街弑〆低调 posted on 2019-11-29 11:21:50
I am trying to vectorize a loop that computes the dot product of two large float vectors. I compute it in parallel, using the fact that the CPU has a large number of XMM registers, like this: __m128 *A, *B; __m128 dot0, dot1, dot2, dot3; dot0 = dot1 = dot2 = dot3 = _mm_set_ps1(0); for (size_t i = 0; i < 1048576; i += 4) { dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i+0], B[i+0])); dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i+1], B[i+1])); dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i+2], B[i+2])); dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i+3], B[i+3])); } ... // add dots, then shuffle/hadd result. I heard that using prefetch instructions could help ...
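One way the prefetch idea could look in code is sketched below, using `_mm_prefetch` with the `_MM_HINT_T0` hint. Several things here are assumptions rather than anything stated in the question: the function name `dotPrefetch`, the 64-byte cache line, and the prefetch distance `kAhead`, which is a placeholder to be tuned by measurement. For a purely sequential access pattern like this one, hardware prefetchers often do the job already, so an explicit prefetch may bring no gain.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_prefetch, _mm_loadu_ps, _mm_mul_ps, ...

// Dot product of two float arrays with four independent accumulators.
// n must be a multiple of 16 (16 floats = one 64-byte cache line per array).
float dotPrefetch(const float* a, const float* b, std::size_t n) {
    // Hypothetical prefetch distance in floats (16 cache lines ahead).
    const std::size_t kAhead = 256;

    __m128 dot0 = _mm_setzero_ps();
    __m128 dot1 = _mm_setzero_ps();
    __m128 dot2 = _mm_setzero_ps();
    __m128 dot3 = _mm_setzero_ps();

    for (std::size_t i = 0; i < n; i += 16) {
        // Request the line we will need kAhead floats from now, staying in bounds.
        if (i + kAhead < n) {
            _mm_prefetch(reinterpret_cast<const char*>(a + i + kAhead), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(b + i + kAhead), _MM_HINT_T0);
        }
        dot0 = _mm_add_ps(dot0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        dot1 = _mm_add_ps(dot1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        dot2 = _mm_add_ps(dot2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        dot3 = _mm_add_ps(dot3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }

    // Combine the accumulators, then sum the four lanes in scalar code.
    __m128 sum = _mm_add_ps(_mm_add_ps(dot0, dot1), _mm_add_ps(dot2, dot3));
    float lanes[4];
    _mm_storeu_ps(lanes, sum);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Issuing one prefetch per cache line, rather than one per vector load, keeps the extra instruction count low; benchmarking with and without the prefetch is the only reliable way to decide whether to keep it.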

Shifting 4 integers right by different values SIMD

生来就可爱ヽ(ⅴ<●) posted on 2019-11-29 11:02:48
SSE does not provide a way of shifting packed integers by a variable amount (I can use any instructions from AVX and older). You can only do uniform shifts. The result I'm trying to achieve for each integer in the vector is this: i[0] = i[0] & 0b111111; i[1] = (i[1]>>6) & 0b111111; i[2] = (i[2]>>12) & 0b111111; i[3] = (i[3]>>18) & 0b111111; Essentially, I am trying to isolate a different group of 6 bits in each integer. So what is the optimal solution? Things I thought about: You can simulate a variable right shift with a variable left shift and a uniform right shift. I thought about multiplying the ...
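The idea of replacing a variable right shift with a variable left shift plus one uniform right shift can be sketched like this. The per-lane left shift is written as a multiply by a power of two, so every wanted 6-bit field lands in bits 18..23 before a single `>> 18`. This is an illustration under assumptions, not the accepted answer: the helper names are hypothetical, and the 32-bit multiply is emulated with SSE2's `_mm_mul_epu32` so the sketch builds without extra flags (with SSE4.1, `_mm_mullo_epi32` would replace the helper).

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Lane-wise 32-bit multiply keeping the low 32 bits, built from _mm_mul_epu32,
// which multiplies lanes 0 and 2 into two 64-bit products.
static inline __m128i mullo32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                       // products of lanes 0, 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));       // products of lanes 1, 3
    // Move the low 32 bits of each product into lanes 0 and 1.
    __m128i evenLo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i oddLo  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(evenLo, oddLo);                 // p0, p1, p2, p3
}

// out[k] = (in[k] >> (6 * k)) & 0x3F for k = 0..3
void extractSixBitFields(const std::uint32_t in[4], std::uint32_t out[4]) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // Left shifts by 18, 12, 6, 0 for lanes 0..3, expressed as multipliers.
    const __m128i mult = _mm_set_epi32(1, 1 << 6, 1 << 12, 1 << 18);
    v = mullo32_sse2(v, mult);                   // wanted field now in bits 18..23
    v = _mm_srli_epi32(v, 18);                   // one uniform logical right shift
    v = _mm_and_si128(v, _mm_set1_epi32(0x3F));  // keep the 6-bit field
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Bits shifted out of the top by the multiply are discarded, which is harmless here because only bits 18..23 of each product are kept.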

Using ymm registers as a “memory-like” storage location

China☆狼群 posted on 2019-11-29 10:22:57
Consider the following loop in x86: ; on entry, rdi has the number of iterations .top: ; some magic happens here to calculate a result in rax mov [array + rdi * 8], rax ; store result in output array dec rdi jnz .top It's straightforward: something calculates a result in rax (not shown) and then we store the result into an array, in reverse order as we index with rdi. I would like to transform the above loop so that it does not make any writes to memory (we can assume the non-shown calculation doesn't write to memory). As long as the loop count in rdi is limited, I could use the ample space (512 bytes) ...

inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch

别等时光非礼了梦想. posted on 2019-11-29 10:03:41
I am trying to compile a C program using cmake which uses SIMD intrinsics. When I try to compile it, I get two errors: /usr/lib/gcc/x86_64-linux-gnu/5/include/smmintrin.h:326:1: error: inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch _mm_mullo_epi32 (__m128i __X, __m128i __Y) /usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch _mm_shuffle_epi8 (__m128i __X, __m128i __Y) This issue has already been solved on StackOverflow by setting set(CMAKE ...
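This error appears when an intrinsic such as `_mm_mullo_epi32` (SSE4.1) is called from a function that is compiled without that instruction set enabled. Besides enabling it for the whole build with compiler flags such as `-msse4.1` or `-mssse3`, GCC and Clang also allow enabling it per function with the `target` attribute. The sketch below shows that alternative together with a runtime check; it is not the fix referenced in the excerpt, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_mullo_epi32

// Only this function is compiled with SSE4.1 enabled (GCC/Clang extension),
// so the always_inline intrinsic can be inlined here without -msse4.1.
__attribute__((target("sse4.1")))
static void mul4_sse41(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_mullo_epi32(va, vb));
}

// Portable fallback for CPUs without SSE4.1.
static void mul4_scalar(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] * b[i];
    }
}

// Dispatch at run time; __builtin_cpu_supports is a GCC/Clang builtin.
void mul4(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    if (__builtin_cpu_supports("sse4.1")) {
        mul4_sse41(a, b, out);
    } else {
        mul4_scalar(a, b, out);
    }
}
```

The per-function attribute keeps the rest of the program at the baseline instruction set, which matters when the binary must still run on older CPUs.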

Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?

岁酱吖の posted on 2019-11-29 09:59:13
Why does _mm_extract_ps return an int instead of a float? What's the proper way to read a single float from an XMM register in C? Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction? From the MSDN docs, I believe you can cast the result to a float. Note from their example, the 0xc0a40000 value is equivalent to -5.125 (a.m128_f32[1]). Update: I strongly recommend the answers from @doug65536 and @PeterCordes (below) in lieu of mine, which apparently generates poorly performing code on many compilers. None of the answers appear to actually answer the ...
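For context, the underlying EXTRACTPS instruction writes its result to a general-purpose register or to memory, which is why the intrinsic hands back the raw bits as an `int`. One commonly used way to read a lane as a `float` with plain SSE is to shuffle the wanted lane into element 0 and call `_mm_cvtss_f32`. The sketch below shows that, plus a bit-level reinterpretation matching the 0xc0a40000 example; the names `getLane` and `bitsToFloat` are hypothetical, and this is not claimed to be the code from the recommended answers.

```cpp
#include <cstdint>
#include <cstring>
#include <xmmintrin.h>  // SSE

// Read lane `Lane` (0..3) of an XMM register as a float, using SSE only.
// The lane index is a template parameter because the shuffle needs an immediate.
template <int Lane>
float getLane(__m128 v) {
    static_assert(Lane >= 0 && Lane < 4, "lane index out of range");
    // Broadcast the wanted lane, then take element 0 as a scalar float.
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(Lane, Lane, Lane, Lane)));
}

// Reinterpret a 32-bit pattern (such as the int from _mm_extract_ps) as a float.
// memcpy avoids the aliasing problems of a pointer cast.
float bitsToFloat(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Note that `_mm_set_ps` takes its arguments from the highest lane to the lowest, so after `_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)` the call `getLane<0>(v)` yields 1.0f.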