SSE

SSE matrix-matrix multiplication

拟墨画扇 submitted on 2019-12-06 10:47:25
I'm having trouble doing matrix-matrix multiplication with SSE in C. Here is what I have so far:

    #define N 1000

    void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {
        int i, j, k;
        __m128i vA, vB, vR;
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; ++j) {
                vR = _mm_setzero_si128();
                for (k = 0; k < N; k += 4) {
                    // result[i][j] += mat1[i][k] * mat2[k][j];
                    vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                    vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                    // how well does the k += 4 work here? Should it be unrolled?
                    vR = _mm_add_epi32(vR, _mm_mul_epi32(vA, vB));
                }
                vR = _mm_hadd_epi32(vR, vR);
                vR = _mm_hadd…
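Two things in this excerpt are worth flagging: _mm_mul_epi32 multiplies only the even 32-bit lanes (producing 64-bit products), and &mat2[k][j] loads four consecutive elements of row k rather than the column that the scalar code walks. What follows is only a sketch of the commonly suggested reordering, not the asker's or any answerer's exact code: vectorize over j so every load is contiguous. The function name is made up, and it assumes SSE4.1 for _mm_mullo_epi32 and that N is a multiple of 4.

    #include <emmintrin.h>
    #include <smmintrin.h>   /* SSE4.1 for _mm_mullo_epi32 */

    #define N 1000

    /* Sketch: accumulate result[i][j..j+3] += mat1[i][k] * mat2[k][j..j+3].
     * _mm_set1_epi32 broadcasts the scalar mat1[i][k]; _mm_mullo_epi32 keeps
     * the low 32 bits of every product, unlike _mm_mul_epi32, which widens
     * only the even lanes to 64 bits. */
    void matmulSSE_rows(int mat1[N][N], int mat2[N][N], int result[N][N]) {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; j += 4)
                _mm_storeu_si128((__m128i*)&result[i][j], _mm_setzero_si128());
            for (int k = 0; k < N; ++k) {
                __m128i a = _mm_set1_epi32(mat1[i][k]);      /* broadcast one scalar */
                for (int j = 0; j < N; j += 4) {
                    __m128i b = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                    __m128i r = _mm_loadu_si128((__m128i*)&result[i][j]);
                    r = _mm_add_epi32(r, _mm_mullo_epi32(a, b));
                    _mm_storeu_si128((__m128i*)&result[i][j], r);
                }
            }
        }
    }

With this layout each vector of partial sums already maps to result[i][j..j+3], so no horizontal add is needed at all.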

What's a good place to start learning assembly?

♀尐吖头ヾ submitted on 2019-12-06 09:55:09
Question: I need to learn assembly using SSE instructions and need gcc to link the ASM code with C code. I have no idea where to start, and Google hasn't helped. Answer 1: You might want to start by looking through the chip documentation from Intel, the Intel Processor Software Developer Manuals. Assembly language coding isn't a whole lot of fun, and it's usually unnecessary except in the few cases where code is performance critical. Given that you are looking at SSE, I would hazard that your effort may be better spent…
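Neither the question nor the truncated answer includes code, but as an illustration only, here is a minimal sketch of one way to mix SSE assembly with C under gcc: GCC extended inline asm in AT&T syntax on x86-64. Separate .s files assembled and linked by gcc work just as well; the function name here is invented for the example.

    /* inline_sse.c — compile with: gcc -O2 inline_sse.c */
    #include <stdio.h>

    static float add_scalar_sse(float a, float b) {
        /* addss adds the low single-precision element of two XMM registers.
         * The "x" constraints ask gcc to keep both operands in XMM registers;
         * "+x" marks a as read-write so the result comes back in it. */
        __asm__("addss %1, %0" : "+x"(a) : "x"(b));
        return a;
    }

    int main(void) {
        printf("%f\n", add_scalar_sse(1.5f, 2.25f));   /* prints 3.750000 */
        return 0;
    }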

Why is there no floating point intrinsic for the `PSHUFD` instruction?

久未见 submitted on 2019-12-06 08:52:58
The task I'm facing is to shuffle one __m128 vector and store the result in another one. The way I see it, there are two basic ways to shuffle a packed floating-point __m128 vector: _mm_shuffle_ps, which uses the SHUFPS instruction and is not necessarily the best option if you want values from one vector only, since it takes two values from the destination operand, which implies an extra move; and _mm_shuffle_epi32, which uses the PSHUFD instruction, seems to do exactly what is expected here, and can have better latency/throughput than SHUFPS. The latter intrinsic, however, works with integer vectors (…
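The excerpt cuts off here; one workaround that is commonly used for this situation can be sketched as follows (my code, not taken from the question): the cast intrinsics only reinterpret the vector type and emit no instructions, so PSHUFD can be applied to float data. On some microarchitectures an integer shuffle feeding FP instructions costs an extra cycle of bypass latency, so whether this wins over SHUFPS depends on the surrounding code.

    #include <emmintrin.h>

    /* Reverse the four float lanes of v using PSHUFD via free casts. */
    static inline __m128 shuffle_ps_via_pshufd(__m128 v) {
        __m128i vi = _mm_castps_si128(v);                     /* reinterpret as integers */
        vi = _mm_shuffle_epi32(vi, _MM_SHUFFLE(0, 1, 2, 3));  /* PSHUFD on one source    */
        return _mm_castsi128_ps(vi);                          /* reinterpret back        */
    }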

How to make the following code faster

白昼怎懂夜的黑 submitted on 2019-12-06 08:31:53
Question:

    int u1, u2;
    unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40];

res1 and res2 are 64-bit and initialized to zero.

    l = 60;
    while (l) {
        for (i = 0; i < 20; i += 2) {
            u1 = (elm1[i] >> l) & 15;
            u2 = (elm1[i + 1] >> l) & 15;
            for (k = 0; k < 20; k += 2) {
                simda = _mm_load_si128((__m128i *) &_mulpre[u1][k]);
                simdb = _mm_load_si128((__m128i *) &res1[i + k]);
                simdb = _mm_xor_si128(simda, simdb);
                _mm_store_si128((__m128i *) &res1[i + k], simdb);
                simda = _mm_load_si128((__m128i *) &_mulpre[u2][k]…

memcpy moving 128 bits in Linux

倾然丶 夕夏残阳落幕 submitted on 2019-12-06 07:04:19
Question: I'm writing a device driver in Linux for a PCIe device. This device driver performs several reads and writes to test the throughput. When I use memcpy, the maximum payload for a TLP is 8 bytes (on 64-bit architectures). In my opinion the only way to get a payload of 16 bytes is to use the SSE instruction set. I've already seen this, but the code doesn't compile (an AT&T/Intel syntax issue). Is there a way to use that code inside Linux? Does anyone know where I can find an implementation…
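The excerpt is cut short, but the syntax problem can be sidestepped entirely by using intrinsics instead of inline assembly. Below is only a sketch, written user-space style with an invented function name: inside a kernel driver, SSE/AVX use normally has to be bracketed by kernel_fpu_begin()/kernel_fpu_end(), and whether a single 16-byte store actually becomes a 16-byte TLP still depends on the hardware and the mapping attributes.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Copy 'bytes' (assumed a multiple of 16) using one 16-byte load and one
     * 16-byte store per iteration, so each transfer is a full XMM register. */
    static void copy128(void *dst, const void *src, size_t bytes) {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < bytes / 16; ++i) {
            __m128i v = _mm_loadu_si128(s + i);   /* 16-byte load  */
            _mm_storeu_si128(d + i, v);           /* 16-byte store */
        }
    }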

Passing types containing SSE/AVX values

一笑奈何 submitted on 2019-12-06 06:06:14
Let's say I have the following:

    struct A { __m256 a; };
    struct B { __m256 a; float b; };

Which of the following is generally better (if any, and why) in a hot loop?

    void f0(A a)  { ... }
    void f1(A& a) { ... }  // and the pointer variation
    void f2(B b)  { ... }
    void f3(B& b) { ... }  // and the pointer variation

The answer is that it doesn't matter. According to this: http://msdn.microsoft.com/en-us/library/ms235286.aspx the calling convention states that 16-byte (and probably 32-byte) operands are always passed by reference. So even if you pass by value, the compiler will pass it by reference…

Avoiding AVX-SSE (VEX) Transition Penalties

荒凉一梦 submitted on 2019-12-06 06:01:40
Question: Our 64-bit application has lots of code (inter alia, in standard libraries) that uses the xmm0-xmm7 registers in SSE mode. I would like to implement a fast memory copy using ymm registers. I cannot modify all the code that uses xmm registers to add the VEX prefix, and I also think that this is not practical, since it will increase the size of the code and can make it run slower because of the need for the CPU to decode larger instructions. I just want to use two ymm registers (and possibly zmm — the…
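The excerpt stops mid-sentence; as a sketch only (my code and names, with the alignment and size assumptions noted in the comments), the usual pattern is to do the wide copy with VEX-encoded instructions and execute VZEROUPPER before returning, so the surrounding non-VEX SSE code does not pay an AVX-SSE transition penalty.

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy 'bytes' (assumed a multiple of 32) between 32-byte-aligned buffers
     * using ymm registers, then zero the upper halves of all ymm registers. */
    static void copy_avx(void *dst, const void *src, size_t bytes) {
        const __m256i *s = (const __m256i *)src;
        __m256i       *d = (__m256i *)dst;
        for (size_t i = 0; i < bytes / 32; ++i) {
            _mm256_store_si256(d + i, _mm256_load_si256(s + i));
        }
        _mm256_zeroupper();   /* VZEROUPPER before returning to legacy-SSE code */
    }

Compilers that target AVX generally insert VZEROUPPER at function boundaries on their own; the explicit intrinsic matters mainly for hand-written assembly or when mixing code compiled with different settings.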

Performance with SSE is the same

拈花ヽ惹草 submitted on 2019-12-06 04:58:32
Question: I vectorized the following loop, which crops up in an application that I am developing:

    void vecScl(Node** A, Node* B, long val){
        int fact = round( dot / const);
        for(i=0; i<SIZE ;i++)
            (*A)->vector[i] -= fact * B->vector[i];
    }

And this is the SSE code:

    void vecSclSSE(Node** A, Node* B, long val){
        int fact = round( dot / const);
        __m128i vecPi, vecQi, vecCi, vecQCi, vecResi;
        int sseBound = SIZE/4;
        for(i=0,j=0; j<sseBound ; i+=4,j++){
            vecPi = _mm_loadu_si128((__m128i *)&((*A)->vector)[i] );
            vecQi…

SSE2 8x8 byte-matrix transpose code twice as slow on Haswell+ than on Ivy Bridge

南楼画角 submitted on 2019-12-06 04:54:48
I've got code with a lot of punpckl, pextrd, and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling. I profiled it with IACA to see if it was worth writing an AVX2 routine for it, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles, HSW/SKL: 36 cycles). (IVB and HSW were measured with IACA 2.1, SKL with 3.0, but HSW gives the same number with 3.0.) From the IACA output I guess the difference is that IVB uses ports 1 and 5 for the above instructions, while Haswell only uses port 5. I googled a bit, but couldn't find an explanation. Is…

How to perform element-wise left shift with __m128i?

﹥>﹥吖頭↗ submitted on 2019-12-06 04:49:38
Question: The SSE shift instructions I have found can only shift by the same amount on all the elements: _mm_sll_epi32(), _mm_slli_epi32(). These shift all elements, but by the same shift amount. Is there a way to apply different shifts to the different elements? Something like this:

    __m128i a, b;
    r0 := a0 << b0;
    r1 := a1 << b1;
    r2 := a2 << b2;
    r3 := a3 << b3;

Answer 1: There exists the _mm_shl_epi32() intrinsic that does exactly that. http://msdn.microsoft.com/en-us/library/gg445138.aspx However, it…
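The answer is cut off above; for completeness, a short sketch of my own (not from the thread): _mm_shl_epi32 is an XOP intrinsic (AMD-only), while on AVX2 hardware _mm_sllv_epi32 performs the same element-wise variable shift.

    #include <immintrin.h>

    /* r[i] = a[i] << b[i] for each 32-bit lane, using AVX2's VPSLLVD. */
    static inline __m128i shl_per_element(__m128i a, __m128i b) {
        return _mm_sllv_epi32(a, b);
    }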