SSE

How do I perform an 8 x 8 matrix operation using SSE?

你说的曾经没有我的故事 submitted on 2019-12-05 05:33:29
My initial attempt looked like this (suppose we want to multiply):

    __m128 mat[n];                 /* rows */
    __m128 vec[n] = {1,1,1,1};
    float outvector[n];

    for (int row = 0; row < n; row++) {
        for (int k = 3; k < 8; k = k + 4) {
            __m128 mrow = mat[k];
            __m128 v = vec[row];
            __m128 sum = _mm_mul_ps(mrow, v);
            sum = _mm_hadd_ps(sum, sum);   /* adds adjacent-two floats */
        }
        _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
    }

But this clearly doesn't work. How do I approach this? I should load 4 at a time... The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?
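
For reference, here is a minimal sketch of one way the 8x8 case can be handled with SSE, loading four floats at a time per row and reducing with horizontal adds. The function name and layout assumptions (row-major matrix, 16-byte-aligned pointers) are illustrative, not from the question:

    #include <xmmintrin.h>
    #include <pmmintrin.h>   /* _mm_hadd_ps (SSE3) */

    /* out = mat * vec for an 8x8 row-major matrix; mat and vec 16-byte aligned. */
    static void mat8x8_mul_vec(const float *mat, const float *vec, float *out)
    {
        __m128 v_lo = _mm_load_ps(&vec[0]);    /* vec[0..3] */
        __m128 v_hi = _mm_load_ps(&vec[4]);    /* vec[4..7] */

        for (int row = 0; row < 8; row++) {
            __m128 lo  = _mm_mul_ps(_mm_load_ps(&mat[row * 8 + 0]), v_lo);
            __m128 hi  = _mm_mul_ps(_mm_load_ps(&mat[row * 8 + 4]), v_hi);
            __m128 sum = _mm_add_ps(lo, hi);    /* 4 partial sums              */
            sum = _mm_hadd_ps(sum, sum);        /* 2 partial sums              */
            sum = _mm_hadd_ps(sum, sum);        /* full dot product in lane 0  */
            _mm_store_ss(&out[row], sum);
        }
    }

For the alignment sub-question: _mm_malloc(n * sizeof(float), 16) paired with _mm_free, or a C++11 alignas(16) array, gives 16-byte-aligned storage even for large n such as 1000.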

Matrix-vector multiplication in AVX not proportionately faster than in SSE

一世执手 submitted on 2019-12-05 04:46:04
I was writing a matrix-vector multiplication in both SSE and AVX using the following:

    for (size_t i = 0; i < M; i++) {
        size_t index = i * N;
        __m128 a, x, r1;
        __m128 sum = _mm_setzero_ps();
        for (size_t j = 0; j < N; j += 4, index += 4) {
            a = _mm_load_ps(&A[index]);
            x = _mm_load_ps(&X[j]);
            r1 = _mm_mul_ps(a, x);
            sum = _mm_add_ps(r1, sum);
        }
        sum = _mm_hadd_ps(sum, sum);
        sum = _mm_hadd_ps(sum, sum);
        _mm_store_ss(&C[i], sum);
    }

I used a similar method for AVX; however, at the end, since AVX doesn't have an equivalent instruction to _mm_store_ss(), I used:

    _mm_store_ss(&C[i], _mm256_castps256_ps128(sum));

The SSE code gives
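
For comparison, here is a sketch of what the AVX version of the same loop typically looks like, including the 256-bit horizontal reduction at the end. The helper name and the assumptions that A and X are 32-byte aligned and N is a multiple of 8 are mine, not the question's:

    #include <immintrin.h>
    #include <stddef.h>

    /* C = A * X, where A is MxN row-major, 32-byte aligned, N a multiple of 8. */
    static void matvec_avx(const float *A, const float *X, float *C,
                           size_t M, size_t N)
    {
        for (size_t i = 0; i < M; i++) {
            __m256 sum = _mm256_setzero_ps();
            for (size_t j = 0; j < N; j += 8) {
                __m256 a = _mm256_load_ps(&A[i * N + j]);
                __m256 x = _mm256_load_ps(&X[j]);
                sum = _mm256_add_ps(_mm256_mul_ps(a, x), sum);
            }
            /* horizontal reduction: add the two 128-bit halves, then hadd twice */
            __m128 lo = _mm256_castps256_ps128(sum);
            __m128 hi = _mm256_extractf128_ps(sum, 1);
            __m128 s  = _mm_add_ps(lo, hi);
            s = _mm_hadd_ps(s, s);
            s = _mm_hadd_ps(s, s);
            _mm_store_ss(&C[i], s);
        }
    }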

How do you move 128-bit values between XMM registers?

旧时模样 submitted on 2019-12-05 04:45:15
Seemingly trivial problem in assembly: I want to copy the whole XMM0 register to XMM3. I've tried

    movdq xmm3, xmm0

but MOVDQ cannot be used to move values between two XMM registers. What should I do instead?

It's movapd, movaps, or movdqa:

    movaps xmm3, xmm0

They all do the same thing, but there's a catch: movapd and movaps operate in the floating-point domain, while movdqa operates in the integer domain. Use the appropriate one according to your datatype to avoid domain-changing stalls. Also, there's no reason to use movapd; always use movaps instead, because movapd takes an extra byte to encode.

How do I initialize a SIMD vector with a range from 0 to N?

自古美人都是妖i submitted on 2019-12-05 04:34:54
Question: I have the following function I'm trying to write an AVX version for:

    void hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length)
    {
        size_t i, j, v, p;
        char temp;

        if (!salt_length) {
            return;
        }

        for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) {
            v %= salt_length;
            p += salt[v];
            j = (salt[v] + v + p) % i;
            temp = str[i];
            str[i] = str[j];
            str[j] = temp;
        }
    }

I'm trying to vectorize v %= salt_length;. I want to initialize a vector that contains numbers from 0 to str
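
A sketch of the counter part of that idea, assuming AVX2 (256-bit integer operations are not available in plain AVX) and salt_length >= 8 so that a single conditional subtract keeps every lane reduced; the helper name is illustrative:

    #include <immintrin.h>
    #include <stddef.h>

    /* Eight lanes hold v, v+1, ..., v+7; advance them by 8 and reduce
     * mod salt_length with compare-and-subtract instead of a division.
     * Assumes salt_length >= 8 and that it fits in a signed 32-bit lane. */
    static __m256i advance_v_mod(__m256i v, size_t salt_length)
    {
        const __m256i step = _mm256_set1_epi32(8);
        const __m256i len  = _mm256_set1_epi32((int)salt_length);

        v = _mm256_add_epi32(v, step);                 /* all lanes += 8          */
        __m256i lt  = _mm256_cmpgt_epi32(len, v);      /* lanes still < len       */
        __m256i sub = _mm256_andnot_si256(lt, len);    /* len where lane >= len   */
        return _mm256_sub_epi32(v, sub);               /* conditional subtract    */
    }

The "range from 0 to N" initialization itself is a single setr: __m256i v = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);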

Choice between aligned vs. unaligned x86 SIMD instructions

橙三吉。 submitted on 2019-12-05 03:21:53
There are generally two types of SIMD instructions:

A. Those that work with aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:

    movaps  xmm0, xmmword ptr [rax]
    vmovaps ymm0, ymmword ptr [rax]
    vmovaps zmm0, zmmword ptr [rax]

B. Those that work with unaligned memory addresses and do not raise such an exception:

    movups  xmm0, xmmword ptr [rax]
    vmovups ymm0, ymmword ptr [rax]
    vmovups zmm0, zmmword ptr [rax]

But I'm just curious: why would I want to shoot myself in the foot and use aligned memory instructions
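
To make the distinction concrete in C, here is a small hedged example (an illustration, not a performance claim): with storage that is guaranteed 32-byte aligned, the aligned intrinsic is legal; the unaligned intrinsic works at any address, including the deliberately misaligned one below:

    #include <immintrin.h>
    #include <stdalign.h>

    static void demo(void)
    {
        alignas(32) float buf[16] = {0};       /* guaranteed 32-byte aligned */

        __m256 a = _mm256_load_ps(&buf[0]);    /* aligned load: would #GP on a misaligned address */
        __m256 u = _mm256_loadu_ps(&buf[1]);   /* unaligned load: fine at any address             */
        (void)a; (void)u;
    }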

Performance of SSE and AVX when both are memory-bandwidth limited

痴心易碎 submitted on 2019-12-05 02:44:54
Question: In the code below I changed dataLen and got different efficiency:

    dataLen = 400    SSE time:   758000 us   AVX time:   483000 us   SSE > AVX
    dataLen = 2400   SSE time:  4212000 us   AVX time:  2636000 us   SSE > AVX
    dataLen = 2864   SSE time:  6115000 us   AVX time:  6146000 us   SSE ~= AVX
    dataLen = 3200   SSE time:  8049000 us   AVX time:  9297000 us   SSE < AVX
    dataLen = 4000   SSE time: 10170000 us   AVX time: 11690000 us   SSE < AVX

The SSE and AVX code can both be simplified into this:

    buf3[i] += buf1[1]*buf2[i];

    #include "testfun.h
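
For reference, a sketch of how that simplified kernel is commonly written in SSE and AVX intrinsics. The original testfun.h is not shown, so the function names, the 16/32-byte alignment of the buffers, and dataLen being a multiple of 8 are all assumptions:

    #include <immintrin.h>
    #include <stddef.h>

    /* buf3[i] += buf1[1] * buf2[i], SSE form (4 floats per step). */
    static void kernel_sse(float *buf3, const float *buf1, const float *buf2, size_t n)
    {
        __m128 k = _mm_set1_ps(buf1[1]);
        for (size_t i = 0; i < n; i += 4) {
            __m128 b2 = _mm_load_ps(&buf2[i]);
            __m128 b3 = _mm_load_ps(&buf3[i]);
            _mm_store_ps(&buf3[i], _mm_add_ps(b3, _mm_mul_ps(k, b2)));
        }
    }

    /* Same kernel, AVX form (8 floats per step). */
    static void kernel_avx(float *buf3, const float *buf1, const float *buf2, size_t n)
    {
        __m256 k = _mm256_set1_ps(buf1[1]);
        for (size_t i = 0; i < n; i += 8) {
            __m256 b2 = _mm256_load_ps(&buf2[i]);
            __m256 b3 = _mm256_load_ps(&buf3[i]);
            _mm256_store_ps(&buf3[i], _mm256_add_ps(b3, _mm256_mul_ps(k, b2)));
        }
    }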

Is it okay to mix legacy SSE encoded instructions and VEX encoded ones in the same code path?

北城以北 submitted on 2019-12-05 02:31:10
Along with the introduction of AVX, Intel introduced the VEX encoding scheme into the Intel 64 and IA-32 architecture. This encoding scheme is used mostly with AVX instructions. I was wondering whether it's okay to intermix VEX-encoded instructions and the so-called "legacy SSE" instructions. The main reason for asking is code size. Consider these two instructions:

    shufps  xmm0, xmm0, 0
    vshufps xmm0, xmm0, xmm0, 0

I commonly use the first one to "broadcast" a scalar value to all the places in an XMM register. Now, the instruction set says that the only difference between these two
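
In intrinsics code this choice is normally left to the compiler: the same broadcast shuffle is emitted as the VEX form (vshufps) when compiling with -mavx or /arch:AVX, and as the legacy shufps otherwise. A small hedged example (function name mine):

    #include <xmmintrin.h>

    /* Broadcast lane 0 of x to all four lanes. */
    static __m128 broadcast0(__m128 x)
    {
        return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 0));
    }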

Faster quaternion vector multiplication doesn't work

两盒软妹~` submitted on 2019-12-05 02:23:45
Question: I need a faster quaternion-vector multiplication routine for my math library. Right now I'm using the canonical v' = qv(q^-1), which produces the same result as multiplying the vector by a matrix made from the quaternion, so I'm confident in its correctness. So far I've implemented 3 alternative "faster" methods:

#1, I have no idea where I got this one from:

    v' = (q.xyz * 2 * dot(q.xyz, v))
       + (v * (q.w*q.w - dot(q.xyz, q.zyx)))
       + (cross(q.xyz, v) * q.w * w)

Implemented as:

    vec3 rotateVector
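
For reference, the optimization most often cited for this, valid for unit quaternions, is v' = v + 2*w*(q.xyz x v) + 2*(q.xyz x (q.xyz x v)). A sketch is below; the vec3/quat types and helper names are illustrative and this is not necessarily the asker's method #1:

    typedef struct { float x, y, z; } vec3;
    typedef struct { float x, y, z, w; } quat;

    static vec3 cross3(vec3 a, vec3 b)
    {
        vec3 r = { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
        return r;
    }

    /* Rotate v by the unit quaternion q. */
    static vec3 rotate_by_quat(quat q, vec3 v)
    {
        vec3 u = { q.x, q.y, q.z };
        vec3 t = cross3(u, v);                    /* q.xyz x v            */
        t.x *= 2.0f; t.y *= 2.0f; t.z *= 2.0f;    /* t = 2 * (q.xyz x v)  */
        vec3 c = cross3(u, t);                    /* q.xyz x t            */
        vec3 r = { v.x + q.w * t.x + c.x,
                   v.y + q.w * t.y + c.y,
                   v.z + q.w * t.z + c.z };
        return r;
    }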

Strange uint32_t to float array conversion

风格不统一 submitted on 2019-12-05 02:11:31
I have the following code snippet:

    #include <cstdio>
    #include <cstdint>
    #include <cstddef>

    static const size_t ARR_SIZE = 129;

    int main()
    {
        uint32_t value = 2570980487;
        uint32_t arr[ARR_SIZE];
        for (int x = 0; x < ARR_SIZE; ++x)
            arr[x] = value;

        float arr_dst[ARR_SIZE];
        for (int x = 0; x < ARR_SIZE; ++x) {
            arr_dst[x] = static_cast<float>(arr[x]);
        }

        printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");
        printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
        printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
        return 0;
    }

If I compile it under MS Visual Studio 2015 I can see that the output is:

Fastest way to expand bits in a field to all (overlapping + adjacent) set bits in a mask?

前提是你 submitted on 2019-12-05 00:27:59
Say I have two binary inputs named IN and MASK. The actual field size could be 32 to 256 bits, depending on what instruction set is used to accomplish the task. Both inputs change every call.

Inputs:

    IN   = ...1100010010010100...
    MASK = ...0001111010111011...

Output:

    OUT  = ...0001111010111000...

Edit: another example result from some comment discussion:

    IN   = ...11111110011010110...
    MASK = ...01011011001111110...

Output:

    OUT  = ...01011011001111110...

I want to get the contiguous adjacent 1 bits of MASK that a 1 bit of IN is within. (Is there a general term for this kind of operation? Maybe I'm not
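
As a baseline, a plain scalar reference for a 64-bit field pins down the operation exactly: walk the runs of MASK one at a time and keep each run only if IN has a set bit inside it. This is deliberately not the fast SIMD answer the question is after, and the function name is mine:

    #include <stdint.h>

    /* For each contiguous run of 1s in mask, keep the whole run if it
     * overlaps any set bit of in; otherwise clear it. */
    static uint64_t expand_ref(uint64_t in, uint64_t mask)
    {
        uint64_t out = 0;
        uint64_t m = mask;
        while (m) {
            uint64_t low = m & -m;                 /* lowest set bit of the remaining mask */
            uint64_t run = m & (m ^ (m + low));    /* isolate the lowest contiguous run    */
            if (in & run)                          /* does IN touch this run?              */
                out |= run;
            m &= ~run;                             /* drop the run and continue            */
        }
        return out;
    }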