SSE

transpose for 8 registers of 16-bit elements on SSE2/SSSE3

被刻印的时光 ゝ submitted on 2019-11-30 05:44:48
Question: (I'm a newbie to SSE/asm, apologies if this is obvious or redundant.) Is there a better way to transpose 8 SSE registers containing 16-bit values than performing 24 unpck[lh]ps and 8/16+ shuffles and using 8 extra registers? (Note: using up to SSSE3 instructions, Intel Merom, i.e. lacking BLEND* from SSE4.) Say you have registers v[0-7] and use t0-t7 as aux registers. In pseudo-intrinsics code:

/* Phase 1: process lower parts of the registers */
/* Level 1: work first part of the vectors */
/*
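One commonly used alternative, not taken from the excerpt above, is the plain three-stage unpack network: it needs exactly 24 integer unpacks and no shuffles at all. A minimal sketch, assuming SSE2 only; the function name is hypothetical:

#include <emmintrin.h>  /* SSE2 */

/* Transpose an 8x8 matrix of 16-bit elements held in v[0..7] (row r in v[r]). */
static void transpose8x8_epi16(__m128i v[8])
{
    /* Stage 1: interleave 16-bit lanes of adjacent rows. */
    __m128i a0 = _mm_unpacklo_epi16(v[0], v[1]), a1 = _mm_unpackhi_epi16(v[0], v[1]);
    __m128i a2 = _mm_unpacklo_epi16(v[2], v[3]), a3 = _mm_unpackhi_epi16(v[2], v[3]);
    __m128i a4 = _mm_unpacklo_epi16(v[4], v[5]), a5 = _mm_unpackhi_epi16(v[4], v[5]);
    __m128i a6 = _mm_unpacklo_epi16(v[6], v[7]), a7 = _mm_unpackhi_epi16(v[6], v[7]);

    /* Stage 2: interleave 32-bit pairs. */
    __m128i b0 = _mm_unpacklo_epi32(a0, a2), b1 = _mm_unpackhi_epi32(a0, a2);
    __m128i b2 = _mm_unpacklo_epi32(a1, a3), b3 = _mm_unpackhi_epi32(a1, a3);
    __m128i b4 = _mm_unpacklo_epi32(a4, a6), b5 = _mm_unpackhi_epi32(a4, a6);
    __m128i b6 = _mm_unpacklo_epi32(a5, a7), b7 = _mm_unpackhi_epi32(a5, a7);

    /* Stage 3: interleave 64-bit halves; v[c] now holds column c. */
    v[0] = _mm_unpacklo_epi64(b0, b4);  v[1] = _mm_unpackhi_epi64(b0, b4);
    v[2] = _mm_unpacklo_epi64(b1, b5);  v[3] = _mm_unpackhi_epi64(b1, b5);
    v[4] = _mm_unpacklo_epi64(b2, b6);  v[5] = _mm_unpackhi_epi64(b2, b6);
    v[6] = _mm_unpacklo_epi64(b3, b7);  v[7] = _mm_unpackhi_epi64(b3, b7);
}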

How to negate (change sign) of the floating point elements in a __m128 type variable?

坚强是说给别人听的谎言 submitted on 2019-11-30 05:24:07
Question: Is there any single instruction or function that can invert the sign of every float inside a __m128? I.e. a = r0:r1:r2:r3 ===> a = -r0:-r1:-r2:-r3? I know this can be done with _mm_sub_ps(_mm_set1_ps(0.0), a), but isn't that potentially slow, since _mm_set1_ps(0.0) is a multi-instruction function? Answer 1: In practice your compiler should do a good job of generating the constant vector for 0.0. It will probably just use _mm_xor_ps, and if your code is in a loop it should hoist the constant
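A minimal sketch of the XOR-based negation the answer alludes to (helper name hypothetical): flipping only the sign bit of each lane avoids the subtraction entirely, and the constant is a single broadcast the compiler can hoist:

#include <xmmintrin.h>  /* SSE */

static inline __m128 negate_ps(__m128 a)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* 0x80000000 in every lane */
    return _mm_xor_ps(a, sign_mask);              /* flips only the sign bits */
}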

How to multiply two quaternions with minimal instructions?

天涯浪子 submitted on 2019-11-30 05:13:44
After some thought, I came up with the following code for multiplying two quaternions using SSE:

#include <pmmintrin.h> /* SSE3 intrinsics */

/* multiplication of two quaternions (x, y, z, w) x (a, b, c, d) */
__m128 _mm_cross4_ps(__m128 xyzw, __m128 abcd)
{
    /* The product of two quaternions is: */
    /* (X,Y,Z,W) = (xd+yc-zb+wa, -xc+yd+za+wb, xb-ya+zd+wc, -xa-yb-zc+wd) */
    __m128 wzyx = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(0,1,2,3));
    __m128 baba = _mm_shuffle_ps(abcd, abcd, _MM_SHUFFLE(0,1,0,1));
    __m128 dcdc = _mm_shuffle_ps(abcd, abcd, _MM_SHUFFLE(2,3,2,3));
    /* variable names below are for
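For reference, a scalar version of the exact formula quoted in the comment above is handy for validating a SIMD routine lane by lane; the struct and function names here are hypothetical:

/* q1 = (x, y, z, w), q2 = (a, b, c, d); mirrors the (X,Y,Z,W) formula above. */
typedef struct { float x, y, z, w; } quat;

static quat quat_mul_ref(quat q1, quat q2)
{
    quat r;
    r.x =  q1.x*q2.w + q1.y*q2.z - q1.z*q2.y + q1.w*q2.x;
    r.y = -q1.x*q2.z + q1.y*q2.w + q1.z*q2.x + q1.w*q2.y;
    r.z =  q1.x*q2.y - q1.y*q2.x + q1.z*q2.w + q1.w*q2.z;
    r.w = -q1.x*q2.x - q1.y*q2.y - q1.z*q2.z + q1.w*q2.w;
    return r;
}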

What's the proper way to use different versions of SSE intrinsics in GCC?

空扰寡人 submitted on 2019-11-30 05:11:33
I will ask my question by giving an example. Now I have a function called do_something(). It has three versions: do_something(), do_something_sse3(), and do_something_sse4(). When my program runs, it will detect the CPU features (see if it supports SSE3 or SSE4) and call one of the three versions accordingly. The problem is: when I build my program with GCC, I have to set -msse4 for do_something_sse4() to compile (e.g. for the header file <smmintrin.h> to be included). However, if I set -msse4, then gcc is allowed to use SSE4 instructions, and some intrinsics in do_something_sse3() are also
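A hedged sketch of one common resolution, assuming a reasonably recent GCC (roughly 4.9 or later) where per-function target attributes make the intrinsics visible without raising -msse4 for the whole translation unit; the function names follow the example above and the bodies are placeholders:

#include <immintrin.h>

__attribute__((target("sse3")))
void do_something_sse3(float *p)
{
    __m128 v = _mm_loadu_ps(p);
    _mm_storeu_ps(p, _mm_hadd_ps(v, v));   /* an SSE3 instruction */
}

__attribute__((target("sse4.1")))
void do_something_sse4(float *p)
{
    __m128 v = _mm_loadu_ps(p);
    _mm_storeu_ps(p, _mm_floor_ps(v));     /* an SSE4.1 instruction */
}

void do_something(float *p)
{
    if (__builtin_cpu_supports("sse4.1"))  /* runtime CPU dispatch */
        do_something_sse4(p);
    else
        do_something_sse3(p);
}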

SIMD programming languages

隐身守侯 submitted on 2019-11-30 05:05:32
In the last couple of years, I've been doing a lot of SIMD programming, and most of the time I've been relying on compiler intrinsics (such as the ones for SSE programming) or on assembly to get to the really nifty stuff. However, up until now I've hardly been able to find any programming language with built-in support for SIMD. Now obviously there are the shader languages such as HLSL, Cg and GLSL that have native support for this kind of thing; however, I'm looking for something that's able to at least compile to SSE without autovectorization but with built-in support for
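As one hedged illustration of what built-in SIMD can look like without relying on autovectorization, GCC and Clang provide vector extensions in which ordinary arithmetic on vector types maps directly to packed instructions; the type and function names below are made up for the example:

typedef float v4sf __attribute__((vector_size(16)));  /* four packed floats */

v4sf madd(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;   /* compiles to mulps/addps (or an FMA where available) */
}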

_mm_crc32_u64 poorly defined

僤鯓⒐⒋嵵緔 submitted on 2019-11-30 04:51:01
Question: Why in the world was _mm_crc32_u64(...) defined like this? unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v ); The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (it is, after all, CRC32, not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored and filled with 0's on completion, so there is NO use in EVER having a 64-bit destination. I understand why Intel allowed a 64-bit destination
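A small usage sketch (names hypothetical, compiled with SSE4.2 enabled) that shows why the wide result type is redundant in practice: the running CRC only ever occupies the low 32 bits of the 64-bit accumulator:

#include <nmmintrin.h>  /* SSE4.2 */
#include <stdint.h>
#include <stddef.h>

/* Accumulate a CRC-32C over a buffer, 8 bytes at a time. */
static uint32_t crc32c_u64_buf(const uint64_t *p, size_t n, uint32_t seed)
{
    uint64_t crc = seed;                 /* widened only to match the prototype */
    for (size_t i = 0; i < n; i++)
        crc = _mm_crc32_u64(crc, p[i]);
    return (uint32_t)crc;                /* upper 32 bits are always zero */
}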

Setting __m256i to the value of two __m128i values

柔情痞子 submitted on 2019-11-30 04:37:38
Question: So, AVX has a function from immintrin.h which should allow storing the concatenation of two __m128i values in a single __m256i value. The function is __m256i _mm256_set_m128i (__m128i hi, __m128i lo) However, when I use it, like so:

__m256i as[2];
__m128i s[4];
as[0] = _mm256_setr_m128i(s[0], s[1]);

I get a compilation error: error: incompatible types when assigning to type ‘__m256i’ from type ‘int’ I don't really understand why this happens. Any help is greatly appreciated! Answer 1: Not all
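If the header in use predates _mm256_set_m128i/_mm256_setr_m128i (the compiler then treats the name as an undeclared function returning int, which matches the error), one commonly suggested fallback is to build the value from a cast plus a 128-bit insert. A sketch with a hypothetical helper name:

#include <immintrin.h>

/* Concatenate lo (bits 127:0) and hi (bits 255:128) into one __m256i. */
static inline __m256i setr_m128i_fallback(__m128i lo, __m128i hi)
{
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}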

How to align stack at 32 byte boundary in GCC?

谁都会走 submitted on 2019-11-30 04:28:49
Question: I'm using a MinGW64 build based on GCC 4.6.1 for a 64-bit Windows target. I'm playing around with Intel's new AVX instructions. My command-line arguments are -march=corei7-avx -mtune=corei7-avx -mavx. But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack on 64-bit Windows has only 16-byte alignment.
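Two hedged workarounds that are often suggested while the compiler cannot be trusted to realign the stack (sketch only, names hypothetical): keep 32-byte data off the stack with _mm_malloc, and/or use unaligned loads/stores so no 32-byte-aligned access can fault:

#include <immintrin.h>

void example(const float *src, float *dst)
{
    /* 32-byte-aligned heap block instead of an over-aligned stack local */
    float *buf = (float *)_mm_malloc(8 * sizeof(float), 32);

    __m256 v = _mm256_loadu_ps(src);     /* vmovups: no alignment requirement */
    _mm256_store_ps(buf, v);             /* safe: buf is 32-byte aligned */
    _mm256_storeu_ps(dst, _mm256_load_ps(buf));

    _mm_free(buf);
}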

8 bit shift operation in AVX2 with shifting in zeros

前提是你 submitted on 2019-11-30 03:02:33
Question: Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift a __m256i register by x bytes? _mm256_slli_si256 seems to just execute two _mm_slli_si128 operations, on a[127:0] and a[255:128]. The left shift should work on a __m256i like this: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0] I saw in a thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit elements. But I need a more generic solution to shift by x bytes. Has
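One approach discussed for this kind of problem (sketched here with a hypothetical macro name, assuming AVX2 and a compile-time shift count 0 < N < 16) combines a cross-lane permute with a per-lane byte-align, so the bytes that cross the 128-bit boundary are carried over and zeros are shifted in at the bottom; this is the _mm_slli_si128-style direction, toward higher byte positions:

#include <immintrin.h>  /* AVX2 */

/* Shift the whole 256-bit value left by N bytes, filling with zeros.
   The permute builds [ a.low : zeros ], which alignr then uses as the
   carry-in for each 128-bit lane. N must be an immediate (VPALIGNR),
   and the macro evaluates a twice. */
#define MM256_SLLI_SI256_BYTES(a, N) \
    _mm256_alignr_epi8((a), \
                       _mm256_permute2x128_si256((a), (a), 0x08), \
                       16 - (N))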

Vectorizing Modular Arithmetic

我的梦境 submitted on 2019-11-30 02:56:04
Question: I'm trying to write some reasonably fast component-wise vector addition code. I'm working with (signed, I believe) 64-bit integers. The function is

void addRq (int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    for(int i = 0; i < dim; i++) {
        a[i] = (a[i]+b[i])%q; // LINE1
    }
}

I'm compiling with icc -std=gnu99 -O3 (icc so I can use SVML later) on an IvyBridge (SSE4.2 and AVX, but not AVX2). My baseline is removing the %q from LINE1. 100 (iterated) function calls with dim
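A hedged sketch of the usual way to vectorize this, under the assumption (not stated in the excerpt) that the inputs are already reduced to [0, q) and that q stays well below 2^62: the % becomes a compare-and-subtract, which SSE4.2 can do on 64-bit lanes. The function name is hypothetical:

#include <immintrin.h>  /* SSE4.2 for _mm_cmpgt_epi64 */
#include <stdint.h>

void addRq_sse(int64_t *a, const int64_t *b, int32_t dim, int64_t q)
{
    const __m128i vq   = _mm_set1_epi64x(q);
    const __m128i vqm1 = _mm_set1_epi64x(q - 1);
    int i = 0;
    for (; i + 2 <= dim; i += 2) {
        __m128i s = _mm_add_epi64(_mm_loadu_si128((const __m128i *)&a[i]),
                                  _mm_loadu_si128((const __m128i *)&b[i]));
        /* if (s >= q) s -= q;  done branchlessly: mask = (s > q-1) */
        __m128i ge = _mm_cmpgt_epi64(s, vqm1);
        s = _mm_sub_epi64(s, _mm_and_si128(ge, vq));
        _mm_storeu_si128((__m128i *)&a[i], s);
    }
    for (; i < dim; i++) {                 /* scalar tail */
        int64_t s = a[i] + b[i];
        a[i] = (s >= q) ? s - q : s;
    }
}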