avx

AVX/SSE version of xorshift128+

Submitted by 蓝咒 on 2019-11-30 14:58:49
Question: I am trying to make the fastest possible high-quality RNG. Having read http://xorshift.di.unimi.it/ , xorshift128+ seems like a good option. The C code is:

    #include <stdint.h>

    uint64_t s[2];

    uint64_t next(void) {
        uint64_t s1 = s[0];
        const uint64_t s0 = s[1];
        s[0] = s0;
        s1 ^= s1 << 23; // a
        return (s[1] = (s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26))) + s0; // b, c
    }

I am sadly not an SSE/AVX expert, but my CPU supports the SSE4.1 / SSE4.2 / AVX / F16C / FMA3 / XOP instructions. How could you use these to speed up this code (assuming you want to make billions of such random numbers) and what…
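The state update has a serial dependency, so a single stream cannot be vectorized; the standard trick is to run several independent streams side by side, one per 64-bit lane. Since the instruction list above is AVX1-class (no AVX2, hence no 256-bit integer operations), a two-stream SSE2 version is the natural fit. A minimal sketch; the seed constants are arbitrary placeholders, not from the original post:

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* Two independent xorshift128+ streams, one per 64-bit lane.
       s0v/s1v hold the s[0]/s[1] words of both streams. */
    static __m128i s0v, s1v;

    static void seed2(void) {
        /* hypothetical placeholder seeds; any distinct nonzero values work */
        s0v = _mm_set_epi64x(0x123456789abcdef0LL, 0x0fedcba987654321LL);
        s1v = _mm_set_epi64x(0x2545f4914f6cdd1dLL, 0x3c6ef372fe94f82bLL);
    }

    static inline __m128i next2(void) {
        __m128i s1 = s0v;
        const __m128i s0 = s1v;
        s0v = s0;
        s1 = _mm_xor_si128(s1, _mm_slli_epi64(s1, 23));            /* a */
        s1 = _mm_xor_si128(_mm_xor_si128(s1, s0),
                           _mm_xor_si128(_mm_srli_epi64(s1, 17),   /* b */
                                         _mm_srli_epi64(s0, 26))); /* c */
        s1v = s1;
        return _mm_add_epi64(s1, s0);  /* two 64-bit results per call */
    }

On an AVX2 machine the same body widens mechanically to __m256i with the _mm256_* equivalents, yielding four streams per call.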

Have different optimizations (plain, SSE, AVX) in the same executable with C/C++

Submitted by 本秂侑毒 on 2019-11-30 13:58:06
I'm developing optimizations for my 3D calculations and I now have: a "plain" version using the standard C language libraries, an SSE-optimized version that compiles under a preprocessor #define USE_SSE, and an AVX-optimized version that compiles under a preprocessor #define USE_AVX. Is it possible to switch between the 3 versions without having to compile different executables (e.g. having different library files and loading the "right" one dynamically; I don't know if inline functions are "right" for that)? I'd also consider the performance cost of having this kind of switch in the software. There are…
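The usual way to get this without separate executables is to compile all three variants into one binary (each SIMD body in its own translation unit built with the matching -msse2/-mavx flags) and select one through a function pointer at startup. A minimal sketch using the GCC/Clang CPU-detection builtins; the function names and trivial bodies are invented for illustration:

    #include <stdio.h>

    /* All three variants live in the same binary.  In a real build each
       SIMD body sits in its own translation unit compiled with -msse2 or
       -mavx; trivial placeholder bodies are used here. */
    static void scale_plain(float *v, int n) { for (int i = 0; i < n; i++) v[i] *= 2.0f; }
    static void scale_sse(float *v, int n)   { for (int i = 0; i < n; i++) v[i] *= 2.0f; }
    static void scale_avx(float *v, int n)   { for (int i = 0; i < n; i++) v[i] *= 2.0f; }

    /* Selected once at startup; every later call is one indirect jump. */
    static void (*scale)(float *, int) = 0;

    static void init_dispatch(void) {
        __builtin_cpu_init();                      /* GCC 4.8+ / Clang */
        if (__builtin_cpu_supports("avx"))         scale = scale_avx;
        else if (__builtin_cpu_supports("sse2"))   scale = scale_sse;
        else                                       scale = scale_plain;
    }

    int main(void) {
        float v[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        init_dispatch();
        scale(v, 4);
        printf("%.1f %.1f %.1f %.1f\n", v[0], v[1], v[2], v[3]);
        return 0;
    }

The indirect call costs a few cycles per invocation, which is negligible as long as each dispatched function does a meaningful chunk of 3D work; dispatching at a coarse granularity (whole-array rather than per-element) keeps the overhead invisible.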

Are different mmx, sse and avx versions complementary or supersets of each other?

Submitted by 巧了我就是萌 on 2019-11-30 11:32:21
Question: I'm thinking I should familiarize myself with the x86 SIMD extensions, but before I even began I ran into trouble: I can't find a good overview of which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions over the decades: MMX, 3DNow!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX-512. Did I forget something? Are the newer ones supersets of the older ones and vice versa? Or are they complementary? Are some of them deprecated? Which of these are still relevant? I…
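As a practical footnote rather than an answer to the superset question: on GCC/Clang you can simply ask the running CPU which levels it reports. A sketch (needs a reasonably recent GCC or Clang, and the argument to __builtin_cpu_supports must be a string literal; MMX and 3DNow! are omitted because 3DNow! was AMD-only and has been dropped, and MMX is obsolete for new code):

    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();
        printf("sse:     %d\n", __builtin_cpu_supports("sse")     != 0);
        printf("sse2:    %d\n", __builtin_cpu_supports("sse2")    != 0);
        printf("sse3:    %d\n", __builtin_cpu_supports("sse3")    != 0);
        printf("ssse3:   %d\n", __builtin_cpu_supports("ssse3")   != 0);
        printf("sse4.1:  %d\n", __builtin_cpu_supports("sse4.1")  != 0);
        printf("sse4.2:  %d\n", __builtin_cpu_supports("sse4.2")  != 0);
        printf("avx:     %d\n", __builtin_cpu_supports("avx")     != 0);
        printf("avx2:    %d\n", __builtin_cpu_supports("avx2")    != 0);
        printf("avx512f: %d\n", __builtin_cpu_supports("avx512f") != 0);
        return 0;
    }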

Shuffling by mask with Intel AVX

Submitted by 自古美人都是妖i on 2019-11-30 09:53:37
I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, into an empty register, R2. I want to define a mask which tells the shuffle operation which byte from the old register (R1) should be copied to which place in the new register. The mask should look like this (src: byte position in R1, target: byte position in R2): {(0,0), (1,1), (1,4), (2,5), ...}. This means several bytes are copied twice. I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions; the second one just uses 2 lanes. __m256 _mm256…
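The byte-granularity shuffle described here matches vpshufb. A sketch assuming AVX2 is available (_mm256_shuffle_epi8): it shuffles within each 128-bit lane, which suffices for this mask because source bytes 0..2 all sit in the low lane; a mask byte with its top bit set (written -1 below) produces a zero. The first four (src, target) pairs from the question are encoded:

    #include <immintrin.h>
    #include <stdio.h>

    /* target 0 <- src 0, target 1 <- src 1, target 4 <- src 1,
       target 5 <- src 2; every other output byte is zeroed. */
    int main(void) {
        unsigned char src[32], dst[32];
        for (int i = 0; i < 32; i++) src[i] = (unsigned char)(i + 1);

        const __m256i mask = _mm256_setr_epi8(
            0, 1, -1, -1, 1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);

        __m256i r1 = _mm256_loadu_si256((const __m256i *)src);
        __m256i r2 = _mm256_shuffle_epi8(r1, mask);   /* R2 built from R1 */
        _mm256_storeu_si256((__m256i *)dst, r2);

        for (int i = 0; i < 8; i++) printf("%d ", dst[i]);
        printf("\n");   /* prints: 1 2 0 0 2 3 0 0 */
        return 0;
    }

For (src, target) pairs that cross the 128-bit boundary, a lane swap with _mm256_permute2x128_si256 (or a 32-bit-granularity _mm256_permutevar8x32_epi32) has to be combined with the byte shuffle.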

Handling zeroes in _mm256_rsqrt_ps()

Submitted by 折月煮酒 on 2019-11-30 09:47:57
Question: Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:

    _mm256_mul_ps(_mm256_rsqrt_ps(eightFloats), eightFloats);

is the way to go for that extra bit of performance and avoiding a pipeline stall. Unfortunately, with zero values I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better…
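One common guard, sketched below (the masking idea, not necessarily the poster's exact fix): 0 * rsqrt(0) is 0 * inf, which yields NaN, so compute as usual and then force the lanes whose input was exactly zero back to 0.0f with a compare mask:

    #include <immintrin.h>

    static inline __m256 fast_sqrt_ps(__m256 x) {
        __m256 approx = _mm256_mul_ps(_mm256_rsqrt_ps(x), x);
        __m256 iszero = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_EQ_OQ);
        return _mm256_andnot_ps(iszero, approx);   /* NaN lanes -> 0.0f */
    }

An alternative with the same effect is clamping the input upward with _mm256_max_ps against a tiny positive constant before the rsqrt, so the zero lane multiplies by a finite value.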

Setting __m256i to the value of two __m128i values

Submitted by 柔情痞子 on 2019-11-30 04:37:38
Question: So, AVX has a function from immintrin.h which should allow storing the concatenation of two __m128i values into a single __m256i value. The function is:

    __m256i _mm256_set_m128i (__m128i hi, __m128i lo)

However, when I use it, like so:

    __m256i as[2];
    __m128i s[4];
    as[0] = _mm256_setr_m128i(s[0], s[1]);

I get a compilation error:

    error: incompatible types when assigning to type ‘__m256i’ from type ‘int’

I don't really understand why this happens. Any help is greatly appreciated!

Answer 1: Not all…
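If the truncated answer goes where it appears to be heading (not all compilers/versions ship these intrinsics, so the call is treated as an undeclared function returning int), the portable fallback is to build the 256-bit value explicitly. A sketch using two intrinsics that plain AVX toolchains do provide (the helper name is mine):

    #include <immintrin.h>

    /* Fallback for toolchains whose immintrin.h lacks _mm256_set_m128i /
       _mm256_setr_m128i: widen the low half, then insert the high half. */
    static inline __m256i my_set_m128i(__m128i hi, __m128i lo) {
        return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    }

Note also that the question's snippet mixes the two variants: it names _mm256_set_m128i(hi, lo) but calls _mm256_setr_m128i, whose argument order is reversed (lo, hi).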

How to align stack at 32 byte boundary in GCC?

Submitted by 谁都会走 on 2019-11-30 04:28:49
Question: I'm using a MinGW64 build based on GCC 4.6.1 for a 64-bit Windows target. I'm playing around with Intel's new AVX instructions. My command-line arguments are -march=corei7-avx -mtune=corei7-avx -mavx. But I started running into segmentation faults when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment; however, the stack for 64-bit Windows has only 16-byte alignment.…
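Later GCC releases realign the frame automatically when a local needs 32-byte alignment; on the 4.6.1 MinGW-w64 build in question that was reportedly unreliable. One workaround sketch that sidesteps aligned stack slots for named locals by keeping 32-byte data in aligned heap storage (_mm_malloc/_mm_free ship with the intrinsics headers):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* 32-byte-aligned heap storage instead of stack locals. */
        float *buf = (float *)_mm_malloc(8 * sizeof(float), 32);
        if (!buf) return 1;
        for (int i = 0; i < 8; i++) buf[i] = (float)i;

        __m256 v = _mm256_load_ps(buf);   /* aligned load: buf is 32B-aligned */
        v = _mm256_add_ps(v, v);
        _mm256_store_ps(buf, v);

        printf("%.1f\n", buf[7]);         /* 14.0 */
        _mm_free(buf);
        return 0;
    }

Compiler-generated __m256 spills can still hit misaligned stack slots, so on a toolchain with this problem, upgrading the compiler remains the real fix.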

How to concatenate two vectors efficiently using AVX2? (a lane-crossing version of VPALIGNR)

Submitted by 馋奶兔 on 2019-11-30 03:28:51
Question: I have implemented an inline function (_mm256_concat_epi16). It concatenates two AVX2 vectors containing 16-bit values. It works fine for the first 8 numbers; if I want to use it for the rest of the vector I have to change the implementation, but it would be better to use a single inline function in my main program. The question is: is there any better solution than mine, or any suggestion to make this inline function more general, so that it works on 16 values instead of my solution that works on 8…
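The poster's _mm256_concat_epi16 isn't shown, but the standard lane-crossing replacement for VPALIGNR combines vperm2i128 with vpalignr. A sketch for one fixed shift of 3 elements (the helper name is mine):

    #include <immintrin.h>

    /* result[i] = concat(b:a)[i + 3] for 16-bit elements i = 0..15, i.e. a
       lane-crossing valignr by 6 bytes (AVX2).  mid has low lane = a_hi and
       high lane = b_lo, so vpalignr can stitch each output lane from the
       correct adjacent pair of 128-bit blocks. */
    static inline __m256i concat_shift3_epi16(__m256i a, __m256i b) {
        __m256i mid = _mm256_permute2x128_si256(a, b, 0x21);
        return _mm256_alignr_epi8(mid, a, 6);
    }

Because the byte offset of _mm256_alignr_epi8 must be an immediate, making this general over all 16 element shifts takes a macro, a C++ template, or a switch; shifts of 8 elements or more also need a different permute control.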

8 bit shift operation in AVX2 with shifting in zeros

Submitted by 前提是你 on 2019-11-30 03:02:33
Question: Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift an __m256i register by x bytes? _mm256_slli_si256 seems to just execute two _mm_slli_si128 operations, on a[127:0] and a[255:128]. The shift should work on a __m256i like this:

    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0]

I saw in another thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit granularity, but I need a more generic solution to shift by x bytes. Has…
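A sketch of the usual whole-register construction (AVX2): in array order it computes out[i] = in[i+N] with zeros entering at the top, exactly the operation illustrated above; in bit terms that is a logical right shift of the 256-bit value even though the elements move "left". N must be a compile-time constant in 1..15 because vpalignr takes an immediate, hence a macro rather than a function:

    #include <immintrin.h>

    /* Control 0x81 makes vperm2i128 return (low lane = in_hi, high lane = 0);
       bit 7 of the control zeroes the high lane.  vpalignr then pulls each
       output lane from the adjacent pair of 128-bit blocks, so the top lane
       shifts in zeros. */
    #define SHIFT_IN_ZEROS_BYTES(in, N)                          \
        _mm256_alignr_epi8(                                      \
            _mm256_permute2x128_si256((in), (in), 0x81),         \
            (in), (N))

For N = 16 the permute alone is the answer, and N in 17..31 needs a second variant with a zeroed vector; a switch over N (or C++ templates) covers the general x-byte case.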