avx

AVX/SSE version of xorshift128+

Submitted by 蓝咒 on 2019-11-30 14:58:49
Question: I am trying to make the fastest possible high-quality RNG. Having read http://xorshift.di.unimi.it/ , xorshift128+ seems like a good option. The C code is:

    #include <stdint.h>

    uint64_t s[2];

    uint64_t next(void) {
        uint64_t s1 = s[0];
        const uint64_t s0 = s[1];
        s[0] = s0;
        s1 ^= s1 << 23; // a
        return (s[1] = (s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26))) + s0; // b, c
    }

I am sadly not an SSE/AVX expert, but my CPU supports the SSE4.1 / SSE4.2 / AVX / F16C / FMA3 / XOP instructions. How could you use these to speed up this code (assuming you want to make billions of such random numbers) and what…
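The state update has a serial dependency, so a single stream cannot be vectorized; the standard trick is to run several independent streams side by side, one per 64-bit lane. Since the instruction list above is AVX1-class (no AVX2, hence no 256-bit integer operations), a two-stream SSE2 version is the natural fit. A minimal sketch; the seed constants are arbitrary placeholders, not from the original post:

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* Two independent xorshift128+ streams, one per 64-bit lane.
       s0v/s1v hold the s[0]/s[1] words of both streams. */
    static __m128i s0v, s1v;

    static void seed2(void) {
        /* hypothetical placeholder seeds; any distinct nonzero values work */
        s0v = _mm_set_epi64x(0x123456789abcdef0LL, 0x0fedcba987654321LL);
        s1v = _mm_set_epi64x(0x2545f4914f6cdd1dLL, 0x3c6ef372fe94f82bLL);
    }

    static inline __m128i next2(void) {
        __m128i s1 = s0v;
        const __m128i s0 = s1v;
        s0v = s0;
        s1 = _mm_xor_si128(s1, _mm_slli_epi64(s1, 23));            /* a */
        s1 = _mm_xor_si128(_mm_xor_si128(s1, s0),
                           _mm_xor_si128(_mm_srli_epi64(s1, 17),   /* b */
                                         _mm_srli_epi64(s0, 26))); /* c */
        s1v = s1;
        return _mm_add_epi64(s1, s0);  /* two 64-bit results per call */
    }

On an AVX2 machine the same body widens mechanically to __m256i with the _mm256_* equivalents, yielding four streams per call.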

Have different optimizations (plain, SSE, AVX) in the same executable with C/C++

Submitted by 本秂侑毒 on 2019-11-30 13:58:06
I'm developing optimizations for my 3D calculations and I now have: a "plain" version using the standard C language libraries, an SSE-optimized version that compiles under a preprocessor #define USE_SSE, and an AVX-optimized version that compiles under a preprocessor #define USE_AVX. Is it possible to switch between the 3 versions without having to compile different executables (e.g. having different library files and loading the "right" one dynamically; I don't know if inline functions are "right" for that)? I'd also consider the performance cost of having this kind of switch in the software. There are…
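The usual way to get this without separate executables is to compile all three variants into one binary (each SIMD body in its own translation unit built with the matching -msse2/-mavx flags) and select one through a function pointer at startup. A minimal sketch using the GCC/Clang CPU-detection builtins; the function names and trivial bodies are invented for illustration:

    #include <stdio.h>

    /* All three variants live in the same binary.  In a real build each
       SIMD body sits in its own translation unit compiled with -msse2 or
       -mavx; trivial placeholder bodies are used here. */
    static void scale_plain(float *v, int n) { for (int i = 0; i < n; i++) v[i] *= 2.0f; }
    static void scale_sse(float *v, int n)   { for (int i = 0; i < n; i++) v[i] *= 2.0f; }
    static void scale_avx(float *v, int n)   { for (int i = 0; i < n; i++) v[i] *= 2.0f; }

    /* Selected once at startup; every later call is one indirect jump. */
    static void (*scale)(float *, int) = 0;

    static void init_dispatch(void) {
        __builtin_cpu_init();                      /* GCC 4.8+ / Clang */
        if (__builtin_cpu_supports("avx"))         scale = scale_avx;
        else if (__builtin_cpu_supports("sse2"))   scale = scale_sse;
        else                                       scale = scale_plain;
    }

    int main(void) {
        float v[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        init_dispatch();
        scale(v, 4);
        printf("%.1f %.1f %.1f %.1f\n", v[0], v[1], v[2], v[3]);
        return 0;
    }

The indirect call costs a few cycles per invocation, which is negligible as long as each dispatched function does a meaningful chunk of 3D work; dispatching at a coarse granularity (whole-array rather than per-element) keeps the overhead invisible.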

Are different mmx, sse and avx versions complementary or supersets of each other?

Submitted by 巧了我就是萌 on 2019-11-30 11:32:21
Question: I'm thinking I should familiarize myself with the x86 SIMD extensions, but before I even began I ran into trouble: I can't find a good overview of which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions over the decades: MMX, 3DNow!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX-512. Did I forget something? Are the newer ones supersets of the older ones and vice versa? Or are they complementary? Are some of them deprecated? Which of these are still relevant? I…
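As a practical footnote rather than an answer to the superset question: on GCC/Clang you can simply ask the running CPU which levels it reports. A sketch (needs a reasonably recent GCC or Clang, and the argument to __builtin_cpu_supports must be a string literal; MMX and 3DNow! are omitted because 3DNow! was AMD-only and has been dropped, and MMX is obsolete for new code):

    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();
        printf("sse:     %d\n", __builtin_cpu_supports("sse")     != 0);
        printf("sse2:    %d\n", __builtin_cpu_supports("sse2")    != 0);
        printf("sse3:    %d\n", __builtin_cpu_supports("sse3")    != 0);
        printf("ssse3:   %d\n", __builtin_cpu_supports("ssse3")   != 0);
        printf("sse4.1:  %d\n", __builtin_cpu_supports("sse4.1")  != 0);
        printf("sse4.2:  %d\n", __builtin_cpu_supports("sse4.2")  != 0);
        printf("avx:     %d\n", __builtin_cpu_supports("avx")     != 0);
        printf("avx2:    %d\n", __builtin_cpu_supports("avx2")    != 0);
        printf("avx512f: %d\n", __builtin_cpu_supports("avx512f") != 0);
        return 0;
    }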

Shuffling by mask with Intel AVX

Submitted by 自古美人都是妖i on 2019-11-30 09:53:37
I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, into an empty register, R2. I want to define a mask which tells the shuffle operation which byte from the old register (R1) should be copied to which place in the new register. The mask should look like this (src: byte position in R1, target: byte position in R2): {(0,0), (1,1), (1,4), (2,5), ...}. This means several bytes are copied twice. I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions; the second one just uses 2 lanes. __m256 _mm256…
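The byte-granularity shuffle described here matches vpshufb. A sketch assuming AVX2 is available (_mm256_shuffle_epi8): it shuffles within each 128-bit lane, which suffices for this mask because source bytes 0..2 all sit in the low lane; a mask byte with its top bit set (written -1 below) produces a zero. The first four (src, target) pairs from the question are encoded:

    #include <immintrin.h>
    #include <stdio.h>

    /* target 0 <- src 0, target 1 <- src 1, target 4 <- src 1,
       target 5 <- src 2; every other output byte is zeroed. */
    int main(void) {
        unsigned char src[32], dst[32];
        for (int i = 0; i < 32; i++) src[i] = (unsigned char)(i + 1);

        const __m256i mask = _mm256_setr_epi8(
            0, 1, -1, -1, 1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);

        __m256i r1 = _mm256_loadu_si256((const __m256i *)src);
        __m256i r2 = _mm256_shuffle_epi8(r1, mask);   /* R2 built from R1 */
        _mm256_storeu_si256((__m256i *)dst, r2);

        for (int i = 0; i < 8; i++) printf("%d ", dst[i]);
        printf("\n");   /* prints: 1 2 0 0 2 3 0 0 */
        return 0;
    }

For (src, target) pairs that cross the 128-bit boundary, a lane swap with _mm256_permute2x128_si256 (or a 32-bit-granularity _mm256_permutevar8x32_epi32) has to be combined with the byte shuffle.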

Handling zeroes in _mm256_rsqrt_ps()

Submitted by 折月煮酒 on 2019-11-30 09:47:57
Question: Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:

    _mm256_mul_ps(_mm256_rsqrt_ps(eightFloats), eightFloats);

is the way to go for that extra bit of performance and avoiding a pipeline stall. Unfortunately, with zero values I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better…
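One common guard, sketched below (the masking idea, not necessarily the poster's exact fix): 0 * rsqrt(0) is 0 * inf, which yields NaN, so compute as usual and then force the lanes whose input was exactly zero back to 0.0f with a compare mask:

    #include <immintrin.h>

    static inline __m256 fast_sqrt_ps(__m256 x) {
        __m256 approx = _mm256_mul_ps(_mm256_rsqrt_ps(x), x);
        __m256 iszero = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_EQ_OQ);
        return _mm256_andnot_ps(iszero, approx);   /* NaN lanes -> 0.0f */
    }

An alternative with the same effect is clamping the input upward with _mm256_max_ps against a tiny positive constant before the rsqrt, so the zero lane multiplies by a finite value.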

Setting __m256i to the value of two __m128i values

Submitted by 柔情痞子 on 2019-11-30 04:37:38
Question: So, AVX has a function from immintrin.h which should allow storing the concatenation of two __m128i values into a single __m256i value. The function is:

    __m256i _mm256_set_m128i (__m128i hi, __m128i lo)

However, when I use it, like so:

    __m256i as[2];
    __m128i s[4];
    as[0] = _mm256_setr_m128i(s[0], s[1]);

I get a compilation error:

    error: incompatible types when assigning to type ‘__m256i’ from type ‘int’

I don't really understand why this happens. Any help is greatly appreciated!

Answer 1: Not all…
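If the truncated answer goes where it appears to be heading (not all compilers/versions ship these intrinsics, so the call is treated as an undeclared function returning int), the portable fallback is to build the 256-bit value explicitly. A sketch using two intrinsics that plain AVX toolchains do provide (the helper name is mine):

    #include <immintrin.h>

    /* Fallback for toolchains whose immintrin.h lacks _mm256_set_m128i /
       _mm256_setr_m128i: widen the low half, then insert the high half. */
    static inline __m256i my_set_m128i(__m128i hi, __m128i lo) {
        return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    }

Note also that the question's snippet mixes the two variants: it names _mm256_set_m128i(hi, lo) but calls _mm256_setr_m128i, whose argument order is reversed (lo, hi).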

How to align stack at 32 byte boundary in GCC?

Submitted by 谁都会走 on 2019-11-30 04:28:49
Question: I'm using a MinGW64 build based on GCC 4.6.1 for a 64-bit Windows target. I'm playing around with Intel's new AVX instructions. My command-line arguments are -march=corei7-avx -mtune=corei7-avx -mavx. But I started running into segmentation faults when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment; however, the stack for 64-bit Windows has only 16-byte alignment.…
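Later GCC releases realign the frame automatically when a local needs 32-byte alignment; on the 4.6.1 MinGW-w64 build in question that was reportedly unreliable. One workaround sketch that sidesteps aligned stack slots for named locals by keeping 32-byte data in aligned heap storage (_mm_malloc/_mm_free ship with the intrinsics headers):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* 32-byte-aligned heap storage instead of stack locals. */
        float *buf = (float *)_mm_malloc(8 * sizeof(float), 32);
        if (!buf) return 1;
        for (int i = 0; i < 8; i++) buf[i] = (float)i;

        __m256 v = _mm256_load_ps(buf);   /* aligned load: buf is 32B-aligned */
        v = _mm256_add_ps(v, v);
        _mm256_store_ps(buf, v);

        printf("%.1f\n", buf[7]);         /* 14.0 */
        _mm_free(buf);
        return 0;
    }

Compiler-generated __m256 spills can still hit misaligned stack slots, so on a toolchain with this problem, upgrading the compiler remains the real fix.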

How to concatenate two vectors efficiently using AVX2? (a lane-crossing version of VPALIGNR)

Submitted by 馋奶兔 on 2019-11-30 03:28:51
Question: I have implemented an inline function (_mm256_concat_epi16). It concatenates two AVX2 vectors containing 16-bit values. It works fine for the first 8 numbers; if I want to use it for the rest of the vector I have to change the implementation, but it would be better to use a single inline function in my main program. The question is: is there any better solution than mine, or any suggestion to make this inline function more general, so that it works on 16 values instead of my solution that works on 8…
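The poster's _mm256_concat_epi16 isn't shown, but the standard lane-crossing replacement for VPALIGNR combines vperm2i128 with vpalignr. A sketch for one fixed shift of 3 elements (the helper name is mine):

    #include <immintrin.h>

    /* result[i] = concat(b:a)[i + 3] for 16-bit elements i = 0..15, i.e. a
       lane-crossing valignr by 6 bytes (AVX2).  mid has low lane = a_hi and
       high lane = b_lo, so vpalignr can stitch each output lane from the
       correct adjacent pair of 128-bit blocks. */
    static inline __m256i concat_shift3_epi16(__m256i a, __m256i b) {
        __m256i mid = _mm256_permute2x128_si256(a, b, 0x21);
        return _mm256_alignr_epi8(mid, a, 6);
    }

Because the byte offset of _mm256_alignr_epi8 must be an immediate, making this general over all 16 element shifts takes a macro, a C++ template, or a switch; shifts of 8 elements or more also need a different permute control.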

8 bit shift operation in AVX2 with shifting in zeros

Submitted by 前提是你 on 2019-11-30 03:02:33
Question: Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift an __m256i register by x bytes? _mm256_slli_si256 seems to just execute two _mm_slli_si128 operations, on a[127:0] and a[255:128]. The shift should work on a __m256i like this:

    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0]

I saw in another thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit granularity, but I need a more generic solution to shift by x bytes. Has…
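A sketch of the usual whole-register construction (AVX2): in array order it computes out[i] = in[i+N] with zeros entering at the top, exactly the operation illustrated above; in bit terms that is a logical right shift of the 256-bit value even though the elements move "left". N must be a compile-time constant in 1..15 because vpalignr takes an immediate, hence a macro rather than a function:

    #include <immintrin.h>

    /* Control 0x81 makes vperm2i128 return (low lane = in_hi, high lane = 0);
       bit 7 of the control zeroes the high lane.  vpalignr then pulls each
       output lane from the adjacent pair of 128-bit blocks, so the top lane
       shifts in zeros. */
    #define SHIFT_IN_ZEROS_BYTES(in, N)                          \
        _mm256_alignr_epi8(                                      \
            _mm256_permute2x128_si256((in), (in), 0x81),         \
            (in), (N))

For N = 16 the permute alone is the answer, and N in 17..31 needs a second variant with a zeroed vector; a switch over N (or C++ templates) covers the general x-byte case.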