sse

AVX/SSE version of xorshift128+

余生颓废 submitted on 2019-11-30 12:41:15
I am trying to make the fastest possible high-quality RNG. Having read http://xorshift.di.unimi.it/, xorshift128+ seems like a good option. The C code is

    #include <stdint.h>

    uint64_t s[ 2 ];

    uint64_t next(void) {
        uint64_t s1 = s[ 0 ];
        const uint64_t s0 = s[ 1 ];
        s[ 0 ] = s0;
        s1 ^= s1 << 23; // a
        return ( s[ 1 ] = ( s1 ^ s0 ^ ( s1 >> 17 ) ^ ( s0 >> 26 ) ) ) + s0; // b, c
    }

I am not an SSE/AVX expert, sadly, but my CPU supports SSE4.1 / SSE4.2 / AVX / F16C / FMA3 / XOP instructions. How could you use these to speed up this code (assuming you want to make billions of such random numbers) and what
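One possible direction, sketched here under assumptions rather than taken from any answer: each xorshift128+ step depends on the previous one, so a single stream does not vectorize, but two independent streams can be stepped side by side in one 128-bit register using only SSE2. The struct `xs128p_x2` and both function names are hypothetical; the scalar version is the question's code with the state passed explicitly so the two can be compared.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Scalar reference: the question's algorithm with explicit state. */
static uint64_t xs128p_next(uint64_t s[2]) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26);
    return s[1] + s0;
}

/* Two independent generators (hypothetical layout):
 * 64-bit lane i of x holds s[0] of generator i, lane i of y holds s[1]. */
typedef struct { __m128i x, y; } xs128p_x2;

/* Advances both generators and returns their two 64-bit outputs. */
static __m128i xs128p_x2_next(xs128p_x2 *st) {
    __m128i s1 = st->x;
    const __m128i s0 = st->y;
    st->x = s0;
    s1 = _mm_xor_si128(s1, _mm_slli_epi64(s1, 23));
    st->y = _mm_xor_si128(_mm_xor_si128(s1, s0),
                          _mm_xor_si128(_mm_srli_epi64(s1, 17),
                                        _mm_srli_epi64(s0, 26)));
    return _mm_add_epi64(st->y, s0);
}
```

The two lanes must be seeded differently, otherwise both streams produce the same numbers. Whether this actually beats the scalar loop depends on the CPU and on how the outputs are consumed, so it needs measuring.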

Flipping sign on packed SSE floats

两盒软妹~` submitted on 2019-11-30 11:42:36
I'm looking for the most efficient method of flipping the sign on all four floats packed in an SSE register. I have not found an intrinsic for doing this in the Intel Architecture software dev manual. Below are the things I've already tried. For each case I looped over the code 10 billion times and got the wall time indicated. I'm trying to at least match the 4 seconds my non-SIMD approach takes, which just uses the unary minus operator.

[48 sec] _mm_sub_ps( _mm_setzero_ps(), vec );
[32 sec] _mm_mul_ps( _mm_set1_ps( -1.0f ), vec );
[9 sec] union NegativeMask { int intRep; float fltRep; }
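A common way to do this, shown as a minimal sketch rather than as the accepted answer: XOR every lane with a constant that has only the sign bit set. `_mm_set1_ps(-0.0f)` produces exactly that bit pattern (0x80000000 per lane). The helper name `negate_ps` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE */

/* Flips the sign bit of all four packed floats. */
static inline __m128 negate_ps(__m128 v) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* 0x80000000 in each lane */
    return _mm_xor_ps(v, sign_mask);
}
```

In a loop, compilers normally hoist the constant out, so each iteration costs a single xorps. Timings like those listed in the question are also worth re-checking with optimization enabled, since unoptimized intrinsic code can be very slow.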

Are different mmx, sse and avx versions complementary or supersets of each other?

巧了我就是萌 submitted on 2019-11-30 11:32:21
Question: I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview of which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions over the decades: MMX, 3DNow!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512. Did I forget something? Are the newer ones supersets of the older ones and vice versa? Or are they complementary? Are some of them deprecated? Which of these are still relevant? I

How to move 128-bit immediates to XMM registers

烂漫一生 submitted on 2019-11-30 10:58:23
There already is a question on this, but it was closed as "ambiguous", so I'm opening a new one. I've found the answer; maybe it will help others too. The question is: how do you write a sequence of assembly code to initialize an XMM register with a 128-bit immediate (constant) value?

Norbert P.: Just wanted to add that one can read about generating various constants using assembly in Agner Fog's manual Optimizing subroutines in assembly language, Generating constants, section 13.8, page 134.

Paul R: You can do it like this, with just one movaps instruction: .section .rodata # put your
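For readers working in C rather than assembly, roughly the same idea can be sketched with intrinsics: keep the constant in 16-byte-aligned read-only data and load it with one aligned load, which compilers typically turn into a single movaps/movdqa. The names `k_const` and `load_const128` and the byte values are hypothetical, and the `aligned` attribute is GCC/Clang/ICC syntax.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* 128-bit constant in read-only data, aligned for an aligned load.
 * On little-endian x86 the bytes in memory are 00 11 22 ... FF. */
static const uint32_t k_const[4] __attribute__((aligned(16))) = {
    0x33221100u, 0x77665544u, 0xBBAA9988u, 0xFFEEDDCCu
};

/* Loads the constant into an XMM register. */
static inline __m128i load_const128(void) {
    return _mm_load_si128((const __m128i *)k_const);
}
```

`_mm_set_epi32` with literal arguments is an alternative; the compiler then chooses how to materialize the constant itself.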

Shuffling by mask with Intel AVX

自古美人都是妖i submitted on 2019-11-30 09:53:37
I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, to an empty register, R2. I want to define a mask which tells the shuffle operation which byte from the old register (R1) should be copied to which place in the new register. The mask should look like this (Src: byte position in R1, Target: byte position in R2): {(0,0),(1,1),(1,4),(2,5),...} This means several bytes are copied twice. I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions; the second just uses 2 lanes. __m256 _mm256

Handling zeroes in _mm256_rsqrt_ps()

折月煮酒 submitted on 2019-11-30 09:47:57
Question: Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), it seems from looking around that doing _mm256_mul_ps(_mm256_rsqrt_ps(eightFloats), eightFloats); is the way to go for that extra bit of performance and to avoid a pipeline stall. Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better
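One way to handle the zero lanes, given as a sketch and not as the poster's own workaround (which is cut off above): for a zero input the reciprocal square root is +infinity, and infinity times zero is NaN, so the idea is to clear the rsqrt result in exactly those lanes before multiplying. The function name `approx_sqrt8` is hypothetical, and the `target("avx")` attribute is GCC/Clang syntax used so the block builds without extra flags; it still needs an AVX-capable CPU at run time.

```c
#include <immintrin.h>  /* AVX */

/* Approximates sqrt(x) for 8 floats as x * rsqrt(x), returning 0 for x == 0. */
__attribute__((target("avx")))
static void approx_sqrt8(const float *in, float *out) {
    const __m256 x  = _mm256_loadu_ps(in);
    /* All-ones in lanes where x != 0, all-zeros where x == 0. */
    const __m256 nz = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_NEQ_OQ);
    /* rsqrt(0) is +inf; masking turns it into 0 so that 0 * 0 = 0. */
    const __m256 rs = _mm256_and_ps(_mm256_rsqrt_ps(x), nz);
    _mm256_storeu_ps(out, _mm256_mul_ps(rs, x));
}
```

The result carries the limited precision of rsqrt (about 12 bits), which may or may not be acceptable before a floor: an input whose exact root is an integer can land just below it. Negative inputs still produce NaN.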

Alignment and SSE strange behaviour

大城市里の小女人 submitted on 2019-11-30 09:35:53
Question: I am trying to work with SSE and have run into some strange behaviour. I wrote simple code for comparing two strings with SSE intrinsics, ran it, and it worked. But later I realized that one of the pointers in my code is still not aligned, even though I use the _mm_load_si128 instruction, which requires a pointer aligned on a 16-byte boundary. //Compare two different, not overlapping piece of memory __attribute((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size) { //Skip tail for right
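A likely explanation, offered with some caution since the full code is not visible: when a function is compiled for AVX, the compiler may fold the load into a VEX-encoded instruction's memory operand, and those do not fault on misaligned addresses, so the misalignment goes unnoticed. Relying on that is fragile. A sketch of a comparison that is correct for any alignment uses `_mm_loadu_si128` instead; the name `is_equal_sse2` is hypothetical and is not the poster's function.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 */

/* Returns 1 if the first `size` bytes of a and b match, 0 otherwise.
 * Neither pointer needs any particular alignment. */
static int is_equal_sse2(const void *a, const void *b, size_t size) {
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    size_t i = 0;
    /* Compare 16 bytes at a time with unaligned loads. */
    for (; i + 16 <= size; i += 16) {
        const __m128i va = _mm_loadu_si128((const __m128i *)(p + i));
        const __m128i vb = _mm_loadu_si128((const __m128i *)(q + i));
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF) {
            return 0;
        }
    }
    /* Compare the remaining tail bytes one by one. */
    for (; i < size; i++) {
        if (p[i] != q[i]) {
            return 0;
        }
    }
    return 1;
}
```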

Auto vectorization not working

只愿长相守 submitted on 2019-11-30 09:14:50
Question: I'm trying to get my code to auto-vectorize, but it isn't working.

    int _tmain(int argc, _TCHAR* argv[]) {
        const int N = 4096;
        float x[N];
        float y[N];
        float sum = 0;
        // create random values for x and y
        for (int i = 0; i < N; i++) {
            x[i] = rand() >> 1;
            y[i] = rand() >> 1;
        }
        for (int i = 0; i < N; i++) {
            sum += x[i] * y[i];
        }
    }

Neither loop vectorizes here, but I'm really only interested in the second loop. I'm using Visual Studio Express 2013 and am compiling with the /O2 and /Qvec-report:2 (To
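One likely reason, stated as an assumption because the excerpt ends before any answer: vectorizing a float sum changes the order of the additions, which compilers generally refuse to do under strict floating-point rules, and a result that is never used may be optimized away entirely. If reordering is acceptable, the second loop can be vectorized by hand. The sketch below uses SSE with unaligned loads; the name `dot_sse` is hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Dot product of x and y over n floats, accumulating four partial sums. */
static float dot_sse(const float *x, const float *y, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(x + i),
                                         _mm_loadu_ps(y + i)));
    }
    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    /* Scalar tail for n not divisible by 4. */
    for (; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the additions happen in a different order, the result can differ from the scalar loop in the last bits.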

Testing equality between two __m128i variables

不想你离开。 submitted on 2019-11-30 09:13:28
If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use == ? If not, which SSE instruction should I use? Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction which sets the zero or carry flag in the FLAGS register. This saves a test or cmp instruction. You could do it like this: if(_mm_test_all_ones(_mm_cmpeq_epi8(v1,v2))) { //v1 == v2 } Edit: as Paul R pointed out, _mm_test_all_ones generates two instructions: pcmpeqd and ptest. With _mm
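The _mm_movemask_epi8 approach mentioned above needs only SSE2 and can be sketched as follows; the helper name `m128i_equal` is hypothetical. It compares all 16 bytes, collects one bit per byte, and checks that every bit is set.

```c
#include <emmintrin.h>  /* SSE2 */

/* Returns nonzero if a and b are bitwise identical. */
static inline int m128i_equal(__m128i a, __m128i b) {
    /* cmpeq gives 0xFF in each matching byte; movemask packs the
     * top bit of each of the 16 bytes into the low 16 bits. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

As for plain `==`: with the GCC and Clang vector extensions it yields an element-wise result vector rather than a single truth value, and MSVC does not define it for __m128i, so it is not a portable substitute.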

How to allocate 16byte memory aligned data

拥有回忆 submitted on 2019-11-30 08:52:15
I am trying to implement SSE vectorization on a piece of code for which I need my 1D array to be 16-byte aligned. However, I have tried several ways to allocate 16-byte-aligned data, but it ends up being only 4-byte aligned. I have to work with the Intel icc compiler. This is the sample code I am testing with: #include <stdio.h> #include <stdlib.h> void error(char *str) { printf("Error:%s\n",str); exit(-1); } int main() { int i; //float *A=NULL; float *A = (float*) memalign(16,20*sizeof(float)); //align // if (posix_memalign((void **)&A, 16, 20*sizeof(void*)) != 0) // error("Cannot
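Two things may be worth checking here, both assumptions since the rest of the code is not shown: memalign is declared in <malloc.h>, which the visible includes do not pull in, and alignment is easy to misjudge if the pointer is printed or inspected incorrectly. A minimal sketch using `_mm_malloc` / `_mm_free`, which GCC, Clang and icc provide alongside the SSE intrinsics, with the helper name `alloc_floats_aligned16` being hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE; also declares _mm_malloc / _mm_free */

/* Allocates `count` floats on a 16-byte boundary.
 * Returns NULL on failure; release with _mm_free, not free. */
static float *alloc_floats_aligned16(size_t count) {
    return (float *)_mm_malloc(count * sizeof(float), 16);
}

/* Returns 1 if p is 16-byte aligned, 0 otherwise. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}
```

With a pointer obtained this way, aligned loads and stores such as _mm_load_ps and _mm_store_ps are safe on every fourth float.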