sse | 易学教程

AVX/SSE version of xorshift128+

I am trying to make the fastest possible high-quality RNG. Having read http://xorshift.di.unimi.it/, xorshift128+ seems like a good option. The C code is:

    #include <stdint.h>
    uint64_t s[ 2 ];
    uint64_t next(void) {
        uint64_t s1 = s[ 0 ];
        const uint64_t s0 = s[ 1 ];
        s[ 0 ] = s0;
        s1 ^= s1 << 23; // a
        return ( s[ 1 ] = ( s1 ^ s0 ^ ( s1 >> 17 ) ^ ( s0 >> 26 ) ) ) + s0; // b, c
    }

I am not an SSE/AVX expert, sadly, but my CPU supports SSE4.1 / SSE4.2 / AVX / F16C / FMA3 / XOP instructions. How could you use these to speed up this code (assuming you want to make billions of such random numbers) and what
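One possible direction, sketched here under assumptions rather than taken from the truncated excerpt, is to run two independent xorshift128+ generators side by side in the two 64-bit lanes of an SSE2 register. The type `xs128p_x2` and all function names are hypothetical, the seeds are arbitrary, and whether this is actually faster than the scalar code depends on the compiler and CPU.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Scalar reference: the xorshift128+ step from the question, with explicit state. */
static uint64_t xs128p_next(uint64_t s[2])
{
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23;                            /* a */
    s[1] = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26);  /* b, c */
    return s[1] + s0;
}

/* Hypothetical state type: 64-bit lane i of s[0] and s[1] belongs to generator i. */
typedef struct { __m128i s[2]; } xs128p_x2;

/* Advance both generators at once and return their two 64-bit outputs. */
static __m128i xs128p_x2_next(xs128p_x2 *g)
{
    __m128i s1 = g->s[0];
    const __m128i s0 = g->s[1];
    g->s[0] = s0;
    s1 = _mm_xor_si128(s1, _mm_slli_epi64(s1, 23));             /* a */
    g->s[1] = _mm_xor_si128(_mm_xor_si128(s1, s0),
                            _mm_xor_si128(_mm_srli_epi64(s1, 17),
                                          _mm_srli_epi64(s0, 26))); /* b, c */
    return _mm_add_epi64(g->s[1], s0);
}

/* Self-check: each lane must reproduce the scalar generator with the same seed. */
static void check_lanes_match_scalar(void)
{
    uint64_t a[2] = {1, 2}, b[2] = {3, 4};  /* arbitrary nonzero seeds */
    xs128p_x2 g;
    /* _mm_set_epi64x takes the high lane first, so the low lane is generator a. */
    g.s[0] = _mm_set_epi64x((long long)b[0], (long long)a[0]);
    g.s[1] = _mm_set_epi64x((long long)b[1], (long long)a[1]);
    for (int i = 0; i < 1000; i++) {
        uint64_t out[2];
        _mm_storeu_si128((__m128i *)out, xs128p_x2_next(&g));
        uint64_t ra = xs128p_next(a);
        uint64_t rb = xs128p_next(b);
        assert(out[0] == ra);
        assert(out[1] == rb);
    }
}
```

The two lanes never interact, so each lane is an ordinary xorshift128+ stream; the lanes must be seeded differently or they will produce identical output.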

Flipping sign on packed SSE floats

I'm looking for the most efficient method of flipping the sign on all four floats packed in an SSE register. I have not found an intrinsic for doing this in the Intel Architecture software developer's manual. Below are the things I've already tried. For each case I looped over the code 10 billion times and got the wall time indicated. I'm trying to at least match the 4 seconds it takes my non-SIMD approach, which uses just the unary minus operator.

    [48 sec] _mm_sub_ps( _mm_setzero_ps(), vec );
    [32 sec] _mm_mul_ps( _mm_set1_ps( -1.0f ), vec );
    [9 sec] union NegativeMask { int intRep; float fltRep; }
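A technique commonly used for this, shown here as a minimal sketch and not as the answer the excerpt cuts off before, is to XOR every lane with a mask in which only the sign bit is set. The constant -0.0f has exactly that bit pattern. The names `negate_ps` and `check_negate` are hypothetical, and no timing is claimed.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_xor_ps, _mm_set1_ps */

/* Flip the sign of all four packed floats by toggling bit 31 of each lane.
   -0.0f has only the sign bit set, so it serves as the XOR mask. */
static inline __m128 negate_ps(__m128 v)
{
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}

/* Self-check on one vector; _mm_setr_ps lists lanes in memory order. */
static void check_negate(void)
{
    float out[4];
    _mm_storeu_ps(out, negate_ps(_mm_setr_ps(1.0f, -2.5f, 0.0f, 3.0f)));
    assert(out[0] == -1.0f);
    assert(out[1] == 2.5f);
    assert(out[2] == 0.0f);  /* the result is -0.0f, which compares equal to 0.0f */
    assert(out[3] == -3.0f);
}
```

Unlike subtracting from zero or multiplying by -1.0f, the XOR is a pure bitwise operation, so it also flips the sign of zeros, infinities and NaNs without any floating-point arithmetic.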

Are different mmx, sse and avx versions complementary or supersets of each other?

I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began, I ran into trouble: I can't find a good overview of which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions over the decades: MMX, 3DNow!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512. Did I forget something? Are the newer ones supersets of the older ones and vice versa? Or are they complementary? Are some of them deprecated? Which of these are still relevant? I

How to move 128-bit immediates to XMM registers

There already is a question on this, but it was closed as "ambiguous", so I'm opening a new one. I've found the answer; maybe it will help others too. The question is: how do you write a sequence of assembly code to initialize an XMM register with a 128-bit immediate (constant) value?

Norbert P.: Just wanted to add that one can read about generating various constants using assembly in Agner Fog's manual "Optimizing subroutines in assembly language", Generating constants, section 13.8, page 134.

Paul R: You can do it like this, with just one movaps instruction:

    .section .rodata # put your

Shuffling by mask with Intel AVX

I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, to an empty register, R2. I want to define a mask which tells the shuffle operation which byte from the old register (R1) should be copied to which place in the new register. The mask should look like this (Src: byte position in R1, Target: byte position in R2): {(0,0),(1,1),(1,4),(2,5),...} This means several bytes are copied twice. I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions; the second just uses 2 lanes. __m256 _mm256

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing

    _mm256_mul_ps(_mm256_rsqrt_ps(eightFloats), eightFloats);

is the way to go for that extra bit of performance and for avoiding a pipeline stall. Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better

Alignment and SSE strange behaviour

I am trying to work with SSE and ran into some strange behaviour. I wrote simple code for comparing two strings with SSE intrinsics, ran it, and it worked. But later I realized that one of the pointers in my code is still not aligned, even though I use the _mm_load_si128 intrinsic, which requires a pointer aligned on a 16-byte boundary.

    //Compare two different, not overlapping piece of memory
    __attribute((target("avx")))
    int is_equal(const void* src_1, const void* src_2, size_t size) {
        //Skip tail for right
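Independently of why the aligned load happened to work, a comparison routine can avoid the alignment requirement altogether by using _mm_loadu_si128, which accepts any address. The following is a minimal sketch under that assumption; it is not the asker's is_equal, and `buffers_equal` and `check_buffers_equal` are hypothetical names.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_cmpeq_epi8, _mm_movemask_epi8 */
#include <stddef.h>
#include <string.h>

/* Compare two non-overlapping buffers, 16 bytes per step, with no
   alignment requirement on either pointer. Returns 1 if equal, else 0. */
static int buffers_equal(const void *p1, const void *p2, size_t size)
{
    const unsigned char *a = (const unsigned char *)p1;
    const unsigned char *b = (const unsigned char *)p2;
    size_t i = 0;
    for (; i + 16 <= size; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* One mask bit per byte; 0xFFFF means all 16 bytes matched. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;
    }
    /* Fewer than 16 bytes remain; compare them byte-wise. */
    return memcmp(a + i, b + i, size - i) == 0;
}

/* Self-check with a deliberately misaligned second buffer (y + 1). */
static void check_buffers_equal(void)
{
    char x[40], y[41];
    memset(x, 7, sizeof x);
    memset(y, 7, sizeof y);
    assert(buffers_equal(x, y + 1, 40) == 1);
    y[38] = 8;  /* byte 37 of the compared range, inside the scalar tail */
    assert(buffers_equal(x, y + 1, 40) == 0);
}
```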

Auto vectorization not working

I'm trying to get my code to auto-vectorize, but it isn't working.

    int _tmain(int argc, _TCHAR* argv[])
    {
        const int N = 4096;
        float x[N];
        float y[N];
        float sum = 0;
        //create random values for x and y
        for (int i = 0; i < N; i++) {
            x[i] = rand() >> 1;
            y[i] = rand() >> 1;
        }
        for (int i = 0; i < N; i++){
            sum += x[i] * y[i];
        }
    }

Neither loop vectorizes here, but I'm really only interested in the second loop. I'm using Visual Studio Express 2013 and am compiling with the /O2 and /Qvec-report:2 (To

Testing equality between two __m128i variables

If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use ==? If not, which SSE instruction should I use?

Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction which sets the zero or carry flag in the FLAGS register. This saves a test or cmp instruction. To do this you could do this:

    if(_mm_test_all_ones(_mm_cmpeq_epi8(v1,v2))) {
        //v1 == v2
    }

Edit: as Paul R pointed out, _mm_test_all_ones generates two instructions: pcmpeqd and ptest. With _mm
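The _mm_movemask_epi8 route mentioned in the answer needs only SSE2 and can be sketched as follows. The helper names `m128i_equal` and `check_m128i_equal` are hypothetical; the SSE4.1 variant from the answer is not reproduced here because the excerpt is truncated.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */

/* Return 1 when all 128 bits of a and b are identical, 0 otherwise.
   _mm_cmpeq_epi8 sets each matching byte to 0xFF; _mm_movemask_epi8
   collects the top bit of every byte into a 16-bit mask. */
static inline int m128i_equal(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

/* Self-check with vectors that differ in a single 32-bit element. */
static void check_m128i_equal(void)
{
    __m128i v1 = _mm_set_epi32(1, 2, 3, 4);
    __m128i v2 = _mm_set_epi32(1, 2, 3, 4);
    __m128i v3 = _mm_set_epi32(1, 2, 3, 5);
    assert(m128i_equal(v1, v2) == 1);
    assert(m128i_equal(v1, v3) == 0);
}
```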

How to allocate 16byte memory aligned data

I am trying to implement SSE vectorization on a piece of code for which I need my 1D array to be 16-byte memory aligned. However, I have tried several ways to allocate 16-byte memory aligned data but it ends up being 4-byte memory aligned. I have to work with the Intel icc compiler. This is a sample code I am testing with:

    #include <stdio.h>
    #include <stdlib.h>
    void error(char *str) { printf("Error:%s\n",str); exit(-1); }
    int main() {
        int i;
        //float *A=NULL;
        float *A = (float*) memalign(16,20*sizeof(float)); //align
        // if (posix_memalign((void **)&A, 16, 20*sizeof(void*)) != 0)
        //     error("Cannot
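As an alternative to the memalign and posix_memalign calls in the sample, C11 provides aligned_alloc. The sketch below assumes a C11 library is available (it is not known whether the asker's icc setup has one); `alloc_floats_16`, `is_aligned_16` and `check_alloc` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for n floats on a 16-byte boundary, or return NULL.
   C11 asks for a size that is a multiple of the alignment, so the byte
   count is rounded up to the next multiple of 16. Release with free(). */
static float *alloc_floats_16(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    if (rounded == 0)
        rounded = 16;
    return (float *)aligned_alloc(16, rounded);
}

/* Check the alignment by inspecting the low four address bits. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Self-check: allocate 20 floats as in the sample and verify alignment. */
static void check_alloc(void)
{
    float *A = alloc_floats_16(20);
    assert(A != NULL);
    assert(is_aligned_16(A));
    A[19] = 1.0f;  /* the last requested element is writable */
    free(A);
}
```

Printing the pointer with %p, or testing it as in is_aligned_16, is a more reliable way to confirm alignment than inferring it from element addresses, since consecutive floats are always 4 bytes apart.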