SSE

Multiplication using SSE: (x*x*x)+(y*y*y)

Submitted by 廉价感情 on 2020-01-06 14:18:10
Question: I'm trying to optimize this function using SIMD, but I don't know where to start.

    long sum(int x, int y) {
        return x*x*x + y*y*y;
    }

The disassembled function looks like this:

    4007a0: 48 89 f2             mov  %rsi,%rdx
    4007a3: 48 89 f8             mov  %rdi,%rax
    4007a6: 48 0f af d6          imul %rsi,%rdx
    4007aa: 48 0f af c7          imul %rdi,%rax
    4007ae: 48 0f af d6          imul %rsi,%rdx
    4007b2: 48 0f af c7          imul %rdi,%rax
    4007b6: 48 8d 04 02          lea  (%rdx,%rax,1),%rax
    4007ba: c3                   retq
    4007bb: 0f 1f 44 00 00       nopl 0x0(%rax,%rax,1)

The calling code

Matrix Multiplication of size 100*100 using SSE Intrinsics

Submitted by 狂风中的少年 on 2020-01-06 06:51:45
Question:

    int MAX_DIM = 100;
    float a[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));
    float b[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));
    float d[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));

    /* I fill these arrays with some values */

    for (int i = 0; i < MAX_DIM; i += 1) {
        for (int j = 0; j < MAX_DIM; j += 4) {
            for (int k = 0; k < MAX_DIM; k += 4) {
                __m128 result  = _mm_load_ps(&d[i][j]);
                __m128 a_line  = _mm_load_ps(&a[i][k]);
                __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
                __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
                __m128 b_line2

How to use _mm_extract_epi8 function? [duplicate]

Submitted by 核能气质少年 on 2020-01-06 04:43:06
Question: This question already has an answer here: _mm_extract_epi8(…) intrinsic that takes a non-literal integer as argument (1 answer). Closed 11 months ago.

I am using the _mm_extract_epi8(__m128i a, const int imm8) function, which has a const int parameter. When I compile this C++ code, I get the following error message: Error C2057: expected constant expression.

    __m128i a;
    for (int i = 0; i < 16; i++) {
        _mm_extract_epi8(a, i); // compilation error
    }

How can I use this function in a loop?

Answer 1: First of

Can I compile OpenCL code into ordinary, OpenCL-free binaries?

Submitted by 痴心易碎 on 2020-01-05 15:19:14
Question: I am evaluating OpenCL for my purposes. It occurred to me that you can't assume it works out of the box on either Windows or Mac, because:

- Windows needs an OpenCL driver (which, of course, can be installed)
- MacOS supports OpenCL only on MacOS >= 10.6

So I'd have to write FPU/SSE/AVX code and OpenCL code separately to produce two binaries: one without and one with OpenCL support. It would be much better if I could compile the OpenCL code at compile time into SSE/AVX and then ship a binary without OpenCL in

different results with and without SSE ( float arrays multiplication)

Submitted by 不羁的心 on 2020-01-04 08:06:12
Question: I have two functions that multiply 2D arrays. One of them uses SSE; the other has no optimization. Both functions work well, but the results are slightly different, for example 20.333334 versus 20.333332. Can you explain why the results differ, and what I can do so both functions give the same result?

Function with SSE:

    float** sse_multiplication(float** array1, float** array2, float** arraycheck)
    {
        int i, j, k;
        float *ms1, *ms2, result;
        float *end_loop;

        for (i = 0; i <

Are there Move (_mm_move_ss) and Set (_mm_set_ss) intrinsics that work for doubles (__m128d)?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2020-01-04 05:17:11
Question: Over the years, I have a few times seen intrinsics code in which a float parameter gets transformed to a __m128 with the following pattern: __m128 b = _mm_move_ss(m, _mm_set_ss(a));. For instance:

    void MyFunction(float y)
    {
        __m128 a = _mm_move_ss(m, _mm_set_ss(y)); // m is __m128
        // do whatever it is with 'a'
    }

I wonder if there is a similar way of using _mm_move and _mm_set intrinsics to do the same for doubles (__m128d)?

Answer 1: Almost every _ss and _ps intrinsic / instruction has a double

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Submitted by 泪湿孤枕 on 2020-01-03 18:18:19
Question: The Intel intrinsics guide states simply that _mm512_load_epi32:

    Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst

and that _mm512_load_si512:

    Load[s] 512-bits of integer data from memory into dst

What is the difference between these two? The documentation isn't clear.

Answer 1: There's no difference; it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can