SSE

Multiplication using SSE: (x*x*x)+(y*y*y)

Submitted by 廉价感情 on 2020-01-06 14:18:10
Question: I'm trying to optimize this function using SIMD, but I don't know where to start.

    long sum(int x, int y) {
        return x*x*x + y*y*y;
    }

The disassembled function looks like this:

    4007a0: 48 89 f2             mov  %rsi,%rdx
    4007a3: 48 89 f8             mov  %rdi,%rax
    4007a6: 48 0f af d6          imul %rsi,%rdx
    4007aa: 48 0f af c7          imul %rdi,%rax
    4007ae: 48 0f af d6          imul %rsi,%rdx
    4007b2: 48 0f af c7          imul %rdi,%rax
    4007b6: 48 8d 04 02          lea  (%rdx,%rax,1),%rax
    4007ba: c3                   retq
    4007bb: 0f 1f 44 00 00       nopl 0x0(%rax,%rax,1)

The calling code

Matrix Multiplication of size 100*100 using SSE Intrinsics

Submitted by 狂风中的少年 on 2020-01-06 06:51:45
Question:

    int MAX_DIM = 100;
    float a[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));
    float b[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));
    float d[MAX_DIM][MAX_DIM] __attribute__((aligned(16)));

    /* I fill these arrays with some values */

    for (int i = 0; i < MAX_DIM; i += 1) {
        for (int j = 0; j < MAX_DIM; j += 4) {
            for (int k = 0; k < MAX_DIM; k += 4) {
                __m128 result  = _mm_load_ps(&d[i][j]);
                __m128 a_line  = _mm_load_ps(&a[i][k]);
                __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
                __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
                __m128 b_line2

How to use _mm_extract_epi8 function? [duplicate]

Submitted by 核能气质少年 on 2020-01-06 04:43:06
Question: This question already has an answer here: _mm_extract_epi8(…) intrinsic that takes a non-literal integer as argument (1 answer). Closed 11 months ago.

I am using the _mm_extract_epi8(__m128i a, const int imm8) function, which has a const int parameter. When I compile this C++ code, I get the following error message: Error C2057: expected constant expression.

    __m128i a;
    for (int i = 0; i < 16; i++) {
        _mm_extract_epi8(a, i); // compilation error
    }

How can I use this function in a loop?

Answer 1: First of

Can I compile OpenCL code into ordinary, OpenCL-free binaries?

Submitted by 痴心易碎 on 2020-01-05 15:19:14
Question: I am evaluating OpenCL for my purposes. It occurred to me that you can't assume it works out of the box on either Windows or Mac, because:

- Windows needs an OpenCL driver (which, of course, can be installed)
- MacOS supports OpenCL only on MacOS >= 10.6

So I'd have to write FPU/SSE/AVX code and OpenCL code separately to produce two binaries: one without and one with OpenCL support. It would be much better if I could compile the OpenCL code at compile time into SSE/AVX and then ship a binary without OpenCL in

different results with and without SSE ( float arrays multiplication)

Submitted by 不羁的心 on 2020-01-04 08:06:12
Question: I have two functions that multiply 2D arrays. One of them uses SSE; the other has no optimization. Both functions work well, but the results are slightly different, for example 20.333334 versus 20.333332. Can you explain why the results differ, and what I can do so both functions give the same result?

Function with SSE:

    float** sse_multiplication(float** array1, float** array2, float** arraycheck)
    {
        int i, j, k;
        float *ms1, *ms2, result;
        float *end_loop;

        for (i = 0; i <

Are there Move (_mm_move_ss) and Set (_mm_set_ss) intrinsics that work for doubles (__m128d)?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2020-01-04 05:17:11
Question: Over the years, I have a few times seen intrinsics code in which a float parameter gets transformed to a __m128 with the following pattern: __m128 b = _mm_move_ss(m, _mm_set_ss(a));. For instance:

    void MyFunction(float y)
    {
        __m128 a = _mm_move_ss(m, _mm_set_ss(y)); // m is __m128
        // do whatever it is with 'a'
    }

I wonder if there is a similar way of using _mm_move and _mm_set intrinsics to do the same for doubles (__m128d)?

Answer 1: Almost every _ss and _ps intrinsic / instruction has a double

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Submitted by 泪湿孤枕 on 2020-01-03 18:18:19
Question: The Intel intrinsics guide states simply that _mm512_load_epi32:

    Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst

and that _mm512_load_si512:

    Load[s] 512-bits of integer data from memory into dst

What is the difference between these two? The documentation isn't clear.

Answer 1: There's no difference; it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can