simd

SIMD the following code

≯℡__Kan透↙ submitted on 2020-01-22 13:54:31
Question: How do I SIMDize the following code in C (using SIMD intrinsics, of course)? I am having trouble understanding SIMD intrinsics and this would help a lot:
    int sum_naive( int n, int *a )
    {
        int sum = 0;
        for( int i = 0; i < n; i++ )
            sum += a[i];
        return sum;
    }
Answer 1: Here's a fairly straightforward implementation (warning: untested code):
    int32_t sum_array(const int32_t a[], const int n)
    {
        __m128i vsum = _mm_set1_epi32(0); // initialise vector of four partial 32 bit sums
        int32_t sum;
        int i;
        for (i =
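Because the answer's code is cut off in this excerpt, the following is a minimal sketch of one way to vectorise the sum with SSE2 intrinsics. It is not the answerer's code: the name `sum_sse2`, the use of unaligned loads, and the scalar fallback for builds without SSE2 are assumptions made for illustration.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: sums n 32-bit integers, four lanes at a time. */
int32_t sum_sse2(const int32_t *a, int n)
{
    int32_t sum = 0;
    int i = 0;
#ifdef HAVE_SSE2
    __m128i vsum = _mm_setzero_si128();     /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        /* unaligned load, so the array needs no special alignment */
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        vsum = _mm_add_epi32(vsum, v);
    }
    /* Horizontal add: fold the upper two lanes onto the lower two,
       then fold lane 1 onto lane 0. */
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(1, 0, 3, 2)));
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(2, 3, 0, 1)));
    sum = _mm_cvtsi128_si32(vsum);
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

The main loop keeps four independent partial sums in one register and only combines them once at the end, so the per-iteration work is a single load and a single add.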

Intel SIMD - How can I check if an __m256* contains any non-zero values

坚强是说给别人听的谎言 submitted on 2020-01-21 12:09:10
Question: I am using the Microsoft Visual Studio compiler. I am trying to find out whether a 256-bit vector contains any non-zero values. I have tried res_simd = ! _mm256_testz_ps(*pSrc1, *pSrc1); but it does not work.
Answer 1: _mm256_testz_ps just tests the sign bits. To test the values, you'll need to compare against 0 and then extract the resulting mask, e.g.
    __m256 vcmp = _mm256_cmp_ps(*pSrc1, _mm256_set1_ps(0.0f), _CMP_EQ_OQ);
    int mask = _mm256_movemask_ps(vcmp);
    bool any_nz = mask != 0xff;
Source:
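The compare-and-movemask approach from the answer can be wrapped in a small function, sketched below. The name `any_nonzero8` and the scalar fallback are assumptions for illustration; the AVX path is only compiled when AVX is enabled at build time (for example `-mavx` with GCC/Clang or `/arch:AVX` with MSVC).

```c
#include <stdbool.h>
#if defined(__AVX__)
#include <immintrin.h>          /* AVX intrinsics */
#endif

/* Hypothetical helper: true if any of the 8 floats at p is non-zero.
   -0.0f compares equal to zero; NaN counts as non-zero. */
bool any_nonzero8(const float *p)
{
#if defined(__AVX__)
    __m256 v = _mm256_loadu_ps(p);
    /* Each lane becomes all-ones where the value equals 0.0f. */
    __m256 is_zero = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_EQ_OQ);
    /* Collect the top bit of each lane: bit i is set if lane i is zero. */
    int mask = _mm256_movemask_ps(is_zero);
    return mask != 0xff;        /* some lane was not zero */
#else
    /* Scalar fallback with the same semantics as the AVX path. */
    for (int i = 0; i < 8; i++)
        if (!(p[i] == 0.0f))
            return true;
    return false;
#endif
}
```

Comparing against zero first matters because `_mm256_movemask_ps`, like `_mm256_testz_ps`, only looks at the sign bit of each lane; the comparison turns "is zero" into an all-ones lane whose sign bit is set.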

Finding lists of prime numbers with SIMD - SSE/AVX

余生长醉 submitted on 2020-01-21 05:26:05
Question: I'm curious if anyone has advice on how to use SIMD to find lists of prime numbers. In particular, I'm interested in how to do this with SSE/AVX. The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to do division that works well for a vector/scalar: "Division by Invariant Integers Using Multiplication" http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556 Each time I
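To illustrate the multiply-and-shift idea the question refers to, here is a scalar sketch of trial division that replaces each division by a multiplication with a precomputed reciprocal. It is a simplified 16-bit variant, not the general algorithm from the cited paper and not the asker's SSE code; the names `recip16`, `recip16_make`, `recip16_div`, and `count_primes16` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical type: precomputed reciprocal of a divisor d, 1 <= d <= 65535. */
typedef struct { uint64_t m; uint32_t d; } recip16;

recip16 recip16_make(uint32_t d)
{
    /* m = floor(2^32 / d) + 1; this is the only real division per divisor. */
    recip16 r = { (UINT64_C(1) << 32) / d + 1, d };
    return r;
}

/* Quotient n / d for n <= 65535, using one multiply and one shift.
   Exact in this range because the rounding error stays below 1/d. */
uint32_t recip16_div(uint32_t n, recip16 r)
{
    return (uint32_t)((n * r.m) >> 32);
}

/* Counts primes in [2, limit] by trial division; limit is clamped to 65535. */
unsigned count_primes16(uint32_t limit)
{
    unsigned char composite[65536] = {0};
    unsigned count = 0;
    if (limit > 65535)
        limit = 65535;
    for (uint32_t d = 2; d * d <= limit; d++) {
        if (composite[d])
            continue;                       /* only prime divisors are needed */
        const recip16 r = recip16_make(d);  /* reused for every candidate n */
        for (uint32_t n = d * d; n <= limit; n++)
            if (recip16_div(n, r) * d == n) /* d divides n exactly */
                composite[n] = 1;
    }
    for (uint32_t n = 2; n <= limit; n++)
        count += !composite[n];
    return count;
}
```

The divisor is invariant across the inner loop, so its reciprocal is computed once and every candidate costs only multiplications. That inner loop is the part that would map onto vector lanes in an SSE/AVX version.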

can someone explain this SSE BigNum comparison?

爷,独闯天下 submitted on 2020-01-17 10:12:43
Question: If you look at this answer, the author manages to create a compact comparison algorithm for 2 integer bignums, stored in 2 SSE registers. I am not following it too well :) What I did so far: if
    l = a < b = {a[i] < b[i] ? ~0 : 0}
and
    e = a == b = {a[i] == b[i] ? ~0 : 0}
then
    a < b == l[3] v e[3]l[2] v e[3]e[2]l[1] v e[3]e[2]e[1]l[0]
But this does not seem to be what the author is doing. What am I missing? What need is there for a greater-than comparison?
Answer 1: I've overlooked than the answer
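The asker's formula can be checked with a scalar sketch that builds the per-limb masks `l` and `e` and combines them exactly as written. This illustrates the formula in the question, not the referenced answer's SSE code; the name `bignum4_less` and the choice of unsigned 32-bit limbs (limb 3 most significant) are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a < b for 4-limb bignums, limb 3 most significant. */
bool bignum4_less(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t l[4], e[4];
    for (int i = 0; i < 4; i++) {
        l[i] = a[i] <  b[i] ? ~UINT32_C(0) : 0;   /* limb-wise "less" mask  */
        e[i] = a[i] == b[i] ? ~UINT32_C(0) : 0;   /* limb-wise "equal" mask */
    }
    /* a < b if the most significant differing limb is smaller in a:
       l[3] v e[3]l[2] v e[3]e[2]l[1] v e[3]e[2]e[1]l[0] */
    uint32_t r = l[3]
               | (e[3] & l[2])
               | (e[3] & e[2] & l[1])
               | (e[3] & e[2] & e[1] & l[0]);
    return r != 0;
}
```

Each term says "all higher limbs are equal and this limb is smaller", which is ordinary lexicographic comparison starting from the most significant limb.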

Select unique/deduplication in SSE/AVX

我与影子孤独终老i submitted on 2020-01-13 08:27:11
Question:
Problem: Are there any computationally feasible approaches to intra-register deduplication of a set of integers using x86 SIMD instructions?
Example: We have a 4-tuple register R1 = {3, 9, 2, 9}, and wish to obtain register R2 = {3, 9, 2, NULL}.
Restrictions:
Stability. Preservation of the input order is of no significance.
Output. However, any removed values/NULLs must be at the beginning and/or end of the register:
    {null, 1, 2, 3} - OK
    {1, 2, null, null} - OK
    {null, 2, null, null} - OK

parallelizing matrix multiplication through threading and SIMD

。_饼干妹妹 submitted on 2020-01-13 08:13:10
Question: I am trying to speed up matrix multiplication on a multicore architecture. To this end, I try to use threads and SIMD at the same time, but my results are not good. I measure the speedup over sequential matrix multiplication:
    void sequentialMatMul(void* params)
    {
        cout << "SequentialMatMul started.";
        int i, j, k;
        for (i = 0; i < N; i++)
        {
            for (k = 0; k < N; k++)
            {
                for (j = 0; j < N; j++)
                {
                    X[i][j] += A[i][k] * B[k][j];
                }
            }
        }
        cout << "\nSequentialMatMul finished.";
    }
I tried to add threading and SIMD
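The asker's own threaded/SIMD version is cut off in this excerpt, so the following is only a sketch of how the innermost `j` loop of the sequential code could be vectorised with SSE. It assumes `float` elements and flat row-major arrays passed as parameters (the original uses globals `X`, `A`, `B` of unknown type); the name `matmul_ikj_sse` is hypothetical, and threading is left out.

```c
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>          /* SSE intrinsics */
#define HAVE_SSE 1
#endif

/* Hypothetical helper: X += A * B for n x n row-major float matrices. */
void matmul_ikj_sse(int n, const float *A, const float *B, float *X)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            const float aik = A[i * n + k];
            int j = 0;
#ifdef HAVE_SSE
            const __m128 va = _mm_set1_ps(aik);   /* broadcast A[i][k] */
            for (; j + 4 <= n; j += 4) {
                __m128 vb = _mm_loadu_ps(&B[k * n + j]);  /* B[k][j..j+3] */
                __m128 vx = _mm_loadu_ps(&X[i * n + j]);  /* X[i][j..j+3] */
                vx = _mm_add_ps(vx, _mm_mul_ps(va, vb));
                _mm_storeu_ps(&X[i * n + j], vx);
            }
#endif
            /* Scalar tail (or the whole row when SSE is unavailable). */
            for (; j < n; j++)
                X[i * n + j] += aik * B[k * n + j];
        }
    }
}
```

The i-k-j loop order from the question suits this well: for fixed `i` and `k`, the inner loop walks `B` and `X` along contiguous rows, so four neighbouring elements can be loaded and stored at once.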

Optimising an 1D heat equation using SIMD

那年仲夏 submitted on 2020-01-13 05:35:33
Question: I am using a CFD (computational fluid dynamics) code. I recently had the chance to see the Intel compiler use SSE in one of my loops, which made that loop nearly 2x faster. However, getting the compiler to use SSE and SIMD instructions seems to be a matter of luck; most of the time, the compiler does nothing. I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future. I made a simple 1D heat transfer code. It consists of two

Using STL vector with SIMD intrinsic data type

China☆狼群 submitted on 2020-01-13 03:14:13
Question: As the title reads, I am trying to use an STL vector with a SIMD intrinsic data type. I know it is not good practice due to the potential overhead of load/store, but I encountered a quite weird fault. Here is the code:
    #include "immintrin.h"
    #include <vector>
    #include <stdio.h>
    #define VL 8
    int main ()
    {
        std::vector<__m256> vec_1(10);
        std::vector<__m256> vec_2(10);
        float * tmp_1 = new float[VL];
        printf("vec_1[0]:\n");
        _mm256_storeu_ps(tmp_1, vec_1[0]); // seems to go as expected
        for (int i = 0;