simd | 易学教程

SIMD the following code

Question: How do I SIMDize the following code in C (using SIMD intrinsics, of course)? I am having trouble understanding SIMD intrinsics and this would help a lot:

    int sum_naive( int n, int *a )
    {
        int sum = 0;
        for( int i = 0; i < n; i++ )
            sum += a[i];
        return sum;
    }

Answer 1: Here's a fairly straightforward implementation (warning: untested code):

    int32_t sum_array(const int32_t a[], const int n)
    {
        __m128i vsum = _mm_set1_epi32(0); // initialise vector of four partial 32 bit sums
        int32_t sum;
        int i;
        for (i =
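The answer excerpt is cut off, so here is a minimal sketch of how such a sum could be written with SSE2 intrinsics. It is not the answerer's exact code. The function name `sum_sse2` is made up for this illustration, and the sketch assumes an x86 target with SSE2; it uses unaligned loads so that `a` needs no particular alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sums n 32-bit integers, four lanes at a time. */
int32_t sum_sse2(const int32_t *a, int n)
{
    __m128i vsum = _mm_setzero_si128();  /* four partial 32-bit sums */
    int i = 0;

    /* Main loop: add four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)&a[i]);  /* unaligned load */
        vsum = _mm_add_epi32(vsum, v);
    }

    /* Horizontal reduction: fold the upper half onto the lower half, twice. */
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    int32_t sum = _mm_cvtsi128_si32(vsum);

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

The scalar tail loop lets the function accept any `n`, not just multiples of four.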

Intel SIMD - How can I check if an __m256* contains any non-zero values

Question: I am using the Microsoft Visual Studio compiler. I am trying to find out whether a 256-bit vector contains any non-zero values. I have tried res_simd = ! _mm256_testz_ps(*pSrc1, *pSrc1); but it does not work.

Answer 1: _mm256_testz_ps only tests the sign bits. To test the values, you need to compare against 0 and then extract the resulting mask, e.g.

    __m256 vcmp = _mm256_cmp_ps(*pSrc1, _mm256_set1_ps(0.0f), _CMP_EQ_OQ);
    int mask = _mm256_movemask_ps(vcmp);
    bool any_nz = mask != 0xff;
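As a self-contained illustration, the answer's three lines can be wrapped in a small function. This is a sketch under assumptions: the name `any_nonzero_ps` and the `TARGET_AVX` macro are hypothetical, the function takes a pointer to eight floats instead of an `__m256`, and the GCC/Clang `target` attribute is used only so the code builds without a global AVX flag (with MSVC, enable AVX through `/arch:AVX`). The CPU must support AVX at run time.

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stdbool.h>

/* Hypothetical macro: enable AVX for one function on GCC/Clang. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Returns true if any of the 8 floats at p compares unequal to zero.
   NaN counts as non-zero; -0.0f counts as zero. */
TARGET_AVX
bool any_nonzero_ps(const float *p)
{
    __m256 v    = _mm256_loadu_ps(p);                                /* unaligned load */
    __m256 vcmp = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_EQ_OQ); /* lane == 0 ? */
    int mask    = _mm256_movemask_ps(vcmp);                          /* one bit per lane */
    return mask != 0xff;                                             /* some lane was not equal */
}
```

Because the comparison is a floating-point equality test, a lane holding -0.0f is treated as zero. If "non-zero" should mean "any bit set", a bitwise test on the raw bits would be needed instead.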

Finding lists of prime numbers with SIMD - SSE/AVX

Question: I'm curious if anyone has advice on how to use SIMD to find lists of prime numbers. In particular, I'm interested in how to do this with SSE/AVX. The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to do division which works well for a vector/scalar: "Division by Invariant Integers Using Multiplication" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556). Each time I

can someone explain this SSE BigNum comparison?

Question: If you look at this answer, the author manages to create a compact comparison algorithm for 2 integer bignums, stored in 2 SSE registers. I am not following it too well :) What I did so far: if

    l = a < b = {a[i] < b[i] ? ~0 : 0}

and

    e = a == b = {a[i] == b[i] ? ~0 : 0}

then

    a < b == l[3] v e[3]l[2] v e[3]e[2]l[1] v e[3]e[2]e[1]l[0]

But this does not seem to be what the author is doing. What am I missing? What need is there for a greater than comparison?

Answer 1: I've overlooked that the answer
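One way a greater-than mask can be useful is sketched below; this is an illustration of the idea and not necessarily the code of the referenced answer, which is not shown in this excerpt. The per-limb "less" and "greater" results are extracted as 4-bit integers. The two masks never share a set bit, and the highest set bit in either marks the most significant limb that differs, so `a < b` exactly when the "less" mask is the larger integer. The function name `bignum128_less` and the layout (four unsigned 32-bit limbs, limb 0 least significant) are assumptions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: returns 1 if a < b, where a and b are 128-bit
   unsigned integers stored as four 32-bit limbs, limb 0 least significant. */
int bignum128_less(const uint32_t a[4], const uint32_t b[4])
{
    /* SSE2 only has signed 32-bit compares; flipping the top bit of each
       limb turns an unsigned comparison into a signed one. */
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    __m128i va = _mm_xor_si128(_mm_loadu_si128((const __m128i *)a), bias);
    __m128i vb = _mm_xor_si128(_mm_loadu_si128((const __m128i *)b), bias);

    /* Bit i of each mask corresponds to limb i. */
    int lt = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmplt_epi32(va, vb)));
    int gt = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(va, vb)));

    /* The most significant differing limb decides the result. */
    return lt > gt;
}
```

This computes the same result as the formula in the question, but replaces the chain of AND/OR terms with a single integer comparison of the two masks.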

Select unique/deduplication in SSE/AVX

Question:

Problem
Are there any computationally feasible approaches to intra-register deduplication of a set of integers using x86 SIMD instructions?

Example
We have a 4-tuple register R1 = {3, 9, 2, 9}, and wish to obtain register R2 = {3, 9, 2, NULL}.

Restrictions
Stability. Preservation of the input order is of no significance.
Output. However, any removed values/NULLs must be at the beginning and/or end of the register:
    {null, 1, 2, 3} - OK
    {1, 2, null, null} - OK
    {null, 2, null, null} - OK
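As a possible first step only, the sketch below finds which lanes of a 4 x 32-bit register repeat an earlier lane. It does not perform the compaction that moves the removed slots to the ends of the register. The function name `duplicate_mask` is hypothetical, and the approach (broadcast each lane, compare, keep only later lanes) is one simple option assumed for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns a 4-bit mask in which bit i is set
   when lane i of v equals some lane j with j < i. */
int duplicate_mask(__m128i v)
{
    /* Broadcast lanes 0, 1 and 2 across a whole register each. */
    __m128i b0 = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));
    __m128i b1 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 1, 1, 1));
    __m128i b2 = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2));

    /* Compare every lane with the broadcast value, one result bit per lane. */
    int m0 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, b0)));
    int m1 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, b1)));
    int m2 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, b2)));

    /* Keep only matches in lanes after the broadcast lane. */
    return (m0 & 0xE) | (m1 & 0xC) | (m2 & 0x8);
}
```

For the question's example, built with `_mm_setr_epi32(3, 9, 2, 9)`, the result is 0x8: only lane 3 repeats an earlier value. Such a mask could then index a table of shuffle patterns to do the actual compaction.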

parallelizing matrix multiplication through threading and SIMD

Question: I am trying to speed up matrix multiplication on a multicore architecture. To this end, I try to use threads and SIMD at the same time. But my results are not good. I measure the speedup over sequential matrix multiplication:

    void sequentialMatMul(void* params)
    {
        cout << "SequentialMatMul started.";
        int i, j, k;
        for (i = 0; i < N; i++) {
            for (k = 0; k < N; k++) {
                for (j = 0; j < N; j++) {
                    X[i][j] += A[i][k] * B[k][j];
                }
            }
        }
        cout << "\nSequentialMatMul finished.";
    }

I tried to add threading and SIMD
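The question's own SIMD attempt is cut off, so the following is only a sketch of how the innermost j loop of the sequential version could be vectorized with SSE. It is not the asker's code. Assumptions: the matrices hold `float` values in row-major order in flat arrays (the excerpt does not show the element type), and the name `matmul_sse` is hypothetical. Threading is left out.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Hypothetical helper: X += A * B for n x n row-major float matrices,
   using the same i-k-j loop order as the sequential version. */
void matmul_sse(size_t n, const float *A, const float *B, float *X)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            const __m128 a = _mm_set1_ps(A[i * n + k]);  /* broadcast A[i][k] */
            size_t j = 0;

            /* Four columns of X and B per iteration. */
            for (; j + 4 <= n; j += 4) {
                __m128 b = _mm_loadu_ps(&B[k * n + j]);
                __m128 x = _mm_loadu_ps(&X[i * n + j]);
                x = _mm_add_ps(x, _mm_mul_ps(a, b));
                _mm_storeu_ps(&X[i * n + j], x);
            }

            /* Scalar tail when n is not a multiple of 4. */
            for (; j < n; j++)
                X[i * n + j] += A[i * n + k] * B[k * n + j];
        }
    }
}
```

The i-k-j order suits this kind of vectorization because the inner loop walks contiguous rows of both `B` and `X`, so each vector load and store touches adjacent memory.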

Optimising an 1D heat equation using SIMD

Question: I am using a CFD (computational fluid dynamics) code. I recently had the chance to see the Intel Compiler use SSE in one of my loops, which made that loop nearly 2x faster. However, the use of SSE and SIMD instructions seems more like luck. Most of the time, the compiler does nothing. I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future. I made a simple 1D heat transfer code. It consists of two

Using STL vector with SIMD intrinsic data type

Question: As the title reads, I am trying to use an STL vector with a SIMD intrinsic data type. I know it is not good practice due to the potential overhead of load/store, but I encountered a quite weird fault. Here is the code:

    #include "immintrin.h"
    #include <vector>
    #include <stdio.h>
    #define VL 8
    int main () {
        std::vector<__m256> vec_1(10);
        std::vector<__m256> vec_2(10);
        float * tmp_1 = new float[VL];
        printf("vec_1[0]:\n");
        _mm256_storeu_ps(tmp_1, vec_1[0]); // seems to go as expected
        for (int i = 0;