SIMD

Is it possible to use SIMD instructions for replace?

被刻印的时光 ゝ submitted 2019-12-04 05:32:44
Question: I have a vector of int and I need to find and replace some elements with a specific value. The value to find and the replacement are the same for all elements. For example: replace 4 with 8 for all elements. I'm trying direct memory access in a loop in C++, but it is still too slow for me. Update: I'm working with an OpenCV Mat object on x86:

```cpp
for (int i = 0; i < labels.rows; ++i) {
    for (int j = 0; j < labels.cols; ++j) {
        int& label = labels.at<int>(i, j);
        if (label == oldValue) {
            label = newValue;
        }
    }
}
```

The Mat.at() function just returns the value by

_mm_testc_ps and _mm_testc_pd vs _mm_testc_si128

喜你入骨 submitted 2019-12-04 05:29:08
Question: As you know, the first two are AVX intrinsics and the third is an SSE4.1 intrinsic. Both sets of intrinsics can be used to check two floating-point vectors for equality. My specific use case is: _mm_cmpeq_ps or _mm_cmpeq_pd, followed by _mm_testc_ps or _mm_testc_pd on the result, with an appropriate mask. But AVX provides equivalents for "legacy" intrinsics, so I might be able to use _mm_testc_si128 after a cast of the result to __m128i. My questions are, which of the two use

Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled

社会主义新天地 submitted 2019-12-04 04:52:36
Question: I am trying to understand the benefit of SIMD vectorization and wrote a simple demonstrator to see what speed gain an algorithm leveraging vectorization (SIMD) would have over a scalar one. Here are the 2 algorithms: Alg_A - no vector support:

```c
#include <stdio.h>
#define SIZE 1000000000

int main() {
    printf("Algorithm with NO vector support\n");
    int a[] = {1, 2, 3, 4};
    int b[] = {5, 6, 7, 8};
    int i = 0;
    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
        a[0] = a[0] + b
```

selectively xor-ing elements of a list with AVX2 instructions

十年热恋 submitted 2019-12-04 04:35:09
Question: I want to speed up the following operation with AVX2 instructions, but I was not able to find a way to do so. I am given a large array uint64_t data[100000] of uint64_t's and an array unsigned char indices[100000] of bytes. I want to output an array uint64_t Out[256] where the i-th value is the xor of all data[j] such that indices[j] = i. A straightforward implementation of what I want is this:

```c
uint64_t Out[256] = {0};  // initialize output array
for (i = 0; i < 100000; i++) {
    Out[indices[i]] ^= data[i];
}
```

How can I set __m128i without using any SSE instruction?

二次信任 submitted 2019-12-04 04:13:54
I have many functions which use the same constant __m128i values. For example:

```cpp
const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);
```

So I want to store all these constants in one place. But there is a problem: I check for available CPU extensions at run time. If the CPU doesn't support, for example, SSE (or AVX), the program will crash during constant initialization. So is it possible to initialize these constants without using SSE?

Fast byte-wise replace if

放肆的年华 submitted 2019-12-04 03:39:43
I have a function that copies binary data from one area to another, but only where the bytes differ from a specific value. Here is a code sample:

```c
void copy_if(char* src, char* dest, size_t size, char ignore) {
    for (size_t i = 0; i < size; ++i) {
        if (src[i] != ignore)
            dest[i] = src[i];
    }
}
```

The problem is that this is too slow for my current need. Is there a way to obtain the same result in a faster way? Update: Based on answers I tried two new implementations:

```c
void copy_if_vectorized(const uint8_t* src, uint8_t* dest, size_t size, char ignore) {
    for (size_t i = 0; i < size; ++i) {
        char
```

what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256

本小妞迷上赌 submitted 2019-12-04 03:21:47
Question: I had been using _mm256_lddqu_si256 based on an example I found online. Later I discovered _mm256_loadu_si256. The Intel Intrinsics Guide only states that the lddqu version may perform better when crossing a cache-line boundary. What might be the advantages of loadu? In general, how are these functions different? Answer 1: There's no reason to ever use _mm256_lddqu_si256; consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons, as x86 evolved towards having better

adding the components of an SSE register

烂漫一生 submitted 2019-12-04 00:29:44
I want to add the four components of an SSE register to get a single float. This is how I do it now:

```cpp
float a[4];
_mm_storeu_ps(a, foo128);
float x = a[0] + a[1] + a[2] + a[3];
```

Is there an SSE instruction that directly achieves this? You could probably use the HADDPS SSE3 instruction, or its compiler intrinsic _mm_hadd_ps. For example, see http://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.80).aspx If you have two registers v1 and v2:

```cpp
v = _mm_hadd_ps(v1, v2);
v = _mm_hadd_ps(v, v);
```

Now v[0] contains the sum of v1's components, and v[1] contains the sum of v2's components. If you want your

How do I initialize a SIMD vector with a range from 0 to N?

安稳与你 submitted 2019-12-03 20:50:21
I have the following function I'm trying to write an AVX version for:

```c
void hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length) {
    size_t i, j, v, p;
    char temp;
    if (!salt_length) {
        return;
    }
    for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) {
        v %= salt_length;
        p += salt[v];
        j = (salt[v] + v + p) % i;
        temp = str[i];
        str[i] = str[j];
        str[j] = temp;
    }
}
```

I'm trying to vectorize v %= salt_length;. I want to initialize a vector that contains the numbers from 0 to str_length in order to use SVML's _mm_rem_epu64 to calculate v for each loop iteration. How do I

CUDA: Avoiding serial execution on branch divergence

这一生的挚爱 submitted 2019-12-03 20:48:08
Assume a CUDA kernel executed by a single warp (for simplicity) reaches an if-else statement, where 20 of the threads within the warp satisfy condition and 32 - 20 = 12 threads do not:

```c
if (condition) {
    statement1;   // executed by 20 threads
} else {
    statement2;   // executed by 12 threads
}
```

According to the CUDA C Programming Guide: "A warp executes one common instruction at a time [...] if threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads