simd

Efficient way of rotating a byte inside an AVX register

回眸只為那壹抹淺笑 submitted on 2019-12-05 06:44:39

Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8-byte block in a YMM register, I need to left-rotate 7 of its bytes. Each byte needs to be rotated one bit further left than the previous one: the first byte should be rotated 0 bits and the seventh should be rotated 6 bits. Currently, I have an implementation that does this [using the 1-bit rotate as an example here] by shifting the register 1 bit to the left and 7 to the right individually. I then use the blend operation (intrinsic
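A scalar reference for the rotation pattern described above may help fix ideas: within each 8-byte block, byte k is rotated left by k bits, and the rotate itself is emulated exactly as in the question, with a left shift, a right shift, and an OR. This is an illustrative sketch, not the asker's AVX code; the function names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate an 8-bit value left by n bits, emulated with two shifts and
 * an OR -- the scalar analogue of the shift/shift/blend approach. */
static uint8_t rotl8(uint8_t x, unsigned n)
{
    n &= 7;
    return n ? (uint8_t)((x << n) | (x >> (8 - n))) : x;
}

/* Apply the per-byte pattern: byte k is rotated left by k bits
 * (byte 0 by 0 bits, byte 6 by 6 bits), as the question describes. */
static void rotate_block(uint8_t bytes[8])
{
    for (unsigned k = 0; k < 8; ++k)
        bytes[k] = rotl8(bytes[k], k);
}
```

An AVX2 version would do the same thing lane-wise: one variable left shift, one variable right shift, then combine, with the per-byte shift counts coming from a constant vector.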

Does using mix of pxor and xorps affect performance?

删除回忆录丶 submitted on 2019-12-05 06:38:46

I've come across a fast CRC computation using a PCLMULQDQ implementation. I see that the authors mix pxor and xorps instructions heavily, as in the fragment below:

movdqa xmm10, [rk9]
movdqa xmm8, xmm0
pclmulqdq xmm0, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor xmm7, xmm8
xorps xmm7, xmm0
movdqa xmm10, [rk11]
movdqa xmm8, xmm1
pclmulqdq xmm1, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor xmm7, xmm8
xorps xmm7, xmm1

Is there any practical reason for this? A performance boost? If yes, what lies beneath it? Or is it just a sort of coding style, for fun?

Peter Cordes: TL:DR: it looks like maybe

CUDA: Avoiding serial execution on branch divergence

只谈情不闲聊 submitted on 2019-12-05 05:21:54

Question: Assume a CUDA kernel executed by a single warp (for simplicity) reaches an if-else statement, where 20 of the threads within the warp satisfy condition and 32 - 20 = 12 threads do not:

if (condition) {
    statement1;   // executed by 20 threads
} else {
    statement2;   // executed by 12 threads
}

According to the CUDA C Programming Guide: "A warp executes one common instruction at a time [...] if threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch"
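A toy cost model of the serialization the guide describes (purely illustrative; the cycle counts below are assumptions, not measurements): a divergent warp pays for both paths, a uniform warp for only one.

```c
#include <assert.h>

/* Cycles a single warp spends on the if/else, assuming each path has
 * a fixed cost. If any lane takes a path, the whole warp steps
 * through it; lanes on the other path are masked off but still wait. */
static int warp_cycles(int lanes_then, int lanes_else,
                       int cost_then, int cost_else)
{
    int cycles = 0;
    if (lanes_then > 0) cycles += cost_then;
    if (lanes_else > 0) cycles += cost_else;
    return cycles;
}
```

With 20 and 12 lanes on the two paths, the warp pays the full cost of both statements; had all 32 lanes agreed, it would pay for only one.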

How do you move 128-bit values between XMM registers?

旧时模样 submitted on 2019-12-05 04:45:15

A seemingly trivial problem in assembly: I want to copy the whole XMM0 register to XMM3. I've tried movdq xmm3, xmm0, but MOVDQ cannot be used to move values between two XMM registers. What should I do instead?

It's movapd, movaps, or movdqa:

movaps xmm3, xmm0

They all do the same thing, but there's a catch: movapd and movaps operate in the floating-point domain, while movdqa operates in the integer domain. Use the appropriate one according to your datatype to avoid domain-changing stalls. Also, there's no reason to use movapd: always use movaps instead, because movapd takes an extra byte to encode.
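In intrinsics code the same copy is just an assignment, and the compiler picks the appropriate move (or elides it entirely). A small sketch, assuming SSE2 (baseline on x86-64); the function name is ours:

```c
#include <assert.h>
#include <emmintrin.h>

/* Register-to-register copy at the C level; the compiler emits
 * movdqa/movaps as needed, or removes the copy under optimization. */
static __m128i copy_reg(__m128i a)
{
    __m128i b = a;
    return b;
}
```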

How do you initialize a SIMD vector with a range from 0 to N?

自古美人都是妖i submitted on 2019-12-05 04:34:54

Question: I have the following function I'm trying to write an AVX version for:

void hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length)
{
    size_t i, j, v, p;
    char temp;
    if (!salt_length) {
        return;
    }
    for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) {
        v %= salt_length;
        p += salt[v];
        j = (salt[v] + v + p) % i;
        temp = str[i];
        str[i] = str[j];
        str[j] = temp;
    }
}

I'm trying to vectorize v %= salt_length;. I want to initialize a vector that contains numbers from 0 to str
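The v sequence the loop consumes is simply 0, 1, 2, … reduced mod salt_length, so one way to picture the vectorization is a scalar stand-in for an iota vector plus a lane-wise modulo (the function name below is ours, not the poster's):

```c
#include <assert.h>

/* Fill out[0..n-1] with (start + k) % m -- the values v takes on
 * successive iterations. A SIMD version computes 8 or 16 lanes of
 * this at once, starting from a constant iota vector {0,1,2,...}. */
static void iota_mod(unsigned *out, unsigned start, unsigned n, unsigned m)
{
    for (unsigned k = 0; k < n; ++k)
        out[k] = (start + k) % m;
}
```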

Choice between aligned vs. unaligned x86 SIMD instructions

橙三吉。 submitted on 2019-12-05 03:21:53

There are generally two types of SIMD instructions:

A. Ones that work with aligned memory addresses, and will raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:

movaps xmm0, xmmword ptr [rax]
vmovaps ymm0, ymmword ptr [rax]
vmovaps zmm0, zmmword ptr [rax]

B. And ones that work with unaligned memory addresses, and will not raise such an exception:

movups xmm0, xmmword ptr [rax]
vmovups ymm0, ymmword ptr [rax]
vmovups zmm0, zmmword ptr [rax]

But I'm just curious: why would I want to shoot myself in the foot and use the aligned memory instructions
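One practical answer is that the aligned forms double as a free runtime assertion on your allocator. A small sketch of checking the invariant that movaps/vmovaps enforce in hardware (uses C11 aligned_alloc; the helper name is ours):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* True if p sits on a `boundary`-byte boundary -- the condition the
 * aligned load/store forms enforce with a #GP fault. */
static int is_aligned(const void *p, size_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}
```

After `float *buf = aligned_alloc(32, 1024);`, `is_aligned(buf, 32)` holds, so a vmovaps on buf is safe; an unexpected misalignment would fault immediately instead of silently running slower.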

Why does the OpenMP SIMD directive reduce performance?

末鹿安然 submitted on 2019-12-05 02:41:02

Question: I am learning how to use SIMD directives with OpenMP/Fortran. I wrote this simple code:

program loop
implicit none
integer :: i,j
real*8 :: x
x = 0.0
do i=1,10000
   do j=1,10000000
      x = x + 1.0/(1.0*i)
   enddo
enddo
print*, x
end program loop

When I compile this code and run it I get:

ifort -O3 -vec-report3 -xhost loop_simd.f90
loop_simd.f90(10): (col. 12) remark: LOOP WAS VECTORIZED
loop_simd.f90(9): (col. 7) remark: loop was not vectorized: not inner loop
time ./a.out
97876060.8355515
real 0m8
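For reference, the same kernel restated in C (our translation, not the poster's code): the inner loop is a floating-point sum reduction, which a compiler may only vectorize if it is allowed to reassociate the additions, e.g. via `!$omp simd reduction(+:x)` or fast-math-style flags.

```c
#include <assert.h>
#include <math.h>

/* C equivalent of the Fortran loops: accumulate 1.0/i, nj times for
 * each i. The loop-carried dependence on x is what makes this a
 * reduction rather than a trivially parallel loop. */
static double loop_sum(int ni, int nj)
{
    double x = 0.0;
    for (int i = 1; i <= ni; ++i)
        for (int j = 1; j <= nj; ++j)
            x += 1.0 / (1.0 * i);
    return x;
}
```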

Optimal uint8_t bitmap into an 8 x 32-bit SIMD "bool" vector

坚强是说给别人听的谎言 submitted on 2019-12-05 02:36:45

Question: As part of a compression algorithm, I am looking for the optimal way to achieve the following: I have a simple bitmap in a uint8_t, for example 01010011. What I want is a __m256i of the form: (0, maxint, 0, maxint, 0, 0, maxint, maxint). One way to achieve this is by shuffling a vector of 8 x maxint into a vector of zeros. But that first requires me to expand my uint8_t into the right shuffle bitmap. I am wondering if there is a better way?

Answer 1: Here is a solution (PaulR improved my solution,
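A scalar reference for the expansion, assuming the MSB-first lane order implied by the example above: bit k selects lane k, and set lanes become all-ones. SIMD answers replace this loop with a broadcast of the byte, a per-lane AND against the constant {128, 64, ..., 1}, and a compare against zero.

```c
#include <assert.h>
#include <stdint.h>

/* Expand an 8-bit bitmap into eight 32-bit boolean lanes, MSB first,
 * so 01010011 -> (0, ~0, 0, ~0, 0, 0, ~0, ~0) as in the question. */
static void expand_bitmap(uint8_t bits, uint32_t out[8])
{
    for (int k = 0; k < 8; ++k)
        out[k] = ((bits >> (7 - k)) & 1) ? 0xFFFFFFFFu : 0u;
}
```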

Fast 7x7 2D Median Filter in C / C++

我与影子孤独终老i submitted on 2019-12-05 02:34:22

Question: I'm trying to convert the following code from MATLAB to C++:

function data = process(data)
    data = medfilt2(data, [7 7], 'symmetric');
    mask = fspecial('gaussian', [35 35], 12);
    data = imfilter(data, mask, 'replicate', 'same');
    maximum = max(data(:));
    data = 1 ./ ( data/maximum );
    data(data > 10) = 16;
end

My problem is with medfilt2, which is a 2D median filter. I need it to support images of 10 bits per pixel and more. I have looked into OpenCV; it has a 5x5 median filter which supports 16
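A straightforward 16-bit reference for one output pixel (wide enough for the 10-bit data mentioned above; the function names are ours): sort the 49 samples in the 7x7 window and take the middle one. Fast implementations replace the sort with a sorting network or a running histogram, but this pins down the required result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u16(const void *a, const void *b)
{
    /* Both values fit in int after widening, so subtraction is safe. */
    return (int)*(const uint16_t *)a - (int)*(const uint16_t *)b;
}

/* Median of a 7x7 window: sort the 49 samples and return the 25th
 * smallest. Sorts the window in place. */
static uint16_t median49(uint16_t w[49])
{
    qsort(w, 49, sizeof w[0], cmp_u16);
    return w[24];
}
```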

parallelizing matrix multiplication through threading and SIMD

删除回忆录丶 submitted on 2019-12-05 02:16:53

I am trying to speed up matrix multiplication on a multicore architecture. To this end, I try to use threads and SIMD at the same time. But my results are not good. I measure the speedup over sequential matrix multiplication:

void sequentialMatMul(void* params) {
    cout << "SequentialMatMul started.";
    int i, j, k;
    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            for (j = 0; j < N; j++) {
                X[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    cout << "\nSequentialMatMul finished.";
}

I tried to add threading and SIMD to the matrix multiplication as follows:

void threadedSIMDMatMul(void* params) {
    bounds *args = (bounds*
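A sketch of the row partitioning such a threaded version typically uses (our illustration with flat row-major arrays; the i-k-j loop order follows the snippet above, but this is not the poster's exact code): each thread runs the same kernel over its own row range [lo, hi), so threads never write the same rows of X.

```c
#include <assert.h>

/* Multiply rows [lo, hi) of A by B into X; all matrices are n x n,
 * row-major. The i-k-j order keeps the inner loop streaming over
 * contiguous rows of B and X, which is what makes per-row SIMD (and
 * per-thread row slices) effective. */
static void matmul_rows(int lo, int hi, int n,
                        double *X, const double *A, const double *B)
{
    for (int i = lo; i < hi; ++i)
        for (int k = 0; k < n; ++k) {
            double a = A[i * n + k];
            for (int j = 0; j < n; ++j)
                X[i * n + j] += a * B[k * n + j];
        }
}
```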