simd | 易学教程

What are the 128-bit to 512-bit registers used for?

Question: After looking at a table of registers in the x86/x64 architecture, I noticed that there's a whole section of 128-, 256-, and 512-bit registers that I've never seen used in assembly or in decompiled C/C++ code: XMM(0-15) for 128, YMM(0-15) for 256, and ZMM(0-31) for 512. After doing a bit of digging, what I've gathered is that you have to use two 64-bit operations to perform math on a 128-bit number, instead of using the generic add, sub, mul, div operations. If this is the case, then what …
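As a rough illustration of what these registers are typically used for (this is not taken from the original thread): an XMM register is usually treated as several independent lanes rather than as one 128-bit number. The sketch below assumes an x86-64 target with SSE2; the helper name add4 is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Adds four pairs of 32-bit integers with one packed add (paddd).
// add4 is a hypothetical helper name used only for this illustration.
std::array<int32_t, 4> add4(const std::array<int32_t, 4>& a,
                            const std::array<int32_t, 4>& b) {
    // Each load fills one 128-bit XMM register with four 32-bit lanes.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    __m128i sum = _mm_add_epi32(va, vb);  // four independent 32-bit adds
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), sum);
    return out;
}
```

No carry propagates between lanes, which is why these instructions do not behave like a single 128-bit add.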

Efficient sse shuffle mask generation for left-packing byte elements

Question: What would be an efficient way to optimize the following code with SSE?

uint16_t change1 = ... ;
uint8_t* pSrc = ... ;
uint8_t* pDest = ... ;
if(change1 & 0x0001) *pDest++ = pSrc[0];
if(change1 & 0x0002) *pDest++ = pSrc[1];
if(change1 & 0x0004) *pDest++ = pSrc[2];
if(change1 & 0x0008) *pDest++ = pSrc[3];
if(change1 & 0x0010) *pDest++ = pSrc[4];
if(change1 & 0x0020) *pDest++ = pSrc[5];
if(change1 & 0x0040) *pDest++ = pSrc[6];
if(change1 & 0x0080) *pDest++ = pSrc[7];
if(change1 & 0x0100) *pDest …

How to reverse an __m128 type variable?

Question: I know this should be a Googling question, but I just cannot find the answer. Say I have an __m128 variable a, whose content is a[0], a[1], a[2], a[3]. Is there a single function that can reverse it to be a[3], a[2], a[1], a[0]?

Answer 1: Use _mm_shuffle_ps(). This instruction was already available in SSE and can gather four 32-bit components into a single vector by combining two arbitrary 32-bit components from each of the two input vectors. How to create the mask using the macro _MM_SHUFFLE() …
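The answer's suggestion can be sketched as follows. Passing the same vector as both inputs and choosing the mask _MM_SHUFFLE(0, 1, 2, 3) reverses the lane order; the wrapper name reverse_ps is hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Reverses the four floats of v: {v0, v1, v2, v3} -> {v3, v2, v1, v0}.
// reverse_ps is a hypothetical helper name.
inline __m128 reverse_ps(__m128 v) {
    // _MM_SHUFFLE(z, y, x, w): result lanes 0 and 1 take elements w and x of
    // the first operand, lanes 2 and 3 take elements y and z of the second.
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```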

SIMD: Accumulate Adjacent Pairs

Question: I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me. I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.

template <typename T, typename U> inline void accumulate_2x2_x_pass( T* channel, U* accum, const size_t sx, const size_t sy, const size_t …
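The excerpt cuts off before the function body, so the following is only a sketch of the general idea of summing adjacent pairs with SSE2, not the asker's code. It assumes 8-bit input pixels widened to 16-bit sums and a little-endian x86 target; the name sum_adjacent_pairs_u8 and its parameters are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Computes accum[i] = channel[2*i] + channel[2*i + 1] for i in [0, n_pairs).
// Assumption: 8-bit input, 16-bit output. The function name is hypothetical.
void sum_adjacent_pairs_u8(const uint8_t* channel, uint16_t* accum, size_t n_pairs) {
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    size_t i = 0;
    for (; i + 8 <= n_pairs; i += 8) {
        // 16 input bytes viewed as eight 16-bit lanes (even byte low, odd byte high).
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(channel + 2 * i));
        __m128i even = _mm_and_si128(v, low_byte);  // bytes 0, 2, 4, ...
        __m128i odd = _mm_srli_epi16(v, 8);         // bytes 1, 3, 5, ...
        _mm_storeu_si128(reinterpret_cast<__m128i*>(accum + i),
                         _mm_add_epi16(even, odd));
    }
    for (; i < n_pairs; ++i)  // scalar tail for the remaining pairs
        accum[i] = static_cast<uint16_t>(channel[2 * i] + channel[2 * i + 1]);
}
```

Widening to 16 bits before adding avoids overflow, since two 8-bit values sum to at most 510.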

Compiling library with SSE2 and AVX2

Question: I am using VS2015 to compile a library that has both SSE2 instructions and AVX2 instructions (the latter are only used if detected in the CPU). If I compile the library with /arch:AVX2 but only call the SSE2 instructions, I get "illegal instruction" (on _mm_set1_epi32, the first SSE2 instruction called). However, if I compile the lib with /arch:SSE2, it works fine when calling the SSE2 instructions. Are the arch settings mutually exclusive? If not, how should this be fixed? I have attempted both as a shared …

Why is the AVX dot product slower than native C++ code?

Question: I have the following AVX and native code:

__forceinline double dotProduct_2(const double* u, const double* v)
{
    _mm256_zeroupper();
    __m256d xy = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
    __m256d temp = _mm256_hadd_pd(xy, xy);
    __m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1));
    return dotproduct.m128d_f64[0];
}

__forceinline double dotProduct_1(const D3& a, const D3& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

Is there an intrinsic function to zero out the last n bytes of a __m128i vector?

Question: Given n, I want to zero out the last n bytes of a __m128i vector. For instance, consider the following __m128i vector:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111

After zeroing out the last n = 4 bytes, the vector should look like:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 00000000 00000000 00000000 00000000

Is there an SSE …
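The excerpt does not include an answer. One possible approach, offered here only as a sketch, is to AND the vector with a mask loaded at a variable offset from a small lookup table. It assumes "last" means the highest-indexed bytes in memory order and that 0 <= n <= 16; the name zero_last_n_bytes is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Zeroes the n highest-indexed bytes of v, for 0 <= n <= 16 (assumption).
// zero_last_n_bytes is a hypothetical helper name.
inline __m128i zero_last_n_bytes(__m128i v, unsigned n) {
    // 16 bytes of 0xFF followed by 16 bytes of 0x00. Loading 16 bytes at
    // offset n yields (16 - n) bytes of 0xFF followed by n bytes of 0x00.
    static const uint8_t window[32] = {
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
    __m128i mask = _mm_loadu_si128(reinterpret_cast<const __m128i*>(window + n));
    return _mm_and_si128(v, mask);
}
```

The table lookup avoids a 17-way branch, because the byte-shift intrinsics such as _mm_srli_si128 only accept a compile-time constant shift count.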

Implementation of bit rotate operators using SIMD in CUDA

Question: I know that StackOverflow is not meant for asking other people for code, but hear me out. I am trying to implement some AES functions in CUDA C++ device code. While trying to implement the left bytewise rotate operator, I was disconcerted to see that there was no native SIMD intrinsic for that. So I began a naive implementation, but... it's huge, and while I haven't tried it yet, it just won't be fast because of the expensive unpacking/packing... So, is there a way to do a per byte bit …

Find min/max value from a __m128i

Question: I want to find the minimum/maximum value in an array of bytes using SIMD operations. So far I was able to go through the array and store the minimum/maximum value into a __m128i variable, but it means that the value I am looking for is mixed among others (15 others, to be exact). I've found these discussions here and here for integers, and this page for floats, but I don't understand how _mm_shuffle* works. So my questions are: What SIMD operations do I have to perform in order to extract the …
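The excerpt ends before any answer. A common way to finish such a reduction, sketched here under the assumption that the bytes are unsigned and only SSE2 is available, is to fold the vector in half repeatedly with byte shifts instead of shuffles. The names hmin_epu8 and hmax_epu8 are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Horizontal minimum of 16 unsigned bytes (hypothetical helper name).
inline uint8_t hmin_epu8(__m128i v) {
    v = _mm_min_epu8(v, _mm_srli_si128(v, 8));  // low 8 bytes hold min of halves
    v = _mm_min_epu8(v, _mm_srli_si128(v, 4));  // low 4 bytes
    v = _mm_min_epu8(v, _mm_srli_si128(v, 2));  // low 2 bytes
    v = _mm_min_epu8(v, _mm_srli_si128(v, 1));  // byte 0 holds the overall min
    return static_cast<uint8_t>(_mm_cvtsi128_si32(v) & 0xFF);
}

// Horizontal maximum of 16 unsigned bytes (hypothetical helper name).
inline uint8_t hmax_epu8(__m128i v) {
    v = _mm_max_epu8(v, _mm_srli_si128(v, 8));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 4));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 2));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 1));
    return static_cast<uint8_t>(_mm_cvtsi128_si32(v) & 0xFF);
}
```

The zeros shifted in from the top only contaminate the upper bytes, which are never read, so the result in byte 0 stays correct for both min and max.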
