sse

What are the 128-bit to 512-bit registers used for?

不羁的心 submitted on 2021-02-04 07:12:39
Question: After looking at a table of registers in the x86/x64 architecture, I noticed that there's a whole section of 128-, 256-, and 512-bit registers that I've never seen used in assembly or in decompiled C/C++ code: XMM(0-15) for 128-bit, YMM(0-15) for 256-bit, ZMM(0-31) for 512-bit. After doing a bit of digging, what I've gathered is that you have to use two 64-bit operations to perform math on a 128-bit number, instead of using the generic add, sub, mul, div operations. If this is the case, then what
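These registers are used by the SIMD instruction sets (SSE/AVX/AVX-512), which operate on several packed elements in parallel rather than on one very wide integer. A minimal sketch, assuming SSE2 and a length that is a multiple of four (add_arrays_sse is an illustrative name, not anything from the question):

#include <immintrin.h>

// Illustrative helper: adds four pairs of 32-bit ints per iteration,
// all inside a single XMM register.
void add_arrays_sse(int* dst, const int* a, const int* b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_add_epi32(va, vb));
    }
}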

Reverse byte order in XMM or YMM register?

你说的曾经没有我的故事 submitted on 2021-02-04 06:30:06
Question: Let's say I want to reverse the byte order of a very large byte array. I can do this the slow way using the main registers, but I would like to speed it up using the XMM or YMM registers. Is there a way to reverse the byte order in an XMM or YMM register? Answer 1: Yes, use SSSE3 _mm_shuffle_epi8 or AVX2 _mm256_shuffle_epi8 to shuffle bytes within 16-byte AVX2 "lanes". Depending on the shuffle control vector, you can swap pairs of bytes, reverse 4-byte units, or reverse 8-byte units. Or reverse all
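As a concrete illustration of the answer's SSSE3 suggestion, a small sketch that reverses all 16 bytes of one XMM register (the control vector would differ for the byte-swap and word-reverse variants mentioned above):

#include <immintrin.h>

// pshufb: result byte i = input byte selected by control byte i,
// so a descending control vector reverses the whole register.
__m128i reverse_bytes_128(__m128i v) {
    const __m128i rev = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7,  6,  5,  4,  3,  2, 1, 0);
    return _mm_shuffle_epi8(v, rev);
}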

Efficient SSE shuffle mask generation for left-packing byte elements

人盡茶涼 submitted on 2021-02-04 06:19:05
Question: What would be an efficient way to optimize the following code with SSE?

uint16_t change1 = ... ;
uint8_t* pSrc = ... ;
uint8_t* pDest = ... ;
if(change1 & 0x0001) *pDest++ = pSrc[0];
if(change1 & 0x0002) *pDest++ = pSrc[1];
if(change1 & 0x0004) *pDest++ = pSrc[2];
if(change1 & 0x0008) *pDest++ = pSrc[3];
if(change1 & 0x0010) *pDest++ = pSrc[4];
if(change1 & 0x0020) *pDest++ = pSrc[5];
if(change1 & 0x0040) *pDest++ = pSrc[6];
if(change1 & 0x0080) *pDest++ = pSrc[7];
if(change1 & 0x0100) *pDest
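One way to left-pack bytes under a bitmask, sketched here under the assumption of an AVX-512 VBMI2 capable CPU (Ice Lake or later); note it always stores 8 bytes, so pDest needs that much writable slack past the packed output. Older CPUs typically use a 256-entry lookup table of pshufb control vectors instead. The helper name and return value are illustrative, not from the question.

#include <immintrin.h>
#include <stdint.h>

// Packs the bytes of pSrc[0..7] selected by the low 8 bits of change1
// and returns how many bytes were kept.
int left_pack_8(uint8_t* pDest, const uint8_t* pSrc, uint16_t change1) {
    __m128i src    = _mm_loadl_epi64((const __m128i*)pSrc);            // low 8 bytes
    __m128i packed = _mm_maskz_compress_epi8((__mmask16)(change1 & 0xFF), src);
    _mm_storel_epi64((__m128i*)pDest, packed);   // stores 8 bytes; the unused tail is zero
    return _mm_popcnt_u32(change1 & 0xFF);
}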

How to reverse an __m128 type variable?

心已入冬 submitted on 2021-02-04 05:37:04
Question: I know this should be a Googling question, but I just cannot find the answer. Say I have an __m128 variable a whose content is a[0], a[1], a[2], a[3]. Is there a single function that can reverse it to be a[3], a[2], a[1], a[0]? Answer 1: Use _mm_shuffle_ps(). This instruction was already available in SSE and can gather 4 32-bit components into a single vector by combining two arbitrary 32-bit components from each of the two input vectors. How to create the mask using the macro _MM_SHUFFLE()
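Spelled out, the answer's approach uses the same register for both shuffle inputs:

#include <xmmintrin.h>

// Returns {a[3], a[2], a[1], a[0]}: _MM_SHUFFLE(0,1,2,3) places source
// elements 3, 2, 1, 0 into result positions 0, 1, 2, 3.
__m128 reverse_ps(__m128 a) {
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3));
}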

SIMD: Accumulate Adjacent Pairs

青春壹個敷衍的年華 submitted on 2021-02-02 09:29:36
Question: I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but it's still tricky for a newbie like me. I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.

template <typename T, typename U>
inline void accumulate_2x2_x_pass(
    T* channel, U* accum,
    const size_t sx, const size_t sy, const size_t
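For the horizontal pass, one common trick for summing adjacent pairs of unsigned 8-bit pixels into wider accumulators is SSSE3 pmaddubsw with a vector of ones. A minimal sketch under the assumption of 8-bit channels; this is not the question's accumulate_2x2_x_pass itself:

#include <immintrin.h>

// Multiplies each unsigned byte by 1 and adds adjacent products, yielding
// eight 16-bit sums of the form pixels[2i] + pixels[2i+1] (max 510, no saturation).
__m128i sum_adjacent_u8_pairs(__m128i pixels) {
    return _mm_maddubs_epi16(pixels, _mm_set1_epi8(1));
}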

SSE2 double multiplication slower than with standard multiplication

感情迁移 submitted on 2021-01-29 09:42:28
Question: I'm wondering why the following code with SSE2 instructions performs the multiplication more slowly than the standard C++ implementation. Here is the code:

m_win = (double*)_aligned_malloc(size*sizeof(double), 16);
__m128d* pData = (__m128d*)input().data;
__m128d* pWin = (__m128d*)m_win;
__m128d* pOut = (__m128d*)m_output.data;
__m128d tmp;
int i=0;
for(; i<m_size/2; i++)
    pOut[i] = _mm_mul_pd(pData[i], pWin[i]);

The memory for m_output.data and input().data has been allocated with _aligned_malloc.

Can we use non-temporal mov instructions on heap memory?

天涯浪子 submitted on 2021-01-28 05:08:27
Question: In Agner Fog's "Optimizing subroutines in assembly language", section 11.8 "Cache control instructions", he says: "Memory writes are more expensive than reads when cache misses occur in a write-back cache. A whole cache line has to be read from memory, modified, and written back in case of a cache miss. This can be avoided by using the non-temporal write instructions MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPD, MOVNTPS. These instructions should be used when writing to a memory location that is unlikely
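Non-temporal stores do not care whether the destination is heap, stack, or a mapped file; the practical requirements are alignment (16 bytes for MOVNTDQ) and an sfence before anything that depends on ordering. A minimal sketch, assuming the size is a multiple of 16 (fill_nt is an illustrative name):

#include <immintrin.h>
#include <stddef.h>

void fill_nt(size_t bytes) {
    char* buf = (char*)_mm_malloc(bytes, 64);       // ordinary heap allocation, cache-line aligned
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128((__m128i*)(buf + i), zero); // MOVNTDQ: write-combining, bypasses the cache
    _mm_sfence();  // order the NT stores before any normal stores that follow
    _mm_free(buf);
}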

Is there an intrinsic function to zero out the last n bytes of a __m128i vector?

感情迁移 submitted on 2021-01-27 20:54:30
Question: Given n, I want to zero out the last n bytes of a __m128i vector. For instance, consider the following __m128i vector:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111

After zeroing out the last n = 4 bytes, the vector should look like:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 00000000 00000000 00000000 00000000

Is there an SSE
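There is no single intrinsic for a runtime n, but one common trick is an unaligned load of a sliding 0xFF/0x00 window to build a mask, then AND it in. A sketch assuming 0 <= n <= 16 (the helper name is illustrative):

#include <immintrin.h>

__m128i zero_last_n_bytes(__m128i v, int n) {
    // 16 bytes of ones followed by 16 bytes of zeros; loading at offset n
    // yields a mask with 0xFF in the first 16-n bytes and 0x00 in the last n.
    alignas(16) static const unsigned char window[32] = {
        0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
        0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
        0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0
    };
    __m128i mask = _mm_loadu_si128((const __m128i*)(window + n));
    return _mm_and_si128(v, mask);  // keeps the first 16-n bytes, zeroes the last n
}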

Multiply-add vectorization slower with AVX than with SSE

雨燕双飞 submitted on 2021-01-27 06:01:40
Question: I have a piece of code that is being run under a heavily contended lock, so it needs to be as fast as possible. The code is very simple - it's a basic multiply-add on a bunch of data, which looks like this:

for( int i = 0; i < size; i++ ) {
    c[i] += (double)a[i] * (double)b[i];
}

Under -O3 with SSE support enabled, the code is vectorized as I would expect it to be. However, with AVX code generation turned on I get about a 10-15% slowdown instead of a speedup, and I can't figure out why. Here's
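For reference, a hand-written AVX+FMA version of that loop might look like the sketch below. It assumes a and b are float arrays (hence the casts in the original), c is double, size is a multiple of 4, and the CPU supports FMA3 - none of which the question states explicitly.

#include <immintrin.h>

void madd_avx(double* c, const float* a, const float* b, int size) {
    for (int i = 0; i < size; i += 4) {
        __m256d va = _mm256_cvtps_pd(_mm_loadu_ps(a + i));  // 4 floats -> 4 doubles
        __m256d vb = _mm256_cvtps_pd(_mm_loadu_ps(b + i));
        __m256d vc = _mm256_loadu_pd(c + i);
        _mm256_storeu_pd(c + i, _mm256_fmadd_pd(va, vb, vc)); // c[i] += a[i] * b[i]
    }
}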

On x86-64, is the “movnti” or “movntdq” instruction atomic when the system crashes?

痴心易碎 submitted on 2021-01-27 05:35:15
Question: When using persistent memory like Intel Optane DCPMM, is it possible to see a partial result after reboot if the system crashes (power outage) during execution of a movnt instruction? For:

- 4- or 8-byte movnti, which x86 guarantees atomic for other purposes?
- 16-byte SSE movntdq / movntps, which aren't guaranteed atomic but which in practice probably are on CPUs supporting persistent memory.
- 32-byte AVX vmovntdq / vmovntps
- 64-byte AVX512 vmovntdq / vmovntps full-line stores
- Bonus question: MOVDIR64B, which has
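For the 8-byte case, the pattern typically relied on is a single aligned non-temporal store followed by sfence; a sketch, assuming slot points into a DAX-mapped persistent-memory region and is 8-byte aligned (pmem_store8 is an illustrative name, not a library function):

#include <immintrin.h>
#include <stdint.h>

static inline void pmem_store8(uint64_t* slot, uint64_t value) {
    _mm_stream_si64((long long*)slot, (long long)value);  // MOVNTI, one 8-byte store
    _mm_sfence();  // drain the write-combining buffer before the next ordering point
}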