sse

What are the 128-bit to 512-bit registers used for?

不羁的心 submitted on 2021-02-04 07:12:39
Question: After looking at a table of registers in the x86/x64 architecture, I noticed that there's a whole section of 128-, 256-, and 512-bit registers that I've never seen used in assembly or in decompiled C/C++ code: XMM(0-15) for 128-bit, YMM(0-15) for 256-bit, ZMM(0-31) for 512-bit. After doing a bit of digging, what I've gathered is that you have to use two 64-bit operations to perform math on a 128-bit number, instead of using the generic add, sub, mul, div operations. If this is the case, then what
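These registers are used by the SIMD instruction sets (SSE/AVX/AVX-512), which operate on several packed elements in parallel rather than on one very wide integer. A minimal sketch, assuming SSE2 and a length that is a multiple of four (add_arrays_sse is an illustrative name, not anything from the question):

#include <immintrin.h>

// Illustrative helper: adds four pairs of 32-bit ints per iteration,
// all inside a single XMM register.
void add_arrays_sse(int* dst, const int* a, const int* b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_add_epi32(va, vb));
    }
}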

Reverse byte order in XMM or YMM register?

你说的曾经没有我的故事 submitted on 2021-02-04 06:30:06
Question: Let's say I want to reverse the byte order of a very large byte array. I can do this the slow way using the main registers, but I would like to speed it up using the XMM or YMM registers. Is there a way to reverse the byte order in an XMM or YMM register? Answer 1: Yes, use SSSE3 _mm_shuffle_epi8 or AVX2 _mm256_shuffle_epi8 to shuffle bytes within 16-byte AVX2 "lanes". Depending on the shuffle control vector, you can swap pairs of bytes, reverse 4-byte units, or reverse 8-byte units. Or reverse all
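As a concrete illustration of the answer's SSSE3 suggestion, a small sketch that reverses all 16 bytes of one XMM register (the control vector would differ for the byte-swap and word-reverse variants mentioned above):

#include <immintrin.h>

// pshufb: result byte i = input byte selected by control byte i,
// so a descending control vector reverses the whole register.
__m128i reverse_bytes_128(__m128i v) {
    const __m128i rev = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7,  6,  5,  4,  3,  2, 1, 0);
    return _mm_shuffle_epi8(v, rev);
}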

Efficient SSE shuffle mask generation for left-packing byte elements

人盡茶涼 submitted on 2021-02-04 06:19:05
Question: What would be an efficient way to optimize the following code with SSE?

uint16_t change1 = ... ;
uint8_t* pSrc = ... ;
uint8_t* pDest = ... ;
if(change1 & 0x0001) *pDest++ = pSrc[0];
if(change1 & 0x0002) *pDest++ = pSrc[1];
if(change1 & 0x0004) *pDest++ = pSrc[2];
if(change1 & 0x0008) *pDest++ = pSrc[3];
if(change1 & 0x0010) *pDest++ = pSrc[4];
if(change1 & 0x0020) *pDest++ = pSrc[5];
if(change1 & 0x0040) *pDest++ = pSrc[6];
if(change1 & 0x0080) *pDest++ = pSrc[7];
if(change1 & 0x0100) *pDest
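One way to left-pack bytes under a bitmask, sketched here under the assumption of an AVX-512 VBMI2 capable CPU (Ice Lake or later); note it always stores 8 bytes, so pDest needs that much writable slack past the packed output. Older CPUs typically use a 256-entry lookup table of pshufb control vectors instead. The helper name and return value are illustrative, not from the question.

#include <immintrin.h>
#include <stdint.h>

// Packs the bytes of pSrc[0..7] selected by the low 8 bits of change1
// and returns how many bytes were kept.
int left_pack_8(uint8_t* pDest, const uint8_t* pSrc, uint16_t change1) {
    __m128i src    = _mm_loadl_epi64((const __m128i*)pSrc);            // low 8 bytes
    __m128i packed = _mm_maskz_compress_epi8((__mmask16)(change1 & 0xFF), src);
    _mm_storel_epi64((__m128i*)pDest, packed);   // stores 8 bytes; the unused tail is zero
    return _mm_popcnt_u32(change1 & 0xFF);
}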

How to reverse an __m128 type variable?

心已入冬 submitted on 2021-02-04 05:37:04
Question: I know this should be a Googling question, but I just cannot find the answer. Say I have an __m128 variable a whose content is a[0], a[1], a[2], a[3]. Is there a single function that can reverse it to be a[3], a[2], a[1], a[0]? Answer 1: Use _mm_shuffle_ps(). This instruction was already available in SSE and can gather 4 32-bit components into a single vector by combining two arbitrary 32-bit components from each of the two input vectors. How to create the mask using the macro _MM_SHUFFLE()
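Spelled out, the answer's approach uses the same register for both shuffle inputs:

#include <xmmintrin.h>

// Returns {a[3], a[2], a[1], a[0]}: _MM_SHUFFLE(0,1,2,3) places source
// elements 3, 2, 1, 0 into result positions 0, 1, 2, 3.
__m128 reverse_ps(__m128 a) {
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3));
}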

SIMD: Accumulate Adjacent Pairs

青春壹個敷衍的年華 submitted on 2021-02-02 09:29:36
Question: I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but it's still tricky for a newbie like me. I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.

template <typename T, typename U>
inline void accumulate_2x2_x_pass(
    T* channel, U* accum,
    const size_t sx, const size_t sy, const size_t
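For the horizontal pass, one common trick for summing adjacent pairs of unsigned 8-bit pixels into wider accumulators is SSSE3 pmaddubsw with a vector of ones. A minimal sketch under the assumption of 8-bit channels; this is not the question's accumulate_2x2_x_pass itself:

#include <immintrin.h>

// Multiplies each unsigned byte by 1 and adds adjacent products, yielding
// eight 16-bit sums of the form pixels[2i] + pixels[2i+1] (max 510, no saturation).
__m128i sum_adjacent_u8_pairs(__m128i pixels) {
    return _mm_maddubs_epi16(pixels, _mm_set1_epi8(1));
}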

SSE2 double multiplication slower than with standard multiplication

感情迁移 submitted on 2021-01-29 09:42:28
Question: I'm wondering why the following code with SSE2 instructions performs the multiplication more slowly than the standard C++ implementation. Here is the code:

m_win = (double*)_aligned_malloc(size*sizeof(double), 16);
__m128d* pData = (__m128d*)input().data;
__m128d* pWin = (__m128d*)m_win;
__m128d* pOut = (__m128d*)m_output.data;
__m128d tmp;
int i=0;
for(; i<m_size/2; i++)
    pOut[i] = _mm_mul_pd(pData[i], pWin[i]);

The memory for m_output.data and input().data has been allocated with _aligned_malloc.

Can we use non-temporal mov instructions on heap memory?

天涯浪子 submitted on 2021-01-28 05:08:27
Question: In Agner Fog's "Optimizing subroutines in assembly language", section 11.8 "Cache control instructions", he says: "Memory writes are more expensive than reads when cache misses occur in a write-back cache. A whole cache line has to be read from memory, modified, and written back in case of a cache miss. This can be avoided by using the non-temporal write instructions MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPD, MOVNTPS. These instructions should be used when writing to a memory location that is unlikely
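Non-temporal stores do not care whether the destination is heap, stack, or a mapped file; the practical requirements are alignment (16 bytes for MOVNTDQ) and an sfence before anything that depends on ordering. A minimal sketch, assuming the size is a multiple of 16 (fill_nt is an illustrative name):

#include <immintrin.h>
#include <stddef.h>

void fill_nt(size_t bytes) {
    char* buf = (char*)_mm_malloc(bytes, 64);       // ordinary heap allocation, cache-line aligned
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128((__m128i*)(buf + i), zero); // MOVNTDQ: write-combining, bypasses the cache
    _mm_sfence();  // order the NT stores before any normal stores that follow
    _mm_free(buf);
}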

Is there an intrinsic function to zero out the last n bytes of a __m128i vector?

感情迁移 submitted on 2021-01-27 20:54:30
Question: Given n, I want to zero out the last n bytes of a __m128i vector. For instance, consider the following __m128i vector:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111

After zeroing out the last n = 4 bytes, the vector should look like:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 00000000 00000000 00000000 00000000

Is there an SSE
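There is no single intrinsic for a runtime n, but one common trick is an unaligned load of a sliding 0xFF/0x00 window to build a mask, then AND it in. A sketch assuming 0 <= n <= 16 (the helper name is illustrative):

#include <immintrin.h>

__m128i zero_last_n_bytes(__m128i v, int n) {
    // 16 bytes of ones followed by 16 bytes of zeros; loading at offset n
    // yields a mask with 0xFF in the first 16-n bytes and 0x00 in the last n.
    alignas(16) static const unsigned char window[32] = {
        0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
        0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
        0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0
    };
    __m128i mask = _mm_loadu_si128((const __m128i*)(window + n));
    return _mm_and_si128(v, mask);  // keeps the first 16-n bytes, zeroes the last n
}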

Multiply-add vectorization slower with AVX than with SSE

雨燕双飞 submitted on 2021-01-27 06:01:40
Question: I have a piece of code that is being run under a heavily contended lock, so it needs to be as fast as possible. The code is very simple - it's a basic multiply-add on a bunch of data, which looks like this:

for( int i = 0; i < size; i++ ) {
    c[i] += (double)a[i] * (double)b[i];
}

Under -O3 with SSE support enabled, the code is vectorized as I would expect it to be. However, with AVX code generation turned on I get about a 10-15% slowdown instead of a speedup, and I can't figure out why. Here's
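For reference, a hand-written AVX+FMA version of that loop might look like the sketch below. It assumes a and b are float arrays (hence the casts in the original), c is double, size is a multiple of 4, and the CPU supports FMA3 - none of which the question states explicitly.

#include <immintrin.h>

void madd_avx(double* c, const float* a, const float* b, int size) {
    for (int i = 0; i < size; i += 4) {
        __m256d va = _mm256_cvtps_pd(_mm_loadu_ps(a + i));  // 4 floats -> 4 doubles
        __m256d vb = _mm256_cvtps_pd(_mm_loadu_ps(b + i));
        __m256d vc = _mm256_loadu_pd(c + i);
        _mm256_storeu_pd(c + i, _mm256_fmadd_pd(va, vb, vc)); // c[i] += a[i] * b[i]
    }
}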

On x86-64, is the “movnti” or “movntdq” instruction atomic when the system crashes?

痴心易碎 submitted on 2021-01-27 05:35:15
Question: When using persistent memory like Intel Optane DCPMM, is it possible to see a partial result after reboot if the system crashes (power outage) during execution of a movnt instruction? For:

- 4- or 8-byte movnti, which x86 guarantees atomic for other purposes?
- 16-byte SSE movntdq / movntps, which aren't guaranteed atomic but which in practice probably are on CPUs supporting persistent memory.
- 32-byte AVX vmovntdq / vmovntps
- 64-byte AVX512 vmovntdq / vmovntps full-line stores
- Bonus question: MOVDIR64B, which has
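For the 8-byte case, the pattern typically relied on is a single aligned non-temporal store followed by sfence; a sketch, assuming slot points into a DAX-mapped persistent-memory region and is 8-byte aligned (pmem_store8 is an illustrative name, not a library function):

#include <immintrin.h>
#include <stdint.h>

static inline void pmem_store8(uint64_t* slot, uint64_t value) {
    _mm_stream_si64((long long*)slot, (long long)value);  // MOVNTI, one 8-byte store
    _mm_sfence();  // drain the write-combining buffer before the next ordering point
}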