sse2 | 易学教程

Shifting xmm integer register values using non-AVX instructions on Intel x86 architecture

Question: I have the following problem, which I need to solve without AVX2. I have 3 values stored in an __m128i variable (the 4th value is not needed) and need to shift those values by 4, 3 and 5 bits respectively. I need two functions: one for the logical right shift by those amounts and another for the logical left shift. Does anyone know a solution to the problem using SSE/AVX? The only thing I could find was _mm_srlv_epi32(), which is AVX2. To add a little more information, here is the code I am trying to
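Because the shift counts in this question are fixed, one SSE2-only approach is to shift the whole vector once per count and then keep a different lane from each result. The sketch below is not taken from the thread. The helper names `pick_lanes_012`, `srl_435` and `sll_435` are hypothetical, and it assumes lane 0 is shifted by 4, lane 1 by 3 and lane 2 by 5, with the unused lane 3 zeroed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: take lane 0 from a, lane 1 from b, lane 2 from c.
// Lane 3 is zeroed, since the question says the 4th value is not needed.
static inline __m128i pick_lanes_012(__m128i a, __m128i b, __m128i c) {
    const __m128i m0 = _mm_set_epi32(0, 0, 0, -1);  // all-ones in lane 0
    const __m128i m1 = _mm_set_epi32(0, 0, -1, 0);  // all-ones in lane 1
    const __m128i m2 = _mm_set_epi32(0, -1, 0, 0);  // all-ones in lane 2
    return _mm_or_si128(
        _mm_or_si128(_mm_and_si128(a, m0), _mm_and_si128(b, m1)),
        _mm_and_si128(c, m2));
}

// Logical right shift: lane 0 >> 4, lane 1 >> 3, lane 2 >> 5 (assumed lane order).
inline __m128i srl_435(__m128i v) {
    return pick_lanes_012(_mm_srli_epi32(v, 4),
                          _mm_srli_epi32(v, 3),
                          _mm_srli_epi32(v, 5));
}

// Logical left shift with the same per-lane counts.
inline __m128i sll_435(__m128i v) {
    return pick_lanes_012(_mm_slli_epi32(v, 4),
                          _mm_slli_epi32(v, 3),
                          _mm_slli_epi32(v, 5));
}
```

This costs three shifts plus a few AND/OR operations per direction. If the counts were only known at run time, a different technique would be needed.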

SSE2 instructions not working in inline assembly with C++

Question: I have this function, which uses SSE2 to add some values together. It is supposed to add lhs and rhs and store the result back into lhs: template<typename T> void simdAdd(T *lhs,T *rhs) { asm volatile("movups %0,%%xmm0"::"m"(lhs)); asm volatile("movups %0,%%xmm1"::"m"(rhs)); switch(sizeof(T)) { case sizeof(uint8_t): asm volatile("paddb %%xmm0,%%xmm1":); break; case sizeof(uint16_t): asm volatile("paddw %%xmm0,%%xmm1":); break; case sizeof(float): asm volatile("addps %%xmm0,%%xmm1":);
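Two things in the excerpt are likely to cause trouble: the `"m"(lhs)` operand names the pointer variable itself rather than the data it points to, and register contents are not guaranteed to survive between separate `asm` statements. A common way to avoid both is to use intrinsics instead of inline assembly. The following is a minimal sketch of that alternative, not the accepted answer from the thread. It keeps the name `simdAdd` from the question but uses overloads per element type, and it assumes each pointer refers to at least 16 readable bytes.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE float ops)
#include <cstdint>

// 16 x uint8_t: lhs[i] += rhs[i], wrapping on overflow (paddb).
inline void simdAdd(std::uint8_t* lhs, const std::uint8_t* rhs) {
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lhs));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lhs), _mm_add_epi8(a, b));
}

// 8 x uint16_t: lhs[i] += rhs[i], wrapping on overflow (paddw).
inline void simdAdd(std::uint16_t* lhs, const std::uint16_t* rhs) {
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lhs));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lhs), _mm_add_epi16(a, b));
}

// 4 x float: lhs[i] += rhs[i] (addps).
inline void simdAdd(float* lhs, const float* rhs) {
    const __m128 a = _mm_loadu_ps(lhs);
    const __m128 b = _mm_loadu_ps(rhs);
    _mm_storeu_ps(lhs, _mm_add_ps(a, b));
}
```

Selecting by type rather than by `sizeof(T)` lets the compiler pick the right instruction even when two element types share a size.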

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

Question: The code I want to optimize is basically a simple but large arithmetic formula. It should be fairly simple to analyze the code automatically and compute the independent multiplications/additions in parallel, but I read that autovectorization only works for loops. I've read multiple times now that accessing single elements of a vector via a union or some other way should be avoided at all costs and should instead be replaced by a _mm_shuffle_pd (I'm working on doubles only)... I don't seem to figure
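For reference, SSE2 itself offers ways to get doubles out of a `__m128d` without a union: a store to memory, or a scalar extract of the low lane after moving the wanted lane there. The sketch below illustrates both. The function names are hypothetical and this is not quoted from the thread's answers.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Write both lanes to memory; out[0] receives the low lane.
inline void store_both(__m128d v, double out[2]) {
    _mm_storeu_pd(out, v);  // unaligned store, no alignment requirement on out
}

// Read the low lane directly as a scalar.
inline double low_lane(__m128d v) {
    return _mm_cvtsd_f64(v);
}

// Read the high lane: duplicate it into the low lane first, then extract.
inline double high_lane(__m128d v) {
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```

Which form is cheaper depends on whether the values are needed in memory or in scalar registers afterwards.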

SSE - AVX conversion from double to char

Question: I want to convert a vector of double-precision values to char. I have to implement two distinct approaches, one for SSE2 and the other for AVX2. I started with AVX2. __m128i sub_proc(__m256d& in) { __m256d _zero_pd = _mm256_setzero_pd(); __m256d ih_pd = _mm256_unpackhi_pd(in,_zero_pd); __m256d il_pd = _mm256_unpacklo_pd(in,_zero_pd); __m128i ih_si = _mm256_cvtpd_epi32(ih_pd); __m128i il_si = _mm256_cvtpd_epi32(il_pd); ih_si = _mm_shuffle_epi32(ih_si,_MM_SHUFFLE(3,1,2,0)); il_si = _mm_shuffle_epi32
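For the SSE2 half of this question, one possible shape is: convert each pair of doubles to 32-bit integers, join the two pairs, then narrow twice with saturating packs. The sketch below is an illustration under stated assumptions, not the thread's answer. The name `pd_to_i8_x4` is hypothetical, "char" is taken to mean signed 8-bit with saturation, and rounding follows the current MXCSR mode (round-to-nearest-even by default).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>

// Convert four doubles (two __m128d) to four signed 8-bit values.
// out[0], out[1] come from lo; out[2], out[3] come from hi.
inline void pd_to_i8_x4(__m128d lo, __m128d hi, std::int8_t out[4]) {
    const __m128i a = _mm_cvtpd_epi32(lo);               // [a0 a1 0 0]
    const __m128i b = _mm_cvtpd_epi32(hi);               // [b0 b1 0 0]
    const __m128i i32 = _mm_unpacklo_epi64(a, b);        // [a0 a1 b0 b1]
    const __m128i i16 = _mm_packs_epi32(i32, i32);       // saturate to int16
    const __m128i i8  = _mm_packs_epi16(i16, i16);       // saturate to int8
    const int packed = _mm_cvtsi128_si32(i8);            // low 4 bytes
    std::memcpy(out, &packed, sizeof packed);
}
```

Values outside the signed 8-bit range clamp to -128 or 127. If unsigned output were wanted instead, `_mm_packus_epi16` would replace the final pack.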

How to make the following code faster

Question: int u1, u2; unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; (unsigned long is 64 bits here; res1 and res2 are initialized to zero.) l = 60; while (l) { for (i = 0; i < 20; i += 2) { u1 = (elm1[i] >> l) & 15; u2 = (elm1[i + 1] >> l) & 15; for (k = 0; k < 20; k += 2) { simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]); simdb = _mm_load_si128 ((__m128i *) &res1[i + k]); simdb = _mm_xor_si128 (simda, simdb); _mm_store_si128 ((__m128i *)&res1[i + k], simdb); simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]

The best way to shift a __m128i?

Question: I need to shift a __m128i variable (say v) by m bits in such a way that the bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this? Note that _mm_slli_epi64 shifts v0 and v1 separately: r0 := v0 << count, r1 := v1 << count, so the top bits of v0 are lost, but I want to move those bits into r1. Edit: I am looking for code faster than this (m<64): r0 = v0 << m; r1 = v0 >> (64-m); r1 ^= v1 << m; r2 = v1 >> (64-m); Answer 1: For compile-time
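The scalar formula in the question maps fairly directly onto SSE2: shift both 64-bit halves left by m, then OR in the bits of the low half that should carry into the high half. The sketch below shows this for a run-time m with 0 <= m < 64. It is not the truncated answer above. The name `shl128` is hypothetical, and unlike the question's code it keeps only the low 128 bits of the result, dropping the overflow word r2.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// 128-bit logical left shift by m bits, 0 <= m < 64.
// Bits shifted out of the top of the 128-bit value are discarded.
inline __m128i shl128(__m128i v, int m) {
    assert(m >= 0 && m < 64);
    // Shift each 64-bit half left by m (this alone loses the carry).
    const __m128i halves = _mm_sll_epi64(v, _mm_cvtsi32_si128(m));
    // Move the low half into the high half (byte shift by 8), then shift it
    // right by 64 - m to isolate the carried bits. A count of 64 (m == 0)
    // yields zero, so no special case is needed.
    const __m128i carry =
        _mm_srl_epi64(_mm_slli_si128(v, 8), _mm_cvtsi32_si128(64 - m));
    return _mm_or_si128(halves, carry);
}
```

For m >= 64 a whole-byte shift with `_mm_slli_si128` followed by a smaller bit shift would be needed. That case is left out here.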

Optimizing RGB565 to RGB888 conversions with SSE2

I'm trying to optimize pixel depth conversion from 565 to 888 using SSE2 with the basic formula:

col8 = col5 << 3 | col5 >> 2
col8 = col6 << 2 | col6 >> 4

I take two 2x565 128-bit vectors and I'm outputting 3x888 128-bit vectors. After some masking, shifting and OR'ing I came to the point where I have two vectors with ((blue << 8) | red)* 8-bit colors stored in 16-bit words and similar vectors with zero-green values. Now I need to combine them into 888 output.

BR:   BR7-BR6-...-BR1-BR0
0G:   0G7-0G7-...-0G1-0G0
        |
        v
OUT1: R5-BGR4-...-BGR1-BGR0

In SSSE3 there is a _mm_shuffle_epi8() which solves my
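The bit-replication formula quoted in the question can be written out as follows, first as a scalar reference and then for eight pixels at once in SSE2. This sketch covers only the channel expansion step, not the final interleave into packed 888 output that the question is about. The names `Rgb888`, `rgb565_to_888` and `expand565_sse2` are hypothetical, and the RRRRRGGGGGGBBBBB bit layout is an assumption.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

struct Rgb888 { std::uint8_t r, g, b; };

// Scalar reference: replicate the top bits into the low bits so that
// 0x1F maps to 0xFF for 5-bit channels and 0x3F maps to 0xFF for green.
inline Rgb888 rgb565_to_888(std::uint16_t p) {
    const unsigned r5 = (p >> 11) & 0x1F;
    const unsigned g6 = (p >> 5) & 0x3F;
    const unsigned b5 = p & 0x1F;
    return { static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2)),
             static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4)),
             static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2)) };
}

// SSE2: same formula on eight 565 pixels; each output holds one 8-bit
// channel value per 16-bit lane (upper byte of every lane is zero).
inline void expand565_sse2(__m128i px, __m128i& r8, __m128i& g8, __m128i& b8) {
    const __m128i r5 = _mm_srli_epi16(px, 11);
    const __m128i g6 = _mm_and_si128(_mm_srli_epi16(px, 5), _mm_set1_epi16(0x3F));
    const __m128i b5 = _mm_and_si128(px, _mm_set1_epi16(0x1F));
    r8 = _mm_or_si128(_mm_slli_epi16(r5, 3), _mm_srli_epi16(r5, 2));
    g8 = _mm_or_si128(_mm_slli_epi16(g6, 2), _mm_srli_epi16(g6, 4));
    b8 = _mm_or_si128(_mm_slli_epi16(b5, 3), _mm_srli_epi16(b5, 2));
}
```

The scalar version is useful as a test oracle when checking any vectorized variant.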

Using % with SSE2?

Here's the code I'm trying to convert to SSE2: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left = audioLeft; double *right = audioRight; double phase = 0.0; double bp0 = mNoteFrequency * mHostPitch; for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) { // some other code (that will use phase) phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI); while (phase >= TWOPI) { phase -= TWOPI; } } Here's what I've achieved: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left =
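SSE2 has no remainder instruction, so a wrap such as the `while (phase >= TWOPI)` loop is usually expressed as x - floor(x / period) * period. The sketch below shows one way to do that for two doubles at a time using only SSE2. It is not the poster's solution. The name `wrap_nonneg_pd` is hypothetical, and it assumes every lane is non-negative with x / period below 2^31, so that truncation behaves like floor. The running `phase` sum in the question is still a serial dependency, which this sketch does not address.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Wrap each lane of x into [0, period).
// Assumes x >= 0 and x / period < 2^31 in every lane.
inline __m128d wrap_nonneg_pd(__m128d x, double period) {
    const __m128d p = _mm_set1_pd(period);
    const __m128d q = _mm_div_pd(x, p);
    // Truncate toward zero and convert back; equals floor(q) because q >= 0.
    const __m128d k = _mm_cvtepi32_pd(_mm_cvttpd_epi32(q));
    __m128d r = _mm_sub_pd(x, _mm_mul_pd(k, p));
    // Guard against rounding in the division: nudge results that landed
    // just outside the interval back into [0, period).
    r = _mm_add_pd(r, _mm_and_pd(_mm_cmplt_pd(r, _mm_setzero_pd()), p));
    r = _mm_sub_pd(r, _mm_and_pd(_mm_cmpge_pd(r, p), p));
    return r;
}
```

The compare results are all-ones or all-zero masks, so ANDing them with the period selects either the period or 0.0 per lane without branching.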

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code I want to optimize is basically a simple but large arithmetic formula. It should be fairly simple to analyze the code automatically and compute the independent multiplications/additions in parallel, but I read that autovectorization only works for loops. I've read multiple times now that accessing single elements of a vector via a union or some other way should be avoided at all costs and should instead be replaced by a _mm_shuffle_pd (I'm working on doubles only)... I don't seem to figure out how I can store the content of a __m128d vector as doubles without accessing it as a union. Also,

SSE - AVX conversion from double to char

I want to convert a vector of double-precision values to char. I have to implement two distinct approaches, one for SSE2 and the other for AVX2. I started with AVX2. __m128i sub_proc(__m256d& in) { __m256d _zero_pd = _mm256_setzero_pd(); __m256d ih_pd = _mm256_unpackhi_pd(in,_zero_pd); __m256d il_pd = _mm256_unpacklo_pd(in,_zero_pd); __m128i ih_si = _mm256_cvtpd_epi32(ih_pd); __m128i il_si = _mm256_cvtpd_epi32(il_pd); ih_si = _mm_shuffle_epi32(ih_si,_MM_SHUFFLE(3,1,2,0)); il_si = _mm_shuffle_epi32(il_si,_MM_SHUFFLE(3,1,2,0)); ih_si = _mm_packs_epi32(_mm_unpacklo_epi32(il_si,ih_si),_mm_unpackhi