sse2

Shifting xmm integer register values using non-AVX instructions on Intel x86 architecture

大兔子大兔子 submitted on 2019-12-11 00:27:07
Question: I have the following problem, which I need to solve using anything other than AVX2. I have 3 values stored in a __m128i variable (the 4th value is not needed) and need to shift those values by 4, 3, 5. I need two functions: one for the right logical shift by those values and another for the left logical shift. Does anyone know a solution to the problem using SSE/AVX? The only thing I could find was _mm_srlv_epi32(), which is AVX2. To add a little more information, here is the code I am trying to
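One possible SSE2-only approach to the per-lane shift described above is to shift the whole register once per count and then keep only the lane each count belongs to. The sketch below assumes lane 0 is shifted by 4, lane 1 by 3 and lane 2 by 5, and that the unused lane 3 may be zeroed; the function names are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Assumption: lane 0 shifts by 4, lane 1 by 3, lane 2 by 5; lane 3 is zeroed.
// _mm_set_epi32 takes its arguments from the highest lane down to lane 0.
static inline __m128i srl_4_3_5(__m128i v) {
    const __m128i lane0 = _mm_set_epi32(0, 0, 0, -1);
    const __m128i lane1 = _mm_set_epi32(0, 0, -1, 0);
    const __m128i lane2 = _mm_set_epi32(0, -1, 0, 0);
    __m128i r0 = _mm_and_si128(_mm_srli_epi32(v, 4), lane0);
    __m128i r1 = _mm_and_si128(_mm_srli_epi32(v, 3), lane1);
    __m128i r2 = _mm_and_si128(_mm_srli_epi32(v, 5), lane2);
    return _mm_or_si128(_mm_or_si128(r0, r1), r2);
}

// Same idea for the left logical shift.
static inline __m128i sll_4_3_5(__m128i v) {
    const __m128i lane0 = _mm_set_epi32(0, 0, 0, -1);
    const __m128i lane1 = _mm_set_epi32(0, 0, -1, 0);
    const __m128i lane2 = _mm_set_epi32(0, -1, 0, 0);
    __m128i r0 = _mm_and_si128(_mm_slli_epi32(v, 4), lane0);
    __m128i r1 = _mm_and_si128(_mm_slli_epi32(v, 3), lane1);
    __m128i r2 = _mm_and_si128(_mm_slli_epi32(v, 5), lane2);
    return _mm_or_si128(_mm_or_si128(r0, r1), r2);
}
```

This costs three shifts, three ANDs and two ORs per direction, which is usually acceptable when the counts are compile-time constants.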

SSE2 instructions not working in inline assembly with C++

China☆狼群 submitted on 2019-12-08 07:21:29
Question: I have this function which uses SSE2 to add some values together. It's supposed to add lhs and rhs together and store the result back into lhs: template<typename T> void simdAdd(T *lhs,T *rhs) { asm volatile("movups %0,%%xmm0"::"m"(lhs)); asm volatile("movups %0,%%xmm1"::"m"(rhs)); switch(sizeof(T)) { case sizeof(uint8_t): asm volatile("paddb %%xmm0,%%xmm1":); break; case sizeof(uint16_t): asm volatile("paddw %%xmm0,%%xmm1":); break; case sizeof(float): asm volatile("addps %%xmm0,%%xmm1":);
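In the quoted snippet, the operand `"m"(lhs)` names the pointer variable itself rather than the data it points to, and the compiler gives no guarantee that xmm0/xmm1 survive between separate `asm` statements. A minimal sketch of the same operation with intrinsics, which sidesteps both issues, might look like this. It assumes each call processes exactly 16 bytes and uses overloads instead of the `sizeof` switch; this is an illustration, not the fix from the original thread.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE float ones)
#include <cassert>
#include <cstdint>

// Each overload adds one 128-bit vector: lhs[i] += rhs[i] for 16 bytes of data.
inline void simdAdd(std::uint8_t* lhs, const std::uint8_t* rhs) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lhs), _mm_add_epi8(a, b));
}

inline void simdAdd(std::uint16_t* lhs, const std::uint16_t* rhs) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lhs), _mm_add_epi16(a, b));
}

inline void simdAdd(float* lhs, const float* rhs) {
    _mm_storeu_ps(lhs, _mm_add_ps(_mm_loadu_ps(lhs), _mm_loadu_ps(rhs)));
}
```

Overloading on the element type also avoids the ambiguity of switching on `sizeof`, where `float` and a 32-bit integer would land in the same case.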

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

青春壹個敷衍的年華 submitted on 2019-12-07 03:46:31
Question: The code i want to optimize is basically a simple but large arithmetic formula, it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in parallel, but i read that autovectorization only works for loops. I've read multiple times now that access of single elements in a vector via union or some other way should be avoided at all costs, instead should be replaced by a _mm_shuffle_pd (i'm working on doubles only)... I don't seem to figure
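For getting doubles out of a `__m128d` without a union, SSE2 already provides store and extract intrinsics. The sketch below shows three common options; the helper names are made up for illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Write both lanes to memory; out[0] receives the low lane.
inline void store_both(__m128d v, double* out) {
    assert(out != nullptr);
    _mm_storeu_pd(out, v);  // unaligned store, no alignment requirement on out
}

// Extract the low lane as a scalar.
inline double low_lane(__m128d v) {
    return _mm_cvtsd_f64(v);
}

// Move the high lane into the low position, then extract it.
inline double high_lane(__m128d v) {
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```

If both values are needed in memory anyway, the single store tends to be the simplest choice; the extract helpers are useful when only one scalar is consumed.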

SSE - AVX conversion from double to char

梦想的初衷 submitted on 2019-12-06 12:41:57
Question: I want to convert a vector of double precision values to char. I have to make two distinct approaches, one for SSE2 and the other for AVX2. I started with AVX2. __m128i sub_proc(__m256d& in) { __m256d _zero_pd = _mm256_setzero_pd(); __m256d ih_pd = _mm256_unpackhi_pd(in,_zero_pd); __m256d il_pd = _mm256_unpacklo_pd(in,_zero_pd); __m128i ih_si = _mm256_cvtpd_epi32(ih_pd); __m128i il_si = _mm256_cvtpd_epi32(il_pd); ih_si = _mm_shuffle_epi32(ih_si,_MM_SHUFFLE(3,1,2,0)); il_si = _mm_shuffle_epi32
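For the SSE2 half of the question, one plausible route is to convert pairs of doubles to 32-bit integers and then narrow twice with the saturating pack instructions. The sketch below assumes signed chars with saturation are wanted and that eight doubles are converted per call; the function name is hypothetical and this is not the code from the original thread.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Converts 8 doubles (in memory order) to 8 signed 8-bit values placed in the
// low 64 bits of the result; the high 64 bits are zero.
// Rounding follows the current MXCSR mode (round-to-nearest-even by default).
inline __m128i pd8_to_epi8_sse2(const double* in) {
    assert(in != nullptr);
    // Each _mm_cvtpd_epi32 yields two int32 values in the low 64 bits.
    __m128i i0 = _mm_cvtpd_epi32(_mm_loadu_pd(in + 0));
    __m128i i1 = _mm_cvtpd_epi32(_mm_loadu_pd(in + 2));
    __m128i i2 = _mm_cvtpd_epi32(_mm_loadu_pd(in + 4));
    __m128i i3 = _mm_cvtpd_epi32(_mm_loadu_pd(in + 6));
    __m128i lo = _mm_unpacklo_epi64(i0, i1);  // int32 values 0..3
    __m128i hi = _mm_unpacklo_epi64(i2, i3);  // int32 values 4..7
    __m128i w  = _mm_packs_epi32(lo, hi);     // 8 x int16, signed saturation
    return _mm_packs_epi16(w, _mm_setzero_si128());  // 8 x int8 in the low half
}
```

Values outside the signed 8-bit range clamp to -128 or 127 because both pack steps saturate.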

How to make the following code faster

白昼怎懂夜的黑 submitted on 2019-12-06 08:31:53
Question: int u1, u2; unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long res1, res2 initialized to zero. l = 60; while (l) { for (i = 0; i < 20; i += 2) { u1 = (elm1[i] >> l) & 15; u2 = (elm1[i + 1] >> l) & 15; for (k = 0; k < 20; k += 2) { simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]); simdb = _mm_load_si128 ((__m128i *) &res1[i + k]); simdb = _mm_xor_si128 (simda, simdb); _mm_store_si128 ((__m128i *)&res1[i + k], simdb); simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]

The best way to shift a __m128i?

a 夏天 submitted on 2019-12-06 02:40:18
Question: I need to shift a __m128i variable (say v) by m bits, in such a way that bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this? Note that _mm_slli_epi64 shifts v0 and v1 separately: r0 := v0 << count; r1 := v1 << count, so the last bits of v0 are lost, but I want to move those bits to r1. Edit: I'm looking for code faster than this (m<64): r0 = v0 << m; r1 = v0 >> (64-m); r1 ^= v1 << m; r2 = v1 >> (64-m); Answer 1: For compile-time
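The scalar recipe in the question maps fairly directly onto SSE2: shift both 64-bit halves, compute the bits that fall off the top of each half, and move the low half's carry up by eight bytes. The sketch below handles a run-time count with 0 <= m < 64 and discards the overflow past bit 127 (the `r2` term in the question); the function name is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// 128-bit logical left shift by m bits, 0 <= m < 64.
inline __m128i shl128_lt64(__m128i v, int m) {
    assert(m >= 0 && m < 64);
    __m128i cnt = _mm_cvtsi32_si128(m);
    __m128i inv = _mm_cvtsi32_si128(64 - m);
    __m128i shifted = _mm_sll_epi64(v, cnt);  // each 64-bit half << m
    __m128i carry   = _mm_srl_epi64(v, inv);  // bits leaving each half (0 when m == 0)
    carry = _mm_slli_si128(carry, 8);         // low half's carry moves to the high half
    return _mm_or_si128(shifted, carry);
}
```

The byte shift also drops the high half's carry, which is what truncating to 128 bits requires. A shift count of 64 in `_mm_srl_epi64` yields zero, so m == 0 needs no special case.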

Optimizing RGB565 to RGB888 conversions with SSE2

主宰稳场 submitted on 2019-12-05 21:56:11
I'm trying to optimize pixel depth conversion from 565 to 888 using SSE2 with the basic formula: col8 = col5 << 3 | col5 >> 2; col8 = col6 << 2 | col6 >> 4. I take two 2x565 128-bit vectors and I'm outputting 3x888 128-bit vectors. After some masking, shifting and OR'ing I came to the point where I have two vectors with ((blue << 8) | red)* 8-bit colors stored in 16-bit words and similar vectors with zero-green values. Now I need to combine them into 888 output. BR: BR7-BR6-...-BR1-BR0 0G: 0G7-0G7-...-0G1-0G0 | v OUT1: R5-BGR4-...-BGR1-BGR0 In SSSE3 there is a _mm_shuffle_epi8() which solves my
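The bit-replication formula quoted above can be applied to eight pixels at once with SSE2 16-bit shifts. The sketch below covers only that channel-expansion step, not the final interleave into packed 888 that the question is about. It assumes the common layout with red in bits 15..11, green in bits 10..5 and blue in bits 4..0; the struct and function names are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// One 8-bit channel value per 16-bit lane, for 8 pixels.
struct Rgb16 {
    __m128i r, g, b;
};

// Assumption: RGB565 with red in the top 5 bits and blue in the bottom 5.
inline Rgb16 expand565(__m128i px) {
    __m128i r5 = _mm_srli_epi16(px, 11);
    __m128i g6 = _mm_and_si128(_mm_srli_epi16(px, 5), _mm_set1_epi16(0x3F));
    __m128i b5 = _mm_and_si128(px, _mm_set1_epi16(0x1F));
    Rgb16 out;
    out.r = _mm_or_si128(_mm_slli_epi16(r5, 3), _mm_srli_epi16(r5, 2));  // col5 << 3 | col5 >> 2
    out.g = _mm_or_si128(_mm_slli_epi16(g6, 2), _mm_srli_epi16(g6, 4));  // col6 << 2 | col6 >> 4
    out.b = _mm_or_si128(_mm_slli_epi16(b5, 3), _mm_srli_epi16(b5, 2));
    return out;
}
```

Replicating the top bits into the low bits makes full-scale 5-bit and 6-bit values map to 255 rather than 248 or 252.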

Using % with SSE2?

懵懂的女人 submitted on 2019-12-05 20:52:15
Here's the code I'm trying to convert to SSE2: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left = audioLeft; double *right = audioRight; double phase = 0.0; double bp0 = mNoteFrequency * mHostPitch; for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) { // some other code (that will use phase) phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI); while (phase >= TWOPI) { phase -= TWOPI; } } Here's what I've achieved: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left =
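In the scalar loop above, each increment is clamped to [0, PI], so a phase that starts below TWOPI can exceed TWOPI by less than one full turn and the `while` loop runs at most once. Under that assumption, the wrap can be written in SSE2 as a compare followed by a masked subtract, with no modulo needed. The sketch below shows only the wrap for two independent phases; it does not address the serial dependency of accumulating `phase` across samples, and the function name is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Subtracts 2*pi once from every lane that is >= 2*pi.
// Assumption: each lane is already below 2 * (2*pi), so one subtraction suffices.
inline __m128d wrap_once(__m128d phase) {
    const __m128d twopi = _mm_set1_pd(6.283185307179586);
    __m128d mask = _mm_cmpge_pd(phase, twopi);           // all-ones where phase >= 2*pi
    return _mm_sub_pd(phase, _mm_and_pd(mask, twopi));   // subtract 2*pi or 0.0
}
```

Lanes below 2*pi pass through unchanged because the masked operand is 0.0 for them.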

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

旧时模样 submitted on 2019-12-05 08:46:42
The code i want to optimize is basically a simple but large arithmetic formula, it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in parallel, but i read that autovectorization only works for loops. I've read multiple times now that access of single elements in a vector via union or some other way should be avoided at all costs, instead should be replaced by a _mm_shuffle_pd (i'm working on doubles only)... I don't seem to figure out how I can store the content of a __m128d vector as doubles without accessing it as a union. Also,

SSE - AVX conversion from double to char

别说谁变了你拦得住时间么 submitted on 2019-12-04 18:34:19
I want to convert a vector of double precision values to char. I have to make two distinct approaches, one for SSE2 and the other for AVX2. I started with AVX2. __m128i sub_proc(__m256d& in) { __m256d _zero_pd = _mm256_setzero_pd(); __m256d ih_pd = _mm256_unpackhi_pd(in,_zero_pd); __m256d il_pd = _mm256_unpacklo_pd(in,_zero_pd); __m128i ih_si = _mm256_cvtpd_epi32(ih_pd); __m128i il_si = _mm256_cvtpd_epi32(il_pd); ih_si = _mm_shuffle_epi32(ih_si,_MM_SHUFFLE(3,1,2,0)); il_si = _mm_shuffle_epi32(il_si,_MM_SHUFFLE(3,1,2,0)); ih_si = _mm_packs_epi32(_mm_unpacklo_epi32(il_si,ih_si),_mm_unpackhi