sse

Addressing a non-integer address, and SSE

别等时光非礼了梦想 · Submitted on 2019-12-23 04:56:47
Question: I am trying to accelerate my code using SSE, and the following code works well. Basically, a __m128 variable should point to 4 floats in a row, in order to do 4 operations at once. This code is equivalent to computing c[i] = a[i] + b[i] for i from 0 to 3.

    float *data1, *data2, *data3;
    // ... code allocating data1-3, which are very long ...
    __m128* a = (__m128*) (data1);
    __m128* b = (__m128*) (data2);
    __m128* c = (__m128*) (data3);
    *c = _mm_add_ps(*a, *b);

However, when I want to shift a bit the
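A likely culprit when the pointer is shifted is alignment: dereferencing a casted __m128* compiles to an aligned load, which faults (or silently misbehaves) on addresses that are not 16-byte multiples. A minimal sketch of the unaligned-safe alternative, using _mm_loadu_ps/_mm_storeu_ps; the helper name add4_unaligned is mine, not from the question:

```c
#include <emmintrin.h>  /* SSE2 */

/* Add 4 floats starting at ANY address. _mm_loadu_ps/_mm_storeu_ps
   tolerate unaligned pointers, unlike dereferencing a casted __m128*. */
static void add4_unaligned(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));
}
```

Calling add4_unaligned(data1 + 1, data2 + 1, data3 + 1) then works for any shift, at a small cost on older CPUs where unaligned loads were slower.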

How to rewrite this code to SSE intrinsics

 ̄綄美尐妖づ · Submitted on 2019-12-23 04:05:26
Question: I'm new to SSE intrinsics and would appreciate some hints and assistance in using them (as this is still foggy to me). I have this code:

    for (int k = 0; k <= n - 4; k += 4) {
        int xc0 = 512 + ((idx + k*iddx) >> 6);
        int yc0 = 512 + ((idy + k*iddy) >> 6);
        int xc1 = 512 + ((idx + (k+1)*iddx) >> 6);
        int yc1 = 512 + ((idy + (k+1)*iddy) >> 6);
        int xc2 = 512 + ((idx + (k+2)*iddx) >> 6);
        int yc2 = 512 + ((idy + (k+2)*iddy) >> 6);
        int xc3 = 512 + ((idx + (k+3)*iddx) >> 6);
        int yc3 = 512 + ((idy + (k+3)*iddy) >> 6);
        unsigned color0 =
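Since the loop is cut off before the color computation, here is just the index arithmetic vectorized: a sketch (the helper name xc4 is mine) that computes four consecutive xc values with one packed add and one arithmetic shift; the yc values follow the same pattern with idy/iddy.

```c
#include <emmintrin.h>  /* SSE2 */

/* Compute xc for k, k+1, k+2, k+3 in one shot: 512 + ((idx + k*iddx) >> 6).
   _mm_srai_epi32 is the arithmetic >> that the scalar code relies on.
   The k*iddx products are formed scalar to stay within SSE2
   (_mm_mullo_epi32 only arrives with SSE4.1). */
static void xc4(int idx, int iddx, int k, int out[4]) {
    __m128i step = _mm_set_epi32((k + 3) * iddx, (k + 2) * iddx,
                                 (k + 1) * iddx,  k      * iddx);
    __m128i v    = _mm_add_epi32(_mm_set1_epi32(idx), step);
    __m128i xc   = _mm_add_epi32(_mm_set1_epi32(512), _mm_srai_epi32(v, 6));
    _mm_storeu_si128((__m128i *)out, xc);
}
```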

SSE2 shift by vector

人盡茶涼 · Submitted on 2019-12-23 00:29:11
Question: I've been trying to implement shift by vector in SSE2 intrinsics, but from experimentation and the Intel intrinsics guide, it appears to use only the least-significant part of the shift vector. To reword my question: given a vector {v1, v2, ..., vn} and a set of shifts {s1, s2, ..., sn}, how do I calculate a result {r1, r2, ..., rn} such that:

    r1 = v1 << s1
    r2 = v2 << s2
    ...
    rn = vn << sn

since it appears that _mm_sll_epi* performs:

    r1 = v1 << s1
    r2 = v2 << s1
    ...
    rn = vn << s1

Thanks in
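AVX2 later added exactly this operation (_mm_sllv_epi32 / _mm_sllv_epi64), but within SSE2 a common workaround for 16-bit lanes is to multiply by per-lane powers of two, since v << s equals v * (1 << s). A sketch under that assumption (the helper name shl_var16 is mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Per-lane variable left shift for 16-bit lanes: _mm_mullo_epi16
   multiplies each lane independently, so multiplying by 2^s[i]
   emulates v[i] << s[i]. Shift counts must be 0..15. */
static void shl_var16(const int16_t v[8], const int16_t s[8], int16_t r[8]) {
    int16_t p[8];
    for (int i = 0; i < 8; ++i)
        p[i] = (int16_t)(1 << s[i]);              /* per-lane 2^s */
    __m128i vv = _mm_loadu_si128((const __m128i *)v);
    __m128i vp = _mm_loadu_si128((const __m128i *)p);
    _mm_storeu_si128((__m128i *)r, _mm_mullo_epi16(vv, vp));
}
```

In a real loop the power-of-two table would be built once with SIMD as well (or precomputed) rather than per call.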

SSE2 8x8 byte-matrix transpose code twice as slow on Haswell+ than on Ivy Bridge

大城市里の小女人 · Submitted on 2019-12-22 10:53:59
Question: I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling. I profiled it with IACA to see if it was worth writing an AVX2 routine for, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles; HSW, SKL: 36 cycles). (IVB and HSW measured with IACA 2.1, SKL with 3.0, but HSW gives the same number with 3.0.) From the IACA output I guess the difference is that IVB uses ports 1 and 5 for

Help me improve some more SSE2 code

别说谁变了你拦得住时间么 · Submitted on 2019-12-22 10:38:45
Question: I am looking for some help to improve this bilinear-scaling SSE2 code on Core 2 CPUs. On my Atom N270 and on an i7 this code is about 2x faster than the MMX code, but on Core 2 CPUs it is only equal to the MMX code. Code follows:

    void ConversionProcess::convert_SSE2(BBitmap *from, BBitmap *to) {
        uint32 fromBPR, toBPR, fromBPRDIV4, x, y, yr, xr;
        ULLint start = rdtsc();
        ULLint stop;
        if (from && to) {
            uint32 width, height;
            width = from->Bounds().IntegerWidth() + 1;
            height = from->Bounds()

Compare two 16-byte values for equality using up to SSE 4.2?

橙三吉。 · Submitted on 2019-12-22 09:55:56
Question: I have a struct like this:

    struct {
        uint32_t a;
        uint16_t b;
        uint16_t c;
        uint16_t d;
        uint8_t  e;
    } s;

and I would like to compare two of the above structs for equality, in the fastest way possible. I looked at the Intel Intrinsics Guide but couldn't find a compare for integers; the options available were mainly for double- and single-precision floating-point vector inputs. Could somebody please advise the best approach? I can add a union to my struct to make processing easier. I am limited (for now) to using
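Integer compares do exist down at the SSE2 level: PCMPEQB compares 16 bytes at once, and PMOVMSKB collapses the result into a 16-bit mask. A sketch (not necessarily the asker's final layout) that pads the struct to exactly 16 bytes and compares raw bytes; the struct name S, the zeroed padding field, and equal16 are illustrative assumptions, and the padding must be kept zeroed for byte-wise equality to be valid:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Assumption: padded to exactly 16 bytes, padding zero-initialized,
   so the raw bytes fully determine equality. */
struct S {
    uint32_t a;
    uint16_t b, c, d;
    uint8_t  e;
    uint8_t  pad[5];
};

static int equal16(const struct S *x, const struct S *y) {
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    /* PCMPEQB writes 0xFF per equal byte; movemask gathers the sign bits */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(vx, vy)) == 0xFFFF;
}
```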

Why should you not access the __m128i fields directly?

风格不统一 · Submitted on 2019-12-22 06:48:50
Question: I was reading this on MSDN, and it says:

    You should not access the __m128i fields directly. You can, however, see
    these types in the debugger. A variable of type __m128i maps to the
    XMM[0-7] registers.

However, it doesn't explain why. Why is it? For example, is the following "bad":

    void func(unsigned short x, unsigned short y) {
        __m128i a;
        a.m128i_i64[0] = x;
        __m128i b;
        b.m128i_i64[0] = y;
        // Now do something with a and b ...
    }

Instead of doing the assignments like in the example above, should
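The usual objections are that m128i_i64 is an MSVC-specific union member (the field does not exist under GCC/Clang), and that element-wise stores force the value through memory instead of letting it live in an XMM register. A hedged sketch of the intrinsic-based alternative (add_low32 is an illustrative name, not from the question):

```c
#include <emmintrin.h>  /* SSE2 */

/* Instead of a.m128i_i64[0] = x, build the vectors with intrinsics:
   _mm_cvtsi32_si128 puts the value in the low 32 bits (rest zeroed),
   portably, and the compiler can keep everything in registers. */
static int add_low32(unsigned short x, unsigned short y) {
    __m128i a = _mm_cvtsi32_si128(x);
    __m128i b = _mm_cvtsi32_si128(y);
    return _mm_cvtsi128_si32(_mm_add_epi64(a, b));
}
```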

How to compare more than two numbers in parallel?

帅比萌擦擦* · Submitted on 2019-12-22 05:39:13
Question: Is it possible to compare more than a pair of numbers in one instruction using SSE4? The Intel reference says the following about PCMPGTQ:

    PCMPGTQ — Compare Packed Data for Greater Than
    Performs an SIMD compare for the packed quadwords in the destination
    operand (first operand) and the source operand (second operand). If the
    data element in the first (destination) operand is greater than the
    corresponding element in the second (source) operand, the corresponding
    data element in the destination is
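As the quoted text suggests, one PCMPGTQ compares two signed 64-bit pairs at once (and PCMPGTD, available since SSE2, compares four 32-bit pairs). A sketch using the _mm_cmpgt_epi64 intrinsic; the GCC/Clang target attribute is only there so it compiles without a global -msse4.2 flag, and gt2 is an illustrative name:

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpgt_epi64 maps to PCMPGTQ */
#include <stdint.h>

/* Compare two pairs of signed 64-bit integers at once; each result
   lane becomes all-ones (-1) where a > b, else 0. */
__attribute__((target("sse4.2")))
static void gt2(int64_t a0, int64_t a1,
                int64_t b0, int64_t b1, int64_t r[2]) {
    __m128i a = _mm_set_epi64x(a1, a0);  /* set_epi64x takes high, low */
    __m128i b = _mm_set_epi64x(b1, b0);
    _mm_storeu_si128((__m128i *)r, _mm_cmpgt_epi64(a, b));
}
```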

Efficient way of rotating a byte inside an AVX register

一曲冷凌霜 · Submitted on 2019-12-22 04:49:16
Question: Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in a YMM register, I need to left-rotate 7 of them. Each byte needs to be rotated one bit more to the left than the previous one: the first byte should be rotated by 0 bits and the seventh by 6 bits. Currently, I have an implementation that does this by [using the 1-bit rotate as an example here] shifting the
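For context, the shift-and-blend baseline the question describes looks roughly like this. x86 has no byte-granular shifts, so the trick is to shift 16-bit lanes and mask off the bits that leaked across byte boundaries, then OR the two halves. Shown here as a 128-bit SSE2 sketch for a uniform rotate count (the per-byte-varying counts from the question would need per-lane masks or a multiply-based variant); rotl_bytes is an illustrative name:

```c
#include <emmintrin.h>  /* SSE2 */

/* Rotate every byte in v left by n bits (n in 1..7): emulate byte
   shifts with 16-bit shifts plus byte masks, then OR the halves. */
static __m128i rotl_bytes(__m128i v, int n) {
    __m128i lo = _mm_and_si128(_mm_slli_epi16(v, n),
                               _mm_set1_epi8((char)(0xFFu << n)));
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 8 - n),
                               _mm_set1_epi8((char)(0xFFu >> (8 - n))));
    return _mm_or_si128(lo, hi);
}
```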

Strange uint32_t to float array conversion

放肆的年华 · Submitted on 2019-12-22 04:10:42
Question: I have the following code snippet:

    #include <cstdio>
    #include <cstdint>

    static const size_t ARR_SIZE = 129;

    int main() {
        uint32_t value = 2570980487;
        uint32_t arr[ARR_SIZE];
        for (int x = 0; x < ARR_SIZE; ++x)
            arr[x] = value;
        float arr_dst[ARR_SIZE];
        for (int x = 0; x < ARR_SIZE; ++x) {
            arr_dst[x] = static_cast<float>(arr[x]);
        }
        printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");
        printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
        printf("magic = %0.10f\n", arr
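A plausible explanation (an assumption about the compiler's code generation, worth verifying in the disassembly): the vectorized body of the loop uses CVTDQ2PS, which treats lanes as signed 32-bit integers and so needs a fixup for values above INT32_MAX, while the scalar remainder (ARR_SIZE = 129 is not a multiple of the vector width) converts correctly, and the two paths can round differently. A sketch isolating the signed reinterpretation (cvt_as_signed is an illustrative name):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* _mm_cvtepi32_ps (CVTDQ2PS) converts SIGNED 32-bit lanes, so a
   uint32_t above INT32_MAX comes out negative without a fixup. */
static float cvt_as_signed(uint32_t v) {
    float out[4];
    _mm_storeu_ps(out, _mm_cvtepi32_ps(_mm_set1_epi32((int32_t)v)));
    return out[0];
}
```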