SSE

How to efficiently perform double/int64 conversions with SSE/AVX?

Submitted by 与世无争的帅哥 on 2019-12-17 05:05:30

Question: SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers, _mm_cvtps_epi32() and _mm_cvtepi32_ps(), but there are no equivalents for double precision and 64-bit integers. In other words, these are missing: _mm_cvtpd_epi64() and _mm_cvtepi64_pd(). It seems that AVX doesn't have them either. What is the most efficient way to simulate these intrinsics?

Answer 1: There is no single instruction until AVX-512, which added conversion to/from 64-bit integers, signed or unsigned.
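For the signed case, a minimal sketch of the well-known magic-number trick, valid only when the values fit in [-2^51, 2^51] (the constant and function names here are mine):

    #include <emmintrin.h>

    // Magic constant 2^52 + 2^51: adding it to a double in [-2^51, 2^51]
    // leaves the (rounded) integer value in the low mantissa bits.
    #define MAGIC_2_52_2_51 0x0018000000000000

    // double -> int64 (SSE2), valid only for values in [-2^51, 2^51]
    static inline __m128i double_to_int64(__m128d x) {
        x = _mm_add_pd(x, _mm_set1_pd(MAGIC_2_52_2_51));
        return _mm_sub_epi64(_mm_castpd_si128(x),
                             _mm_castpd_si128(_mm_set1_pd(MAGIC_2_52_2_51)));
    }

    // int64 -> double (SSE2), same range restriction
    static inline __m128d int64_to_double(__m128i x) {
        x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(MAGIC_2_52_2_51)));
        return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(MAGIC_2_52_2_51));
    }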

Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

Submitted by 左心房为你撑大大i on 2019-12-17 04:01:13

Question: I am trying to find the most efficient implementation of 4x4 matrix (M) multiplication with a vector (u) using SSE, i.e. Mu = v. As far as I understand, there are two primary ways to go about this:

    Method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
    Method 2) v = u1*col1 + u2*col2 + u3*col3 + u4*col4

Method 2 is easy to implement in SSE2. Method 1 can be implemented with either the horizontal-add instruction in SSE3 or the dot-product instruction in SSE4.
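A minimal sketch of method 2, assuming the matrix is available (or has been transposed) as four column vectors; the function name is mine. Only plain SSE shuffles, multiplies, and adds are needed, no SSE3/SSE4:

    #include <xmmintrin.h>

    // v = u0*col0 + u1*col1 + u2*col2 + u3*col3:
    // broadcast each component of u with a shuffle, then multiply-accumulate.
    static inline __m128 mat4_mul_vec4(__m128 col0, __m128 col1,
                                       __m128 col2, __m128 col3, __m128 u) {
        __m128 v = _mm_mul_ps(col0, _mm_shuffle_ps(u, u, _MM_SHUFFLE(0, 0, 0, 0)));
        v = _mm_add_ps(v, _mm_mul_ps(col1, _mm_shuffle_ps(u, u, _MM_SHUFFLE(1, 1, 1, 1))));
        v = _mm_add_ps(v, _mm_mul_ps(col2, _mm_shuffle_ps(u, u, _MM_SHUFFLE(2, 2, 2, 2))));
        v = _mm_add_ps(v, _mm_mul_ps(col3, _mm_shuffle_ps(u, u, _MM_SHUFFLE(3, 3, 3, 3))));
        return v;
    }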

Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

Submitted by 隐身守侯 on 2019-12-17 03:48:21

Question: Let's say the bottleneck of my Java program really is some tight loops computing a bunch of vector dot products. Yes, I've profiled; yes, it's the bottleneck; yes, it's significant; yes, that's just how the algorithm is; yes, I've run ProGuard to optimize the bytecode, etc. The work is, essentially, dot products. As in, I have two float[50] and I need to compute the sum of pairwise products. I know processor instruction sets exist to perform these kinds of operations quickly and in bulk, like SSE
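For reference, this is the shape of code one would hope a vectorizing JIT emits for that loop, written here with SSE intrinsics in C (a sketch; the function name is mine):

    #include <xmmintrin.h>

    // Dot product of two float[50]: SIMD over the first 48 elements,
    // scalar tail for the last two.
    float dot50(const float *a, const float *b) {
        __m128 acc = _mm_setzero_ps();
        int i = 0;
        for (; i + 4 <= 50; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                             _mm_loadu_ps(b + i)));
        // Horizontal sum of the four accumulator lanes.
        __m128 t = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
        t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
        float sum = _mm_cvtss_f32(t);
        for (; i < 50; ++i)
            sum += a[i] * b[i];
        return sum;
    }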

How to implement atoi using SIMD?

Submitted by 人盡茶涼 on 2019-12-17 03:37:54

Question: I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places. If it's a speed gain, multiple atoi results can be computed in parallel. The strings originally come from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling. The algorithm I came up with is the following: I can initialize a vector of length N in the
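As a point of reference, one widely cited SIMD digit-conversion scheme (not necessarily what RapidJSON adopted) subtracts '0' from every byte and then collapses pairs of digits with multiply-add steps. A minimal SSSE3 sketch for exactly eight ASCII digits, with no sign handling or validation (names are mine):

    #include <tmmintrin.h>
    #include <stdint.h>

    // Parse exactly 8 ASCII digits, most significant digit first.
    uint32_t atoi8_sse(const char *p) {
        __m128i v = _mm_loadl_epi64((const __m128i *)p);  // 8 digit bytes
        v = _mm_sub_epi8(v, _mm_set1_epi8('0'));          // ASCII -> 0..9
        // Pairs of digits -> four 16-bit values in 0..99.
        v = _mm_maddubs_epi16(v, _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                                              1, 10, 1, 10, 1, 10, 1, 10));
        // Pairs of those -> two 32-bit values in 0..9999.
        v = _mm_madd_epi16(v, _mm_set_epi16(0, 0, 0, 0, 1, 100, 1, 100));
        uint32_t hi = (uint32_t)_mm_cvtsi128_si32(v);                     // digits 0-3
        uint32_t lo = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4));  // digits 4-7
        return hi * 10000 + lo;
    }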

How to check if a CPU supports the SSE3 instruction set?

Submitted by 夙愿已清 on 2019-12-17 02:26:46

Question: Is the following code valid to check whether a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP (see http://msdn.microsoft.com/en-us/library/ms724482(v=vs.85).aspx).

    bool CheckSSE3() {
        int CPUInfo[4] = {-1};
        //-- Get number of valid info ids
        __cpuid(CPUInfo, 0);
        int nIds = CPUInfo[0];
        //-- Get info for id "1"
        if (nIds >= 1) {
            __cpuid(CPUInfo, 1);
            bool bSSE3NewInstructions = (CPUInfo[2] & 0x1) || false;
            return bSSE3NewInstructions;
        }
        return false;
    }
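On GCC or Clang, an alternative worth knowing (assuming a compiler recent enough to provide the builtin) avoids hand-rolled CPUID parsing entirely:

    #include <stdbool.h>

    // The compiler builtin performs and caches the CPUID query itself.
    bool check_sse3(void) {
        return __builtin_cpu_supports("sse3");
    }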

How to use vindex and scale with _mm_i32gather_epi32 to gather elements? [duplicate]

Submitted by 孤者浪人 on 2019-12-14 03:29:51

Question: This question already has answers here: Load address calculation when using AVX2 gather instructions (3 answers). Closed last year.

Intel's Intrinsics Guide says:

    __m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

And the description: Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are
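A minimal AVX2 example (compile with -mavx2): each loaded address is base_addr plus vindex[i] times scale, so with int elements a scale of 4 (the element size) makes vindex an element index rather than a byte offset:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        int arr[8] = {10, 11, 12, 13, 14, 15, 16, 17};
        __m128i idx = _mm_setr_epi32(0, 2, 4, 6);
        // Loads arr[0], arr[2], arr[4], arr[6]; scale must be 1, 2, 4, or 8,
        // and a compile-time constant.
        __m128i g = _mm_i32gather_epi32(arr, idx, 4);
        int out[4];
        _mm_storeu_si128((__m128i *)out, g);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 10 12 14 16
        return 0;
    }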

Fast register compaction using SSE

Submitted by …衆ロ難τιáo~ on 2019-12-13 18:22:07

Question: I am trying to figure out how to use the SSE intrinsic _mm_shuffle_epi8 to compact a 128-bit register. Let's say I have an input variable __m128i target, which is basically eight 16-bit values, denoted a[0], a[1] ... a[7] (each slot is 16 bits), and my output is called __m128i output. Now I have a bit vector of size 8:

    char bit_mask;  // 8 bits; the i-th bit indicates whether
                    // the corresponding a[i] should be included

OK, how can I get the final result based on the bit_mask and the input target? Assume my bit vector is
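A common pattern for this (a sketch under my own naming, not necessarily the answer the thread settled on) is to precompute one pshufb control mask per possible bit_mask value, then apply it with a single _mm_shuffle_epi8:

    #include <tmmintrin.h>
    #include <stdint.h>

    static __m128i pack_table[256];  // one shuffle control per 8-bit mask

    // Build the table once: selected 16-bit lanes move to the front, remaining
    // bytes are zeroed (pshufb zeroes a byte when the control's high bit is set).
    void init_pack_table(void) {
        for (int m = 0; m < 256; ++m) {
            uint8_t ctl[16];
            int out = 0;
            for (int lane = 0; lane < 8; ++lane) {
                if (m & (1 << lane)) {
                    ctl[2 * out]     = (uint8_t)(2 * lane);
                    ctl[2 * out + 1] = (uint8_t)(2 * lane + 1);
                    ++out;
                }
            }
            for (int j = 2 * out; j < 16; ++j)
                ctl[j] = 0x80;
            pack_table[m] = _mm_loadu_si128((const __m128i *)ctl);
        }
    }

    __m128i compact_epi16(__m128i target, uint8_t bit_mask) {
        return _mm_shuffle_epi8(target, pack_table[bit_mask]);
    }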

ARM Neon: conditional store suggestion

Submitted by 痞子三分冷 on 2019-12-13 18:12:23

Question: I'm trying to figure out how to generate a conditional store in ARM NEON. What I would like to do is the equivalent of this SSE instruction:

    void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p);

which conditionally stores byte elements of d to address p: the high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. Any suggestions on how to do it with NEON intrinsics? Thank you. This is what I did:

    int8x16_t store_mask = {0,0,0,0,0,0,0xff,0xff,0xff
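NEON has no per-byte masked store instruction, so a common substitute is a read-modify-write with a bitwise select. A minimal sketch (function name mine); note that unlike _mm_maskmoveu_si128 it writes all 16 bytes at p, so the whole range must be writable:

    #include <arm_neon.h>
    #include <stdint.h>

    void neon_maskmove(uint8x16_t d, uint8x16_t n, uint8_t *p) {
        // Expand the high bit of each byte of n into a full 0x00/0xFF mask.
        uint8x16_t mask = vcltq_s8(vreinterpretq_s8_u8(n), vdupq_n_s8(0));
        uint8x16_t old  = vld1q_u8(p);
        vst1q_u8(p, vbslq_u8(mask, d, old));  // per bit: mask ? d : old
    }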

Compare the sign bit in SSE Intrinsics

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-13 16:52:43

Question: How would one create a mask using SSE intrinsics which indicates whether the signs of two packed float vectors (__m128s) are the same? For example, comparing a and b where a is [1.0 -1.0 0.0 2.0] and b is [1.0 1.0 1.0 1.0], the desired mask would be [true false true true].

Answer 1: Here's one solution:

    const __m128i MASK = _mm_set1_epi32(0xffffffff);
    __m128 a = _mm_setr_ps(1,-1,0,2);
    __m128 b = _mm_setr_ps(1,1,1,1);
    __m128 f = _mm_xor_ps(a,b);
    __m128i i = _mm_castps_si128(f);
    i = _mm_srai_epi32(i, 31);   // broadcast the sign bit: all-ones where signs differ
    i = _mm_xor_si128(i, MASK);  // invert: all-ones where signs match
    __m128 result = _mm_castsi128_ps(i);
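An equivalent formulation as a reusable helper (a sketch; the function name is mine) compares the broadcast sign bits directly:

    #include <emmintrin.h>

    // All-ones lane where a and b have the same sign bit, zero otherwise.
    static inline __m128 same_sign(__m128 a, __m128 b) {
        __m128i sa = _mm_srai_epi32(_mm_castps_si128(a), 31);
        __m128i sb = _mm_srai_epi32(_mm_castps_si128(b), 31);
        return _mm_castsi128_ps(_mm_cmpeq_epi32(sa, sb));
    }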

_MM_TRANSPOSE4_PS causes compiler errors in GCC?

Submitted by 匆匆过客 on 2019-12-13 14:30:38

Question: I'm compiling my math library with GCC instead of MSVC for the first time and going through all the little errors, and I've hit one that simply makes no sense:

    Line 284: error: lvalue required as left operand of assignment

What's on line 284? This:

    _MM_TRANSPOSE4_PS(r, u, t, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));

(r, u, and t are all instances of __m128.) Those familiar with xmmintrin.h will be aware that _MM_TRANSPOSE4_PS isn't actually a function but a macro, which expands to: /*
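The likely cause: the macro assigns the transposed rows back to all four of its arguments, so each argument must be an lvalue, and the _mm_setr_ps(...) temporary is not one. A minimal workaround (variable name mine):

    // Materialize the constant row in a named variable so the macro's
    // write-back has an lvalue to assign to.
    __m128 w = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);
    _MM_TRANSPOSE4_PS(r, u, t, w);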