sse | 易学教程

Slower SSE performance on large array sizes

Question: I am new to SSE programming, so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below. int ssum(const int *d, unsigned int len) { static const unsigned int BLOCKSIZE=4; unsigned int i,remainder; int output; __m128i xmm0, accumulator; __m128i* src; remainder = len%BLOCKSIZE; src = (__m128i*)d; accumulator = _mm_loadu_si128(src); output = 0; for(i
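The excerpt above is cut off before the loop body, so the asker's full ssum is not shown. As a minimal sketch of the same idea (summing 32-bit integers four at a time with SSE2), the following may help; sum_i32_sse2 is a hypothetical name, not the asker's function, and it assumes an x86 target with SSE2 and a sum that does not overflow int.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical helper (not the asker's ssum): sums len 32-bit integers.
// Assumes the total fits in an int.
int sum_i32_sse2(const int* d, std::size_t len) {
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    // Process four integers per iteration; unaligned loads accept any pointer.
    for (; i + 4 <= len; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(d + i));
        acc = _mm_add_epi32(acc, v);
    }
    // Horizontal reduction: spill the four lanes and add them.
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    // Scalar tail for the remaining len % 4 elements.
    for (; i < len; ++i) total += d[i];
    return total;
}
```

Starting the accumulator at zero, rather than preloading the first block, keeps the function safe for arrays shorter than four elements.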

An accumulated computing error in SSE version of algorithm of the sum of squared differences

Question: I was trying to optimize the following code (sum of squared differences for two arrays): inline float Square(float value) { return value*value; } float SquaredDifferenceSum(const float * a, const float * b, size_t size) { float sum = 0; for(size_t i = 0; i < size; ++i) sum += Square(a[i] - b[i]); return sum; } So I optimized it using the CPU's SSE instructions: inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum) { __m128 _a = _mm_loadu_ps(a +
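The asker's SSE version is truncated above. A minimal sketch of a vectorized sum of squared differences might look like the following; squared_difference_sum_sse is a hypothetical name and the structure is an assumption, not the asker's code. Note that the four lanes accumulate partial sums in a different order than the scalar loop, so for general float inputs the two results can differ in the low bits.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical helper: sum of (a[i] - b[i])^2 over size elements.
float squared_difference_sum_sse(const float* a, const float* b, std::size_t size) {
    __m128 acc = _mm_setzero_ps();  // four independent partial sums
    std::size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 d = _mm_sub_ps(va, vb);
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    // Combine the four partial sums.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    // Scalar tail for the remaining size % 4 elements.
    for (; i < size; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```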

How can I set __m128i without using of any SSE instruction?

Question: I have many functions that use the same constant __m128i values. For example: const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16); const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8); const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4); So I want to store all these constants in one place. But there is a problem: I check which CPU extensions are available at run time. If the CPU doesn't support, for example, SSE (or AVX), then there will be a program crash
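One possible approach, sketched here as an assumption rather than as the answer given on the original page, is to keep the constants as plain aligned arrays and load them only inside functions that run after the CPU check has passed. Initializing an integer array needs no SSE instruction. The names K8_BYTES, load_k8 and so on are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Plain data: initializing these arrays executes no SSE instruction.
alignas(16) static const std::int8_t K8_BYTES[16] =
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
alignas(16) static const std::int16_t K16_WORDS[8] = {1, 2, 3, 4, 5, 6, 7, 8};
alignas(16) static const std::int32_t K32_DWORDS[4] = {1, 2, 3, 4};

// Hypothetical accessors: call these only from code paths that run
// after the run-time check has confirmed SSE2 support.
inline __m128i load_k8() {
    return _mm_load_si128(reinterpret_cast<const __m128i*>(K8_BYTES));
}
inline __m128i load_k16() {
    return _mm_load_si128(reinterpret_cast<const __m128i*>(K16_WORDS));
}
inline __m128i load_k32() {
    return _mm_load_si128(reinterpret_cast<const __m128i*>(K32_DWORDS));
}
```

The arrays are 16-byte aligned so the aligned load is valid; with unaligned storage, _mm_loadu_si128 would be the safer choice.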

Initializing an __m128 type from a 64-bit unsigned int

Question: The _mm_set_epi64 and similar *_epi64 instructions seem to use and depend on __m64 types. I want to initialize a variable of type __m128 such that the upper 64 bits of it are 0, and the lower 64 bits of it are set to x, where x is of type uint64_t (or a similar unsigned 64-bit type). What's the "right" way of doing so? Preferably, this should be done in a compiler-independent manner. Answer 1: To answer your question about how to load a 64-bit value into the lower 64 bits of an XMM register while
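The answer is truncated above, so its exact recommendation is not visible. As a hedged sketch, one SSE2 way to get x into the low 64 bits with the high 64 bits zeroed is _mm_loadl_epi64, which avoids __m64 entirely; the helper names below are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: low 64 bits = x, high 64 bits = 0.
inline __m128i low64_from_u64(std::uint64_t x) {
    // _mm_loadl_epi64 reads 64 bits from memory into the low half
    // and zeroes the high half of the result.
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&x));
}

// Same bits viewed as __m128 (the float vector type named in the question).
// The cast only reinterprets the bits; it performs no conversion.
inline __m128 low64_from_u64_ps(std::uint64_t x) {
    return _mm_castsi128_ps(low64_from_u64(x));
}
```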

vector of __mm128 won't push_back()

Question: This simple SSE code: #include <vector> #include <emmintrin.h> int main() { std::vector<__m128> blah; blah.push_back(__m128()); } crashes on MSVC 10 with a segfault at 0xffffffff. What could be going wrong? Answer 1: A std::vector does not allocate specially aligned memory, which __m128 needs to store its data. You will have to either swap out the allocator, or replace it with an array of 4 floats and then perform an unaligned load or copy out to an aligned location every time you access the
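The "swap out the allocator" option from the answer can be sketched as follows. Aligned16Allocator and M128Vector are hypothetical names; the sketch assumes a C++11 compiler and that _mm_malloc and _mm_free are reachable through xmmintrin.h, as they are with GCC, Clang and MSVC.

```cpp
#include <xmmintrin.h>  // __m128, _mm_malloc, _mm_free
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical minimal allocator that hands out 16-byte-aligned storage.
template <typename T>
struct Aligned16Allocator {
    using value_type = T;

    Aligned16Allocator() = default;
    template <typename U>
    Aligned16Allocator(const Aligned16Allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = _mm_malloc(n * sizeof(T), 16);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { _mm_free(p); }
};

// All instances are interchangeable because the allocator holds no state.
template <typename T, typename U>
bool operator==(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) noexcept {
    return true;
}
template <typename T, typename U>
bool operator!=(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) noexcept {
    return false;
}

// A vector of __m128 whose buffer is always 16-byte aligned.
using M128Vector = std::vector<__m128, Aligned16Allocator<__m128>>;
```

With this alias, `M128Vector blah; blah.push_back(_mm_setzero_ps());` stores each element in aligned memory.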

__m256d TRANSPOSE4 Equivalent?

Question: Intel has included _MM_TRANSPOSE4_PS to transpose a 4x4 matrix of vectors. I want to do the equivalent with __m256d. However, I can't seem to figure out how to get _mm256_shuffle_pd to work in the same manner. _MM_TRANSPOSE4_PS Code #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) { \ __m128 tmp3, tmp2, tmp1, tmp0; \ \ tmp0 = _mm_shuffle_ps((row0), (row1), 0x44); \ tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE); \ tmp1 = _mm_shuffle_ps((row2), (row3), 0x44); \ tmp3 = _mm_shuffle_ps((row2), (row3),

How many cycle does need for put a data into SIMD register?

Question: I'm a student learning the x86 and ARM architectures, and I was wondering how many cycles it takes to put multiple data items into SIMD registers. I understand that x86 SSE's XMM registers are 128 bits wide. What if I want to put 32 8-bit values into one of the XMM registers from the stack via the SIMD instruction set in assembly language: does it take the same number of cycles as a general-purpose register's PUSH/POP, or does it need 32x the time, once for each 8-bit value? Thank

Finding the most frequently occurring element in an SSE register

Question: Does anyone have any thoughts on how to calculate the mode (statistic) of a vector of 8-bit integers in SSE4.x? To clarify, this would be 16x8-bit values in a 128-bit register. I want the result as a vector mask which selects the mode-valued elements, i.e. the result of _mm_cmpeq_epi8(v, set1(mode(v))), as well as the scalar value. To provide some additional context: while the above problem is an interesting one to solve in its own right, I have been through most algorithms I can think of

128-bit values - From XMM registers to General Purpose

Question: I have a couple of questions related to moving XMM values to general-purpose registers. All the questions found on SO focus on the opposite, namely transferring values in GP registers to XMM. How can I move an XMM register value (128-bit) to two 64-bit general-purpose registers? movq RAX XMM1 ; 0th bit to 63rd bit mov? RCX XMM1 ; 64th bit to 127th bit Similarly, how can I move an XMM register value (128-bit) to four 32-bit general-purpose registers? movd EAX XMM1 ; 0th bit to 31st bit mov? ECX
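The question is asked in assembly, and the excerpt ends before any answer. As a rough sketch in intrinsics, using SSE2 only, each upper part can be moved down into the low lane first and then extracted with the same low-lane move. U64Pair, split_u64 and split_u32 are hypothetical names, and _mm_cvtsi128_si64 assumes a 64-bit x86 target.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

struct U64Pair {  // hypothetical holder for the two halves
    std::uint64_t lo;
    std::uint64_t hi;
};

// Split a 128-bit value into two 64-bit integers (x86-64 only).
inline U64Pair split_u64(__m128i v) {
    U64Pair r;
    // Bits 0..63: direct move from the low lane.
    r.lo = static_cast<std::uint64_t>(_mm_cvtsi128_si64(v));
    // Bits 64..127: copy the high half into the low half, then move.
    __m128i high = _mm_unpackhi_epi64(v, v);
    r.hi = static_cast<std::uint64_t>(_mm_cvtsi128_si64(high));
    return r;
}

// Split a 128-bit value into four 32-bit integers.
inline void split_u32(__m128i v, std::uint32_t out[4]) {
    // Shift the register right by 4, 8 and 12 bytes so that each
    // 32-bit lane in turn lands in the low position.
    out[0] = static_cast<std::uint32_t>(_mm_cvtsi128_si32(v));
    out[1] = static_cast<std::uint32_t>(_mm_cvtsi128_si32(_mm_srli_si128(v, 4)));
    out[2] = static_cast<std::uint32_t>(_mm_cvtsi128_si32(_mm_srli_si128(v, 8)));
    out[3] = static_cast<std::uint32_t>(_mm_cvtsi128_si32(_mm_srli_si128(v, 12)));
}
```

On CPUs with SSE4.1, _mm_extract_epi64 and _mm_extract_epi32 read an upper lane in a single intrinsic; the version above avoids that requirement.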