sse | 易学教程

Most performant way to subtract one array from another


Question: I have the following code, which is the bottleneck in one part of my application. All I do is subtract one array from another. Both of these arrays have around 100000 elements. I'm trying to find a way to make this more performant. var Array1, Array2 : array of integer; ..... // Code that fills the arrays ..... for ix := 0 to length(array1)-1 do begin Array1[ix] := Array1[ix] - Array2[ix]; end; Does anybody have a suggestion? Answer 1: I was very curious about speed optimisation in this simple case. So
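For illustration only, the element-wise subtraction from the question can be sketched with SSE2 intrinsics. The question's code is Delphi; this sketch is C, and the function name `subtract_arrays` is hypothetical, not taken from the question or its answers. It handles four 32-bit integers per iteration and finishes any leftover elements with a scalar loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes a[i] -= b[i] for n 32-bit integers. */
void subtract_arrays(int32_t *a, const int32_t *b, size_t n) {
    size_t i = 0;
    /* Vector part: four integers per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(a + i), _mm_sub_epi32(va, vb));
    }
    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i) {
        a[i] -= b[i];
    }
}
```

Whether this beats what an optimizing compiler already generates for the plain loop is something to measure rather than assume.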

An accumulated computing error in SSE version of algorithm of the sum of squared differences


I was trying to optimize the following code (sum of squared differences for two arrays): inline float Square(float value) { return value*value; } float SquaredDifferenceSum(const float * a, const float * b, size_t size) { float sum = 0; for(size_t i = 0; i < size; ++i) sum += Square(a[i] - b[i]); return sum; } So I optimized it using the CPU's SSE instructions: inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum) { __m128 _a = _mm_loadu_ps(a + i); __m128 _b = _mm_loadu_ps(b + i); __m128 _d = _mm_sub_ps(_a, _b); sum = _mm_add_ps(sum, _mm_mul_ps(
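Since the excerpt's SSE code is cut off, here is a self-contained sketch of the same idea in C. It is not the asker's full code: the name `squared_difference_sum_sse` and the way the four lanes are reduced at the end are assumptions. Note that the vector version accumulates four partial sums and adds them in a different order than the scalar loop, so the two results can differ in the low bits for large inputs.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: sum of (a[i] - b[i])^2 over size elements. */
float squared_difference_sum_sse(const float *a, const float *b, size_t size) {
    __m128 acc = _mm_setzero_ps();  /* four independent partial sums */
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    /* Horizontal reduction of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < size; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```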

How to speed up calculation of integral image?


I often need to calculate an integral image. This is a simple algorithm: void integral_sum(const uint8_t * src, size_t src_stride, size_t width, size_t height, uint32_t * sum, size_t sum_stride) { memset(sum, 0, (width + 1) * sizeof(uint32_t)); sum += sum_stride + 1; for (size_t row = 0; row < height; row++) { uint32_t row_sum = 0; sum[-1] = 0; for (size_t col = 0; col < width; col++) { row_sum += src[col]; sum[col] = row_sum + sum[col - sum_stride]; } src += src_stride; sum += sum_stride; } } And I have a question: can I speed up this algorithm (for example, by using SSE or AVX)?

Shuffle even and odd values in SSE register


I load two 128-bit SSE registers with 16-bit values. The values are in the following order: src[0] = [E_3, O_3, E_2, O_2, E_1, O_1, E_0, O_0] src[1] = [E_7, O_7, E_6, O_6, E_5, O_5, E_4, O_4] What I want to achieve is an order like this: src[0] = [E_7, E_6, E_5, E_4, E_3, E_2, E_1, E_0] src[1] = [O_7, O_6, O_5, O_4, O_3, O_2, O_1, O_0] Do you know if there is a good way to do this (using SSE intrinsics up to SSE 4.2)? I'm stuck at the moment, because I can't shuffle 16-bit values between the upper and lower halves of the 128-bit register. I found only the _mm_shufflelo_epi16 and _mm_shufflehi
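One possible way to get the layout described above with SSE2 only is sketched below in C. It assumes the question's register diagrams list elements from the highest word down to the lowest, so O_0 is word 0 and E_0 is word 1. The helper names `split_even_odd_words` and `deinterleave_epi16` are hypothetical. Each register is first sorted within its halves, then a 32-bit shuffle moves data across the 64-bit boundary, and finally 64-bit unpacks combine the two registers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: moves words 0,2,4,6 into the low 64 bits
   and words 1,3,5,7 into the high 64 bits of one register. */
static __m128i split_even_odd_words(__m128i v) {
    /* Within each 64-bit half, reorder words [w0,w1,w2,w3] -> [w0,w2,w1,w3]. */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(3, 1, 2, 0));
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(3, 1, 2, 0));
    /* Swap the two middle 32-bit lanes to cross the 64-bit boundary. */
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 1, 2, 0));
}

/* On return, src[0] holds E_0..E_7 and src[1] holds O_0..O_7 (lowest word first). */
void deinterleave_epi16(__m128i src[2]) {
    __m128i a = split_even_odd_words(src[0]);
    __m128i b = split_even_odd_words(src[1]);
    src[0] = _mm_unpackhi_epi64(a, b);  /* odd-index words: the E values */
    src[1] = _mm_unpacklo_epi64(a, b);  /* even-index words: the O values */
}
```

With SSSE3 available, a single byte shuffle per register could replace the three shuffles in the helper; the SSE2 form is shown because it needs no extra compiler flags on x86-64.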

What's the difference among the CFLAGS SSE options -msse, -msse2, -mssse3, -msse4, etc., and how to determine which to use?


Question: For the GCC CFLAGS options -msse, -msse2, -mssse3, -msse4, -msse4.1, -msse4.2: are they mutually exclusive, or can they be used together? My understanding is that choosing which to set depends on whether the target architecture the program will run on supports it. Is this correct? If so, how can I find out which SSE levels my target architecture supports? On Linux I can cat /proc/cpuinfo, but what about Mac or Windows? Thanks! Answer 1: The -m switches can be used together; furthermore, some of them
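As a sketch of one way to check SSE support from inside a program, GCC and Clang provide the `__builtin_cpu_supports` builtin on x86 targets. The function name `highest_sse_level` below is hypothetical. This reports what the CPU running the program supports, which only answers the question when that machine is also the deployment target.

```c
/* Requires GCC or Clang targeting x86 or x86-64. */

/* Hypothetical helper: returns the name of the highest SSE level
   that the CPU running this code reports as supported. */
const char *highest_sse_level(void) {
    __builtin_cpu_init();  /* harmless here; required before use in constructors */
    if (__builtin_cpu_supports("sse4.2")) return "sse4.2";
    if (__builtin_cpu_supports("sse4.1")) return "sse4.1";
    if (__builtin_cpu_supports("ssse3"))  return "ssse3";
    if (__builtin_cpu_supports("sse3"))   return "sse3";
    if (__builtin_cpu_supports("sse2"))   return "sse2";
    if (__builtin_cpu_supports("sse"))    return "sse";
    return "none";
}
```

Printing the result, for example with `puts(highest_sse_level())`, gives a quick check that does not depend on /proc/cpuinfo.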

The best way to shift a __m128i?


I need to shift a __m128i variable (say v) by m bits, in such a way that bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this? Note that _mm_slli_epi64 shifts v0 and v1 separately: r0 := v0 << count r1 := v1 << count so the top bits of v0 are lost, but I want to move those bits into r1. Edit: I'm looking for code faster than this (m<64): r0 = v0 << m; r1 = v0 >> (64-m); r1 ^= v1 << m; r2 = v1 >> (64-m); For compile-time constant shift counts, you can get fairly good results. Otherwise not really. This is just an SSE
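The scalar steps in the question (shift both halves, then carry the top bits of the low half into the high half) can be sketched with SSE2 intrinsics as follows. This is an illustrative C sketch, not the answer's code, which is cut off in the excerpt; the name `shift_left_128` is hypothetical, and it assumes 0 <= m < 64 as in the question.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: shifts the full 128-bit value left by m bits, 0 <= m < 64. */
__m128i shift_left_128(__m128i v, int m) {
    __m128i count = _mm_cvtsi32_si128(m);
    __m128i inverse = _mm_cvtsi32_si128(64 - m);
    /* Shift each 64-bit half left independently; the top bits of the low half are lost here. */
    __m128i shifted = _mm_sll_epi64(v, count);
    /* Move the low half into the high position, then keep only the bits
       that crossed the 64-bit boundary. A count of 64 (m == 0) yields zero. */
    __m128i carry = _mm_srl_epi64(_mm_slli_si128(v, 8), inverse);
    return _mm_or_si128(shifted, carry);
}
```

The byte shift `_mm_slli_si128` takes a compile-time constant, which is fine here because it is always 8; only the bit counts vary at run time.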

Should I use SIMD or vector extensions or something else?


Question: I'm currently developing an open source 3D application framework in C++ (with C++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but more about that in a different question. Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate highly optimized code when optimization is on. I can also use the "vector extension" of GCC. But this all
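To make the "vector extension" option mentioned in the question concrete, here is a small C sketch using GCC's `vector_size` attribute (also accepted by Clang). The type name `v4sf` and the function `scale_and_add` are hypothetical. The compiler picks the instructions itself, so the same source can map to SSE on x86 or to another SIMD instruction set elsewhere.

```c
/* GCC/Clang vector extension: four floats packed into 16 bytes. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: returns a * s + b, computed lane by lane. */
v4sf scale_and_add(v4sf a, v4sf b, float s) {
    v4sf vs = {s, s, s, s};  /* broadcast the scalar to all four lanes */
    return a * vs + b;       /* ordinary operators work element-wise */
}
```

Individual lanes can be read with subscripts, for example `r[0]`, which keeps calling code free of intrinsics.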

SIMD (SSE) instruction for division in GCC


I'd like to optimize the following snippet using SSE instructions if possible: /* * the data structure */ typedef struct v3d v3d; struct v3d { double x; double y; double z; } tmp = { 1.0, 2.0, 3.0 }; /* * the part that should be "optimized" */ tmp.x /= 4.0; tmp.y /= 4.0; tmp.z /= 4.0; Is this possible at all? I've used SIMD extensions under Windows, but not yet under Linux. That being said, you should be able to take advantage of the DIVPS SSE operation, which divides one 4-float vector by another 4-float vector. But you are using doubles, so you'll want the SSE2 version, DIVPD. I almost
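As a sketch of what using DIVPD through intrinsics might look like for this struct, the C code below divides x and y with one packed division and z with a scalar division. The function name `v3d_div` is hypothetical, and the excerpt does not show which approach the answer finally recommends.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

typedef struct v3d {
    double x;
    double y;
    double z;
} v3d;

/* Hypothetical helper: divides all three components of *v by d. */
void v3d_div(v3d *v, double d) {
    __m128d divisor = _mm_set1_pd(d);
    /* Pack x (low lane) and y (high lane) and divide both with one DIVPD. */
    __m128d xy = _mm_div_pd(_mm_set_pd(v->y, v->x), divisor);
    _mm_storel_pd(&v->x, xy);
    _mm_storeh_pd(&v->y, xy);
    /* z does not fill a second pair, so use the scalar form (DIVSD). */
    __m128d z = _mm_div_sd(_mm_load_sd(&v->z), divisor);
    _mm_store_sd(&v->z, z);
}
```

For only three doubles, the packing overhead may cancel any gain, so it is worth comparing against the plain scalar code.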

SSE 4 popcount for 16 8-bit values?


I have the following code, which compiles with GCC using the flag -msse4, but the problem is that the pop count only covers the last four 8-bit values of the converted __m128i type. Basically what I want is to count all 16 numbers inside the __m128i type, but I'm not sure what intrinsic function call to make after creating the variable popA. Somehow popA has to be converted into an integer that contains all 128 bits of information? I suppose there's _mm_cvtsi128_si64 and using a few shuffle operations, but my OS is 32-bit. Is there only the shuffle method and using _mm_cvtsi128_si32? EDIT: If the
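The question's code is not visible in the excerpt, so the following C sketch only illustrates one way to count the set bits across all 128 bits using 32-bit pieces, which also works on a 32-bit OS. The name `popcount_128` is hypothetical. It uses the GCC/Clang builtin `__builtin_popcount`, which the compiler can turn into the POPCNT instruction when flags such as -msse4 or -mpopcnt are given.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: number of set bits in all 128 bits of v. */
int popcount_128(__m128i v) {
    unsigned int lanes[4];
    /* Spill the register to memory as four 32-bit words. */
    _mm_storeu_si128((__m128i *)lanes, v);
    return __builtin_popcount(lanes[0]) + __builtin_popcount(lanes[1]) +
           __builtin_popcount(lanes[2]) + __builtin_popcount(lanes[3]);
}
```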

SSE Bilinear interpolation


I'm implementing bilinear interpolation in a tight loop and trying to optimize it with SSE, but I get zero speed-up from it. Here is the code, the non-SIMD version uses a simple vector structure which could be defined as struct Vec3f { float x, y, z; } with implemented multiplication and addition operators: #ifdef USE_SIMD const Color c11 = pixelCache[y1 * size.x + x1]; const Color c12 = pixelCache[y2 * size.x + x1]; const Color c22 = pixelCache[y2 * size.x + x2]; const Color c21 = pixelCache[y1 * size.x + x2]; __declspec(align(16)) float mc11[4] = { 1.0, c11.GetB(), c11.GetG(), c11.GetR() };