sse

accessing __m128 fields across compilers

我的未来我决定 submitted on 2019-12-06 03:56:04
I've noticed that accessing __m128 fields by index is possible in GCC, without using the union trick: __m128 t; float r(t[0] + t[1] + t[2] + t[3]); I can also load a __m128 just like an array: __m128 t{1.f, 2.f, 3.f, 4.f}; This is all in line with GCC's vector extensions. These, however, may not be available elsewhere. Are the loading and accessing features supported by the Intel compiler and MSVC? To load a __m128, you can write _mm_setr_ps(1.f, 2.f, 3.f, 4.f), which is supported by GCC, ICC, MSVC and Clang. So far as I know, Clang and recent versions of GCC support accessing __m128
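A portable way to cover both features, sketched here with a hypothetical hsum_ps helper (the name is ours): _mm_setr_ps for the array-style load, and a store to a temporary array for indexed access; both compile on GCC, Clang, MSVC and ICC.

```cpp
#include <xmmintrin.h>

// Portable horizontal sum: store to a temporary array instead of
// relying on GCC's t[i] vector-extension indexing.
static float hsum_ps(__m128 t) {
    float a[4];
    _mm_storeu_ps(a, t);            // supported everywhere SSE is
    return a[0] + a[1] + a[2] + a[3];
}

// Portable "array-style" load, equivalent to GCC's __m128 t{1,2,3,4}:
static __m128 load1234() {
    return _mm_setr_ps(1.f, 2.f, 3.f, 4.f);
}
```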

How to speed up calculation of integral image?

烈酒焚心 submitted on 2019-12-06 03:22:21
Question: I often need to calculate the integral image. This is a simple algorithm: void integral_sum(const uint8_t * src, size_t src_stride, size_t width, size_t height, uint32_t * sum, size_t sum_stride) { memset(sum, 0, (width + 1) * sizeof(uint32_t)); sum += sum_stride + 1; for (size_t row = 0; row < height; row++) { uint32_t row_sum = 0; sum[-1] = 0; for (size_t col = 0; col < width; col++) { row_sum += src[col]; sum[col] = row_sum + sum[col - sum_stride]; } src += src_stride; sum += sum_stride
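The declaration as posted had two return types (uint32_t void); here is a corrected, compilable version of the question's scalar routine, with the truncated closing braces restored along the obvious pattern. Note for optimisation: the per-row running sum is a serial dependency, which is why naive vectorisation of this loop rarely helps.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Integral image with a zeroed top/left border row and column.
// sum must point to a (height+1) x sum_stride buffer, sum_stride >= width+1.
void integral_sum(const uint8_t* src, size_t src_stride,
                  size_t width, size_t height,
                  uint32_t* sum, size_t sum_stride) {
    memset(sum, 0, (width + 1) * sizeof(uint32_t));  // zero the border row
    sum += sum_stride + 1;                           // skip border row/column
    for (size_t row = 0; row < height; row++) {
        uint32_t row_sum = 0;
        sum[-1] = 0;                                 // zero the border column
        for (size_t col = 0; col < width; col++) {
            row_sum += src[col];
            sum[col] = row_sum + sum[col - sum_stride];
        }
        src += src_stride;
        sum += sum_stride;
    }
}
```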

Determine cause of segfault when using -O3?

我只是一个虾纸丫 submitted on 2019-12-06 02:51:21
Question: I'm having trouble determining the cause of a segfault when a program is compiled with -O3 with GCC 4.8/4.9/5.1. For GCC 4.9.x, I've seen it on Cygwin, Debian 8 (x64) and Fedora 21 (x64). Others have experienced it on GCC 4.8 and 5.1. The program is fine under -O2, fine with other versions of GCC, and fine under other compilers (like MSVC, ICC and Clang). Below is the crash under GDB, but nothing is jumping out at me. The source code from misc.cpp:26 is below, but it's a simple XOR: ((word64*
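The snippet cuts off at the ((word64*) cast, but that cast pattern is the usual suspect: at -O3 GCC may auto-vectorise the loop with aligned SSE stores, which fault if the pointer is not actually 16-byte aligned (and the cast also breaks strict aliasing). This is a generic sketch of the safe memcpy idiom, not the question's actual source:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// XOR a byte buffer 64 bits at a time without casting the byte pointer
// to uint64_t*. memcpy makes alignment and aliasing well-defined, and
// compilers lower these fixed-size copies to plain loads/stores.
void xor64(unsigned char* buf, const unsigned char* mask, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t a, b;
        std::memcpy(&a, buf + i, 8);
        std::memcpy(&b, mask + i, 8);
        a ^= b;
        std::memcpy(buf + i, &a, 8);
    }
}
```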

The best way to shift a __m128i?

a 夏天 submitted on 2019-12-06 02:40:18
Question: I need to shift a __m128i variable (say v) by m bits, in such a way that bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this? Note that _mm_slli_epi64 shifts v0 and v1 separately: r0 := v0 << count r1 := v1 << count so the high bits of v0 are lost, but I want to move those bits into r1. Edit: I'm looking for code faster than this (m < 64): r0 = v0 << m; r1 = v0 >> (64-m); r1 ^= v1 << m; r2 = v1 >> (64-m); Answer 1: For compile-time
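For run-time m with 0 < m < 64, the scalar recipe from the edit maps directly onto SSE2: shift both lanes left, then recover the carry bits by moving v0 into the high lane with a byte shift and shifting it right. A sketch (the function name is ours):

```cpp
#include <emmintrin.h>
#include <cstdint>

// Full 128-bit left shift by m bits, 0 < m < 64, carrying bits across
// the 64-bit lane boundary: r0 = v0 << m, r1 = (v1 << m) | (v0 >> (64-m)).
static __m128i sll128(__m128i v, int m) {
    __m128i hi    = _mm_slli_epi64(v, m);              // each lane << m
    __m128i carry = _mm_srli_epi64(_mm_slli_si128(v, 8), 64 - m);
    return _mm_or_si128(hi, carry);                    // merge carry into r1
}
```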

QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE…AVX

泪湿孤枕 submitted on 2019-12-06 02:09:45
I would like to know if the following is possible in any of the SIMD families of instructions. I have a qword input with 63 significant bits (never negative). Each sequential 7 bits starting from the LSB is shuffle-aligned to a byte, with a left-padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake. The result is only the significant bytes, thus 0 - 9 in size, which is converted to a byte array. In: 0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ|IHGFEDC|BAzyxwv|utsrqpo|nmlkjih|gfedcba Out: 0kjihgfe|1dcbaZYX|1WVUTSRQ|1PONMLKJ|1IHGFEDC|1BAzyxwv
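As a scalar reference to pin down the mapping (it is essentially LEB128/varint encoding: a continuation 1 on every byte except the most significant non-zero one), with the SIMD/BMI2 route typically built on _pdep_u64 with mask 0x7F7F...7F following the same layout:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar reference: emit x as 7-bit groups, LSB group first; every byte
// except the last (most significant) carries a leading 1. Returns the
// number of bytes written (1..9 for a 63-bit input).
static size_t pack7(uint64_t x, uint8_t out[9]) {
    size_t n = 0;
    while (x > 0x7F) {
        out[n++] = (uint8_t)(x & 0x7F) | 0x80;  // left-pad with 1
        x >>= 7;
    }
    out[n++] = (uint8_t)x;                      // top byte: 0 pad
    return n;
}
```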

Why is my straightforward quaternion multiplication faster than SSE?

廉价感情. submitted on 2019-12-06 01:24:43
Question: I've been going through a few different quaternion multiplication implementations, but I've been rather surprised to see that the reference implementation is, so far, my fastest. This is the implementation in question: inline static quat multiply(const quat& lhs, const quat& rhs) { return quat((lhs.w * rhs.x) + (lhs.x * rhs.w) + (lhs.y * rhs.z) - (lhs.z * rhs.y), (lhs.w * rhs.y) + (lhs.y * rhs.w) + (lhs.z * rhs.x) - (lhs.x * rhs.z), (lhs.w * rhs.z) + (lhs.z * rhs.w) + (lhs.x * rhs.y) - (lhs.y
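The snippet is cut off before the z term finishes and the w component starts; completing it with the standard Hamilton product (assuming the question's quat holds x, y, z, w floats):

```cpp
struct quat {
    float x, y, z, w;
};

// Reference scalar Hamilton product in the question's layout; the
// z and w rows are restored from the standard formula.
inline static quat multiply(const quat& lhs, const quat& rhs) {
    return quat{
        (lhs.w * rhs.x) + (lhs.x * rhs.w) + (lhs.y * rhs.z) - (lhs.z * rhs.y),
        (lhs.w * rhs.y) + (lhs.y * rhs.w) + (lhs.z * rhs.x) - (lhs.x * rhs.z),
        (lhs.w * rhs.z) + (lhs.z * rhs.w) + (lhs.x * rhs.y) - (lhs.y * rhs.x),
        (lhs.w * rhs.w) - (lhs.x * rhs.x) - (lhs.y * rhs.y) - (lhs.z * rhs.z)};
}
```

Four independent multiply-add chains like this are easy for the compiler to schedule, which is one reason the scalar version can beat a shuffle-heavy SSE version.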

Add a constant value to a xmm register in x86

谁说我不能喝 submitted on 2019-12-06 01:24:00
How would I add 1 or 2 to the register xmm0 (double)? I can do it like this, but surely there must be an easier way: movsd xmm0, [ecx] xor eax, eax inc eax cvtsi2sd xmm1, eax addsd xmm0, xmm1 movsd [ecx], xmm0 Also, would it be possible to do this with the x87 floating point instructions? This doesn't work for me: fld dword ptr [ecx] fld1 faddp fstp dword ptr [ecx] You can keep a constant in memory or in another register: _1 dq 1.0 and addsd xmm1, [_1] or movsd xmm0, [_1] addsd xmm1, xmm0 If you are on x64, you can do this: mov rax, 1.0 movq xmm0, rax addsd xmm1, xmm0 or use the stack if the type
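The same operation at the intrinsics level, where the compiler materialises the constant from memory for you. (As an aside, the x87 attempt above fails because the value is a double: fld/fstp would need qword ptr, not dword ptr.)

```cpp
#include <emmintrin.h>

// Equivalent of: movsd xmm0,[ecx] / addsd xmm0,[one] / movsd [ecx],xmm0.
// _mm_set_sd puts the constant in the low lane; addsd adds low lanes.
static double add_one(double d) {
    __m128d x = _mm_set_sd(d);
    x = _mm_add_sd(x, _mm_set_sd(1.0));
    return _mm_cvtsd_f64(x);
}
```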

Invoking native code with hand-written assembly

旧巷老猫 submitted on 2019-12-06 00:29:13
I'm trying to call a native function from a managed assembly. I've done this on pre-compiled libraries and everything has gone well. At the moment I'm building my own library, and I can't get this to work. The native DLL source is the following: #define DERM_SIMD_EXPORT __declspec(dllexport) #define DERM_SIMD_API __cdecl extern "C" { DERM_SIMD_EXPORT void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right); } void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right) { __asm { .... } } Below is the managed code which loads the library
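For reference, a buildable version of the export with the __asm body (elided as .... in the post) replaced by a scalar stand-in, and the MSVC-only decorations guarded so it compiles elsewhere. On the managed side, a common failure mode here is a calling-convention mismatch: the export is __cdecl, while [DllImport] defaults to StdCall.

```cpp
#ifdef _WIN32
#  define DERM_SIMD_EXPORT __declspec(dllexport)
#  define DERM_SIMD_API __cdecl
#else
#  define DERM_SIMD_EXPORT
#  define DERM_SIMD_API
#endif

// Scalar stand-in for the elided __asm body; row-major 4x4 multiply.
extern "C" DERM_SIMD_EXPORT void DERM_SIMD_API
Matrix4x4_Multiply_SSE(float* result, float* left, float* right) {
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            float s = 0.f;
            for (int k = 0; k < 4; ++k)
                s += left[r * 4 + k] * right[k * 4 + c];
            result[r * 4 + c] = s;
        }
}
```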

AVX2, How to Efficiently Load Four Integers to Even Indices of a 256 Bit Register and Copy to Odd Indices?

蹲街弑〆低调 submitted on 2019-12-05 23:43:49
Question: I have an aligned array of integers in memory containing indices I0, I1, I2, I3. My goal is to get them into a __m256i register containing I0, I0 + 1, I1, I1 + 1, I2, I2 + 1, I3, I3 + 1. The hard part is getting them into the 256-bit register as I0, I0, I1, I1, I2, I2, I3, I3, after which I can just add a register containing 0, 1, 0, 1, 0, 1, 0, 1. I found the intrinsic _mm256_castsi128_si256, which lets me load the 4 integers into the lower 128 bits of the 256-bit register, but I'm

SSE Bilinear interpolation

心已入冬 submitted on 2019-12-05 23:37:41
Question: I'm implementing bilinear interpolation in a tight loop and trying to optimize it with SSE, but I get zero speed-up from it. Here is the code; the non-SIMD version uses a simple vector structure, which could be defined as struct Vec3f { float x, y, z; } with multiplication and addition operators implemented: #ifdef USE_SIMD const Color c11 = pixelCache[y1 * size.x + x1]; const Color c12 = pixelCache[y2 * size.x + x1]; const Color c22 = pixelCache[y2 * size.x + x2]; const Color c21 = pixelCache
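The snippet stops before the arithmetic; as a sketch of the SIMD side (assuming each Color fits a __m128 with one unused lane), bilinear interpolation is three lerps, two along x and one along y:

```cpp
#include <xmmintrin.h>

// a + (b - a) * t, applied to all four lanes at once.
static __m128 lerp(__m128 a, __m128 b, float t) {
    __m128 vt = _mm_set1_ps(t);
    return _mm_add_ps(a, _mm_mul_ps(_mm_sub_ps(b, a), vt));
}

// c11/c21 are the top pair, c12/c22 the bottom pair; tx, ty in [0,1].
static __m128 bilerp(__m128 c11, __m128 c21, __m128 c12, __m128 c22,
                     float tx, float ty) {
    return lerp(lerp(c11, c21, tx), lerp(c12, c22, tx), ty);
}
```

With only 3-wide data and loads/stores dominating, this alone often shows no speed-up; processing several pixels per iteration is usually what pays off.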