sse

accessing __m128 fields across compilers

我的未来我决定 submitted on 2019-12-06 03:56:04
I've noticed that accessing __m128 fields by index is possible in GCC, without using the union trick: __m128 t; float r(t[0] + t[1] + t[2] + t[3]); I can also load a __m128 just like an array: __m128 t{1.f, 2.f, 3.f, 4.f}; This is all in line with GCC's vector extensions. These, however, may not be available elsewhere. Are the loading and accessing features supported by the Intel compiler and MSVC? To load a __m128, you can write _mm_setr_ps(1.f, 2.f, 3.f, 4.f), which is supported by GCC, ICC, MSVC and Clang. So far as I know, Clang and recent versions of GCC support accessing __m128
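A portable way to cover both features, sketched here with a hypothetical hsum_ps helper (the name is ours): _mm_setr_ps for the array-style load, and a store to a temporary array for indexed access; both compile on GCC, Clang, MSVC and ICC.

```cpp
#include <xmmintrin.h>

// Portable horizontal sum: store to a temporary array instead of
// relying on GCC's t[i] vector-extension indexing.
static float hsum_ps(__m128 t) {
    float a[4];
    _mm_storeu_ps(a, t);            // supported everywhere SSE is
    return a[0] + a[1] + a[2] + a[3];
}

// Portable "array-style" load, equivalent to GCC's __m128 t{1,2,3,4}:
static __m128 load1234() {
    return _mm_setr_ps(1.f, 2.f, 3.f, 4.f);
}
```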

How to speed up calculation of integral image?

烈酒焚心 submitted on 2019-12-06 03:22:21
Question: I often need to calculate the integral image. This is a simple algorithm: void integral_sum(const uint8_t * src, size_t src_stride, size_t width, size_t height, uint32_t * sum, size_t sum_stride) { memset(sum, 0, (width + 1) * sizeof(uint32_t)); sum += sum_stride + 1; for (size_t row = 0; row < height; row++) { uint32_t row_sum = 0; sum[-1] = 0; for (size_t col = 0; col < width; col++) { row_sum += src[col]; sum[col] = row_sum + sum[col - sum_stride]; } src += src_stride; sum += sum_stride
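The declaration as posted had two return types (uint32_t void); here is a corrected, compilable version of the question's scalar routine, with the truncated closing braces restored along the obvious pattern. Note for optimisation: the per-row running sum is a serial dependency, which is why naive vectorisation of this loop rarely helps.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Integral image with a zeroed top/left border row and column.
// sum must point to a (height+1) x sum_stride buffer, sum_stride >= width+1.
void integral_sum(const uint8_t* src, size_t src_stride,
                  size_t width, size_t height,
                  uint32_t* sum, size_t sum_stride) {
    memset(sum, 0, (width + 1) * sizeof(uint32_t));  // zero the border row
    sum += sum_stride + 1;                           // skip border row/column
    for (size_t row = 0; row < height; row++) {
        uint32_t row_sum = 0;
        sum[-1] = 0;                                 // zero the border column
        for (size_t col = 0; col < width; col++) {
            row_sum += src[col];
            sum[col] = row_sum + sum[col - sum_stride];
        }
        src += src_stride;
        sum += sum_stride;
    }
}
```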

Determine cause of segfault when using -O3?

我只是一个虾纸丫 submitted on 2019-12-06 02:51:21
Question: I'm having trouble determining the cause of a segfault when a program is compiled with -O3 with GCC 4.8/4.9/5.1. For GCC 4.9.x, I've seen it on Cygwin, Debian 8 (x64) and Fedora 21 (x64). Others have experienced it on GCC 4.8 and 5.1. The program is fine under -O2, fine with other versions of GCC, and fine under other compilers (like MSVC, ICC and Clang). Below is the crash under GDB, but nothing is jumping out at me. The source code from misc.cpp:26 is below, but it's a simple XOR: ((word64*
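The snippet cuts off at the ((word64*) cast, but that cast pattern is the usual suspect: at -O3 GCC may auto-vectorise the loop with aligned SSE stores, which fault if the pointer is not actually 16-byte aligned (and the cast also breaks strict aliasing). This is a generic sketch of the safe memcpy idiom, not the question's actual source:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// XOR a byte buffer 64 bits at a time without casting the byte pointer
// to uint64_t*. memcpy makes alignment and aliasing well-defined, and
// compilers lower these fixed-size copies to plain loads/stores.
void xor64(unsigned char* buf, const unsigned char* mask, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t a, b;
        std::memcpy(&a, buf + i, 8);
        std::memcpy(&b, mask + i, 8);
        a ^= b;
        std::memcpy(buf + i, &a, 8);
    }
}
```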

The best way to shift a __m128i?

a 夏天 submitted on 2019-12-06 02:40:18
Question: I need to shift a __m128i variable (say v) by m bits, in such a way that bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this? Note that _mm_slli_epi64 shifts v0 and v1 separately: r0 := v0 << count r1 := v1 << count so the high bits of v0 are lost, but I want to move those bits into r1. Edit: I'm looking for code faster than this (m < 64): r0 = v0 << m; r1 = v0 >> (64-m); r1 ^= v1 << m; r2 = v1 >> (64-m); Answer 1: For compile-time
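For run-time m with 0 < m < 64, the scalar recipe from the edit maps directly onto SSE2: shift both lanes left, then recover the carry bits by moving v0 into the high lane with a byte shift and shifting it right. A sketch (the function name is ours):

```cpp
#include <emmintrin.h>
#include <cstdint>

// Full 128-bit left shift by m bits, 0 < m < 64, carrying bits across
// the 64-bit lane boundary: r0 = v0 << m, r1 = (v1 << m) | (v0 >> (64-m)).
static __m128i sll128(__m128i v, int m) {
    __m128i hi    = _mm_slli_epi64(v, m);              // each lane << m
    __m128i carry = _mm_srli_epi64(_mm_slli_si128(v, 8), 64 - m);
    return _mm_or_si128(hi, carry);                    // merge carry into r1
}
```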

QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE…AVX

泪湿孤枕 submitted on 2019-12-06 02:09:45
I would like to know if the following is possible in any of the SIMD families of instructions. I have a qword input with 63 significant bits (never negative). Each sequential 7 bits starting from the LSB is shuffle-aligned to a byte, with a left-padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake. The result is only the significant bytes, thus 0 - 9 in size, which is converted to a byte array. In: 0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ|IHGFEDC|BAzyxwv|utsrqpo|nmlkjih|gfedcba Out: 0kjihgfe|1dcbaZYX|1WVUTSRQ|1PONMLKJ|1IHGFEDC|1BAzyxwv
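As a scalar reference to pin down the mapping (it is essentially LEB128/varint encoding: a continuation 1 on every byte except the most significant non-zero one), with the SIMD/BMI2 route typically built on _pdep_u64 with mask 0x7F7F...7F following the same layout:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar reference: emit x as 7-bit groups, LSB group first; every byte
// except the last (most significant) carries a leading 1. Returns the
// number of bytes written (1..9 for a 63-bit input).
static size_t pack7(uint64_t x, uint8_t out[9]) {
    size_t n = 0;
    while (x > 0x7F) {
        out[n++] = (uint8_t)(x & 0x7F) | 0x80;  // left-pad with 1
        x >>= 7;
    }
    out[n++] = (uint8_t)x;                      // top byte: 0 pad
    return n;
}
```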

Why is my straightforward quaternion multiplication faster than SSE?

廉价感情. submitted on 2019-12-06 01:24:43
Question: I've been going through a few different quaternion multiplication implementations, but I've been rather surprised to see that the reference implementation is, so far, my fastest. This is the implementation in question: inline static quat multiply(const quat& lhs, const quat& rhs) { return quat((lhs.w * rhs.x) + (lhs.x * rhs.w) + (lhs.y * rhs.z) - (lhs.z * rhs.y), (lhs.w * rhs.y) + (lhs.y * rhs.w) + (lhs.z * rhs.x) - (lhs.x * rhs.z), (lhs.w * rhs.z) + (lhs.z * rhs.w) + (lhs.x * rhs.y) - (lhs.y
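The snippet is cut off before the z term finishes and the w component starts; completing it with the standard Hamilton product (assuming the question's quat holds x, y, z, w floats):

```cpp
struct quat {
    float x, y, z, w;
};

// Reference scalar Hamilton product in the question's layout; the
// z and w rows are restored from the standard formula.
inline static quat multiply(const quat& lhs, const quat& rhs) {
    return quat{
        (lhs.w * rhs.x) + (lhs.x * rhs.w) + (lhs.y * rhs.z) - (lhs.z * rhs.y),
        (lhs.w * rhs.y) + (lhs.y * rhs.w) + (lhs.z * rhs.x) - (lhs.x * rhs.z),
        (lhs.w * rhs.z) + (lhs.z * rhs.w) + (lhs.x * rhs.y) - (lhs.y * rhs.x),
        (lhs.w * rhs.w) - (lhs.x * rhs.x) - (lhs.y * rhs.y) - (lhs.z * rhs.z)};
}
```

Four independent multiply-add chains like this are easy for the compiler to schedule, which is one reason the scalar version can beat a shuffle-heavy SSE version.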

Add a constant value to a xmm register in x86

谁说我不能喝 submitted on 2019-12-06 01:24:00
How would I add 1 or 2 to the register xmm0 (double)? I can do it like this, but surely there must be an easier way: movsd xmm0, [ecx] xor eax, eax inc eax cvtsi2sd xmm1, eax addsd xmm0, xmm1 movsd [ecx], xmm0 Also, would it be possible to do this with the x87 floating point instructions? This doesn't work for me: fld dword ptr [ecx] fld1 faddp fstp dword ptr [ecx] You can keep a constant in memory or in another register: _1 dq 1.0 and addsd xmm1, [_1] or movsd xmm0, [_1] addsd xmm1, xmm0 If you are on x64, you can do this: mov rax, 1.0 movq xmm0, rax addsd xmm1, xmm0 or use the stack if the type
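The same operation at the intrinsics level, where the compiler materialises the constant from memory for you. (As an aside, the x87 attempt above fails because the value is a double: fld/fstp would need qword ptr, not dword ptr.)

```cpp
#include <emmintrin.h>

// Equivalent of: movsd xmm0,[ecx] / addsd xmm0,[one] / movsd [ecx],xmm0.
// _mm_set_sd puts the constant in the low lane; addsd adds low lanes.
static double add_one(double d) {
    __m128d x = _mm_set_sd(d);
    x = _mm_add_sd(x, _mm_set_sd(1.0));
    return _mm_cvtsd_f64(x);
}
```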

Invoking native code with hand-written assembly

旧巷老猫 submitted on 2019-12-06 00:29:13
I'm trying to call a native function from a managed assembly. I've done this on pre-compiled libraries and everything has gone well. At the moment I'm building my own library, and I can't get this to work. The native DLL source is the following: #define DERM_SIMD_EXPORT __declspec(dllexport) #define DERM_SIMD_API __cdecl extern "C" { DERM_SIMD_EXPORT void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right); } void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right) { __asm { .... } } Below is the managed code which loads the library
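For reference, a buildable version of the export with the __asm body (elided as .... in the post) replaced by a scalar stand-in, and the MSVC-only decorations guarded so it compiles elsewhere. On the managed side, a common failure mode here is a calling-convention mismatch: the export is __cdecl, while [DllImport] defaults to StdCall.

```cpp
#ifdef _WIN32
#  define DERM_SIMD_EXPORT __declspec(dllexport)
#  define DERM_SIMD_API __cdecl
#else
#  define DERM_SIMD_EXPORT
#  define DERM_SIMD_API
#endif

// Scalar stand-in for the elided __asm body; row-major 4x4 multiply.
extern "C" DERM_SIMD_EXPORT void DERM_SIMD_API
Matrix4x4_Multiply_SSE(float* result, float* left, float* right) {
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            float s = 0.f;
            for (int k = 0; k < 4; ++k)
                s += left[r * 4 + k] * right[k * 4 + c];
            result[r * 4 + c] = s;
        }
}
```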

AVX2, How to Efficiently Load Four Integers to Even Indices of a 256 Bit Register and Copy to Odd Indices?

蹲街弑〆低调 submitted on 2019-12-05 23:43:49
Question: I have an aligned array of integers in memory containing indices I0, I1, I2, I3. My goal is to get them into a __m256i register containing I0, I0 + 1, I1, I1 + 1, I2, I2 + 1, I3, I3 + 1. The hard part is getting them into the 256-bit register as I0, I0, I1, I1, I2, I2, I3, I3, after which I can just add a register containing 0, 1, 0, 1, 0, 1, 0, 1. I found the intrinsic _mm256_castsi128_si256, which lets me load the 4 integers into the lower 128 bits of the 256-bit register, but I'm

SSE Bilinear interpolation

心已入冬 submitted on 2019-12-05 23:37:41
Question: I'm implementing bilinear interpolation in a tight loop and trying to optimize it with SSE, but I get zero speed-up from it. Here is the code; the non-SIMD version uses a simple vector structure, which could be defined as struct Vec3f { float x, y, z; } with multiplication and addition operators implemented: #ifdef USE_SIMD const Color c11 = pixelCache[y1 * size.x + x1]; const Color c12 = pixelCache[y2 * size.x + x1]; const Color c22 = pixelCache[y2 * size.x + x2]; const Color c21 = pixelCache
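The snippet stops before the arithmetic; as a sketch of the SIMD side (assuming each Color fits a __m128 with one unused lane), bilinear interpolation is three lerps, two along x and one along y:

```cpp
#include <xmmintrin.h>

// a + (b - a) * t, applied to all four lanes at once.
static __m128 lerp(__m128 a, __m128 b, float t) {
    __m128 vt = _mm_set1_ps(t);
    return _mm_add_ps(a, _mm_mul_ps(_mm_sub_ps(b, a), vt));
}

// c11/c21 are the top pair, c12/c22 the bottom pair; tx, ty in [0,1].
static __m128 bilerp(__m128 c11, __m128 c21, __m128 c12, __m128 c22,
                     float tx, float ty) {
    return lerp(lerp(c11, c21, tx), lerp(c12, c22, tx), ty);
}
```

With only 3-wide data and loads/stores dominating, this alone often shows no speed-up; processing several pixels per iteration is usually what pays off.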