sse

How to properly use prefetch instructions?

旧街凉风 submitted on 2019-11-30 08:46:55
Question: I am trying to vectorize a loop that computes the dot product of two large float vectors. I am computing it in parallel, exploiting the fact that the CPU has a large number of XMM registers, like this: __m128 *A, *B; __m128 dot0, dot1, dot2, dot3; dot0 = dot1 = dot2 = dot3 = _mm_set_ps1(0); for(size_t i=0; i<1048576;i+=4) { dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0])); dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1])); dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2])); dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3]));
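
The excerpt stops before it reaches the prefetch part of the question, so the following is only a minimal sketch of how the loop above might look with software prefetch added. The names dot_sse and PREFETCH_AHEAD are hypothetical, the prefetch distance is a guess that would need tuning by measurement, and hardware prefetchers often handle sequential streams like this one without help.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_prefetch */

/* Hypothetical prefetch distance, in floats. Tune by measurement. */
#define PREFETCH_AHEAD 64

/* Dot product of a[0..n) and b[0..n). n must be a multiple of 16. */
float dot_sse(const float *a, const float *b, size_t n)
{
    assert(n % 16 == 0);

    /* Four independent accumulators give four separate dependency chains. */
    __m128 dot0 = _mm_setzero_ps();
    __m128 dot1 = _mm_setzero_ps();
    __m128 dot2 = _mm_setzero_ps();
    __m128 dot3 = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 16) {
        /* Prefetch is only a hint; skip it near the end of the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            _mm_prefetch((const char *)(a + i + PREFETCH_AHEAD), _MM_HINT_T0);
            _mm_prefetch((const char *)(b + i + PREFETCH_AHEAD), _MM_HINT_T0);
        }
        dot0 = _mm_add_ps(dot0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        dot1 = _mm_add_ps(dot1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        dot2 = _mm_add_ps(dot2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        dot3 = _mm_add_ps(dot3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }

    /* Combine the four accumulators, then sum the four lanes. */
    __m128 sum = _mm_add_ps(_mm_add_ps(dot0, dot1), _mm_add_ps(dot2, dot3));
    float lanes[4];
    _mm_storeu_ps(lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Unaligned loads are used here so the sketch works on any float array; with 16-byte aligned data, _mm_load_ps could be used instead.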

inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch

对着背影说爱祢 submitted on 2019-11-30 08:31:44
Question: I am trying to compile a C program using cmake which uses SIMD intrinsics. When I try to compile it, I get two errors: /usr/lib/gcc/x86_64-linux-gnu/5/include/smmintrin.h:326:1: error: inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch _mm_mullo_epi32 (__m128i __X, __m128i __Y) /usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch _mm_shuffle
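
GCC reports this error when an intrinsic is called from code compiled without the matching instruction set enabled. The excerpt is cut off before any answer, so what follows is a sketch of two common remedies, not necessarily the one the original thread settled on. One is to pass -msse4.1 (which also enables SSSE3 for _mm_shuffle_epi8) to the compiler, for example through target_compile_options in CMake. The other, shown below, is the GCC/Clang target attribute, which enables SSE4.1 for a single function. The function name mul4_i32 is hypothetical.

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 */

/* The target attribute enables SSE4.1 inside this function only, so the
   always_inline intrinsic can be inlined even without -msse4.1.
   The caller is responsible for running this only on CPUs with SSE4.1. */
__attribute__((target("sse4.1")))
void mul4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Lane-wise 32-bit multiply, keeping the low 32 bits of each product. */
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(va, vb));
}
```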

Can one construct a “good” hash function using CRC32C as a base?

ε祈祈猫儿з submitted on 2019-11-30 07:59:15
Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this, only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that? Update: How about this? Only 16 bits are suitable for a hash value. Fine. If your table has 65535 entries or fewer, then great. If not, run the CRC value through the Nehalem POPCNT (population count) instruction to get the number of bits set. Then, use that as an index into an array of tables. This works if
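
As a rough sketch of the starting point discussed above, the code below computes a CRC32C with the SSE 4.2 instruction and then passes it through a final bit-mixing step. The mixer is an assumption that is not taken from the excerpt: it is the 32-bit finalizer from MurmurHash3, swapped in purely for illustration. It is a bijection, so it only redistributes bits, and whether it addresses the distribution concern would have to be measured. The names crc32c, mix32 and crc_hash are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u8 */

/* Byte-at-a-time CRC32C (Castagnoli) using the hardware instruction.
   The caller must ensure the CPU supports SSE4.2. */
__attribute__((target("sse4.2")))
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;          /* conventional initial value */
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, p[i]);
    return ~crc;                         /* conventional final inversion */
}

/* Assumed post-mix: the MurmurHash3 32-bit finalizer. */
uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* CRC32C followed by the assumed mixing step. */
uint32_t crc_hash(const void *data, size_t len)
{
    return mix32(crc32c(data, len));
}
```

A faster variant would feed 8 bytes at a time through _mm_crc32_u64; the byte loop is kept here for clarity.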

How are denormalized floats handled in C#?

梦想与她 submitted on 2019-11-30 07:48:23
I just read this fascinating article about the 20x-200x slowdowns you can get on Intel CPUs with denormalized floats (floating point numbers very close to 0). There is an option with SSE to round these off to 0, restoring performance when such floating point values are encountered. How do C# apps handle this? Is there an option to enable/disable _MM_FLUSH_ZERO?
There is no such option. The FPU control word in a C# app is initialized by the CLR at startup. Changing it is not an option provided by the framework. Even if you try to change it by pinvoking _control87_2() then it is not going to last
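
For reference, this is roughly what the SSE option looks like in native C, where it is available; it does not change the answer for C#. The sketch assumes an x86-64 build, where float arithmetic goes through SSE and therefore obeys the MXCSR flush-to-zero bit. The function name tiny_product is hypothetical.

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */

/* Multiply two small floats whose product (about 1e-40) is below the
   smallest normal float, with flush-to-zero switched on or off. */
float tiny_product(int flush_to_zero)
{
    unsigned int saved = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(flush_to_zero ? _MM_FLUSH_ZERO_ON
                                          : _MM_FLUSH_ZERO_OFF);

    /* volatile keeps the compiler from folding the product at compile
       time or moving it across the mode changes. */
    volatile float a = 1e-30f;
    volatile float b = 1e-10f;
    volatile float r = a * b;

    _MM_SET_FLUSH_ZERO_MODE(saved);      /* restore the previous mode */
    return r;
}
```

With flush-to-zero on, the denormal result is replaced by zero; with it off, a small nonzero denormal is returned.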

SSE2 intrinsics - comparing unsigned integers

故事扮演 submitted on 2019-11-30 07:42:41
Question: I'm interested in identifying overflowing values when adding unsigned 8-bit integers, and saturating the result to 0xFF: __m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */); __m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */); __m128i m3 = _mm_adds_epu8(m1, m2); I would be interested in performing a less-than comparison on these unsigned integers, similar to _mm_cmplt_epi8 for signed: __m128i mask = _mm_cmplt_epi8 (m3, m1); m1 = _mm_or_si128(m3, mask); If an "epu8"
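
SSE2 has no unsigned byte comparison, and the excerpt ends before any answer. One common workaround, sketched below under that caveat, is to flip the top bit of every byte in both operands so that the signed comparison orders them the way an unsigned one would. The names cmplt_epu8 and cmplt_epu8_bytes are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Per byte: 0xFF where a < b as unsigned values, 0x00 otherwise.
   XOR with 0x80 maps 0..255 onto -128..127 while preserving order. */
static __m128i cmplt_epu8(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi8((char)0x80);
    return _mm_cmplt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Convenience wrapper over plain 16-byte arrays. */
void cmplt_epu8_bytes(const uint8_t *a, const uint8_t *b, uint8_t *mask)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)mask, cmplt_epu8(va, vb));
}
```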

Storing two x86 32 bit registers into 128 bit xmm register

风流意气都作罢 submitted on 2019-11-30 07:27:09
Is there any faster method to store two x86 32-bit registers in one 128-bit xmm register? movd xmm0, edx; movd xmm1, eax; pshufd xmm0, xmm0, $1; por xmm0, xmm1. So if EAX is 0x12345678 and EDX is 0x87654321, the result in xmm0 must be 0x8765432112345678. Thanks.
Paul R: With SSE 4.1 you can use movd xmm0, eax / pinsrd xmm0, edx, 1 and do it in 2 instructions. For older CPUs you can use 2 x movd and then punpckldq for a total of 3 instructions: movd xmm0, edx; movd xmm1, eax; punpckldq xmm0, xmm1.
I don't know much about MMX, but perhaps you want the PACKSSDW instruction. The PACKSSDW instruction takes
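
The three-instruction SSE2 route can also be written with intrinsics, which makes the operand order explicit. This is a minimal sketch; the function names combine_u32 and low64 are hypothetical, and the exact instructions emitted are up to the compiler.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Build a vector whose low 64 bits are hi:lo. The upper 64 bits are zero. */
__m128i combine_u32(uint32_t lo, uint32_t hi)
{
    __m128i vlo = _mm_cvtsi32_si128((int)lo);   /* movd: [lo, 0, 0, 0] */
    __m128i vhi = _mm_cvtsi32_si128((int)hi);   /* movd: [hi, 0, 0, 0] */
    return _mm_unpacklo_epi32(vlo, vhi);        /* punpckldq: [lo, hi, 0, 0] */
}

/* Helper for inspecting the low 64 bits of a vector. */
uint64_t low64(__m128i v)
{
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, v);
    return out;
}
```

Called with lo = 0x12345678 (EAX) and hi = 0x87654321 (EDX), the low 64 bits are 0x8765432112345678, matching the result asked for in the question.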

indexing into an array with SSE

主宰稳场 submitted on 2019-11-30 07:10:26
Question: Suppose I have an array: uint8_t arr[256]; and an element __m128i x containing 16 bytes, x_1, x_2, ... x_16. I would like to efficiently fill a new __m128i element __m128i y with values from arr depending on the values in x, such that: y_1 = arr[x_1] y_2 = arr[x_2] . . . y_16 = arr[x_16] A command to achieve this would essentially be loading a register from a non-contiguous set of memory locations. I have a painfully vague memory of having seen documentation of such a command, but can't find
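
The excerpt ends before any answer. As far as I know, SSE has no instruction that gathers bytes from a 256-entry table (pshufb from SSSE3 only indexes within a 16-byte register), so a plain fallback is to spill the indices to memory, do the lookups in scalar code, and reload. The sketch below shows that fallback only; the name lookup_bytes is hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Returns y with y_i = arr[x_i] for each of the 16 bytes in x. */
__m128i lookup_bytes(const uint8_t arr[256], __m128i x)
{
    uint8_t idx[16];
    uint8_t out[16];

    _mm_storeu_si128((__m128i *)idx, x);        /* spill the 16 indices */
    for (int i = 0; i < 16; i++)
        out[i] = arr[idx[i]];                   /* scalar table lookups */
    return _mm_loadu_si128((const __m128i *)out);
}
```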

How do I declare a memory range as uncacheable using gcc on x86 platform?

强颜欢笑 submitted on 2019-11-30 07:07:15
Although I have read about the movntdqa instruction in this regard, I have not figured out a clean way to mark a memory range as uncacheable, or to read data without polluting the cache. I want to do this from gcc. My main goal is to swap two random locations in a large array. I am hoping to accelerate this operation by avoiding caching, since there is very little data reuse.
I think what you're describing is Memory Type Range Registers (MTRRs). You can control these under Linux (if available and you're user 0) using /proc/mtrr / ioctl(2); see here for an example. As it works on a physical address range I think

How to move 128-bit immediates to XMM registers

匆匆过客 submitted on 2019-11-30 07:04:10
Question: There already is a question on this, but it was closed as "ambiguous" so I'm opening a new one - I've found the answer, maybe it will help others too. The question is: how do you write a sequence of assembly code to initialize an XMM register with a 128-bit immediate (constant) value? Answer 1: Just wanted to add that one can read about generating various constants using assembly in Agner Fog's manual Optimizing subroutines in assembly language, Generating constants, section 13.8, page 134. Answer 2:
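
The question is about assembly, but the same two ideas can be sketched with C intrinsics: an arbitrary constant, which the compiler typically loads from memory, and constants generated in registers without any memory load. The specific constants and the function names below are hypothetical choices for illustration, and the instructions actually emitted are up to the compiler.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* An arbitrary 128-bit constant. Arguments run from the highest
   32-bit lane down to the lowest. */
__m128i const_from_words(void)
{
    return _mm_set_epi32(0x00112233, 0x44556677,
                         (int)0x8899AABBu, (int)0xCCDDEEFFu);
}

/* All bits set: comparing a register with itself for equality
   (pcmpeqd) yields ones in every lane. */
__m128i const_all_ones(void)
{
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* 0x7FFFFFFF in each 32-bit lane: all ones shifted right by one. */
__m128i const_abs_mask(void)
{
    return _mm_srli_epi32(const_all_ones(), 1);
}
```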

How to load a pixel struct into an SSE register?

时光总嘲笑我的痴心妄想 submitted on 2019-11-30 06:57:54
I have a struct of 8-bit pixel data: struct __attribute__((aligned(4))) pixels { char r; char g; char b; char a; }; I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can I load these pixels into an SSE register as 32-bit unsigned integers?
Unpacking unsigned pixels with SSE2: OK, using SSE2 integer intrinsics from <emmintrin.h>, first load the thing into the lower 32 bits of the register: __m128i xmm0 = _mm_cvtsi32_si128(*(const int*)&pixel); Then unpack those 8-bit values into 16-bit values in the lower 64 bits of the register
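
The answer is cut off after the first unpack step. A plausible completion, sketched here as an assumption and not as the original author's code, widens the bytes twice by interleaving with zeros: 8 to 16 bits, then 16 to 32 bits. The struct is renamed pixel and uses unsigned char, and memcpy replaces the pointer cast to sidestep aliasing concerns; both are illustrative choices, as is the name load_pixel_u32.

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 */

struct pixel { unsigned char r, g, b, a; };

/* Returns the four channels as 32-bit lanes: [r, g, b, a]. */
__m128i load_pixel_u32(const struct pixel *p)
{
    int packed;
    memcpy(&packed, p, sizeof packed);          /* 4 bytes: r, g, b, a */

    __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(packed);      /* bytes in the low 32 bits */
    v = _mm_unpacklo_epi8(v, zero);             /* 8-bit  -> 16-bit, zero-extended */
    v = _mm_unpacklo_epi16(v, zero);            /* 16-bit -> 32-bit, zero-extended */
    return v;
}
```

Interleaving with zero is what makes the extension unsigned, so a channel value of 255 stays 255 and does not become -1.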