intrinsics

What are _mm_prefetch() locality hints?

Submitted by 元气小坏坏 on 2019-12-20 09:06:05
Question: The intrinsics guide says only this much about void _mm_prefetch (char const* p, int i): "Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i." Could you list the possible values of the int i parameter and explain their meanings? I've found _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA and _MM_HINT_ENTA, but I don't know whether this is an exhaustive list or what the hints mean. If processor-specific, I would like…

C++ SSE SIMD framework [closed]

Submitted by 浪子不回头ぞ on 2019-12-20 08:40:57
Question: Closed. This question is off-topic. It is not currently accepting answers. Want to improve this question? Update the question so it's on-topic for Stack Overflow. Closed 5 years ago. Does anyone know of an open-source C++ x86 SIMD intrinsics library? Intel supplies exactly what I need in their Integrated Performance Primitives library, but I can't use that because of its licensing. EDIT: I already know the intrinsics provided by the compilers. What I need is a convenient…

Semantics of __ddiv_ru

Submitted by 一世执手 on 2019-12-20 05:58:10
Question: From the documentation of __ddiv_ru I expect the result of the following code to be ceil(8/32) = 1.0; instead I obtain 0.25. #include <iostream> using namespace std; __managed__ double x; __managed__ double y; __managed__ double r; __global__ void ceilDiv() { r = __ddiv_ru(x,y); } int main() { x = 8; y = 32; r = -1; ceilDiv<<<1,1>>>(); cudaDeviceSynchronize(); cout << "The ceil of " << x << "/" << y << " is " << r << endl; return 1; } What am I missing? Answer 1: The result you are obtaining is correct…

Intrinsics for 128 multiplication and division

Submitted by ≯℡__Kan透↙ on 2019-12-20 04:27:12
Question: In x86_64 I know that the mul and div opcodes support 128-bit integers by putting the lower 64 bits in the rax register and the upper 64 bits in rdx. I was looking for an intrinsic to do this in the Intel intrinsics guide and could not find one. I am writing a big-number library where the word size is 64 bits. Right now I am doing division by a single word like this: int ubi_div_i64(ubigint_t* a, ubi_i64_t b, ubi_i64_t* rem) { if(b == 0) return UBI_MATH_ERR; ubi_i64_t r = 0; for(size_t…

_mm_cvtsd_f64 analog for the higher-order double

Submitted by 匆匆过客 on 2019-12-20 03:38:17
Question: I'm playing around with SIMD and wonder why there is no analog of _mm_cvtsd_f64 to extract the higher-order double from a __m128d. GCC 4.6+ has an extension which achieves this in a nice way: __m128d a = ...; double d1 = a[0]; double d2 = a[1]; But on older GCC (e.g. 4.4) the only way I could manage to get this is to define my own analog using __builtin_ia32_vec_ext_v2df, i.e.: extern __inline double __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm…

How to use RDRAND intrinsics?

Submitted by 落爺英雄遲暮 on 2019-12-20 02:10:51
Question: I was looking at H.J. Lu's PATCH: Update x86 rdrand intrinsics. I can't tell whether I should be using _rdrand_u64, _rdrand64_step, or other functions. There do not appear to be test cases written for them, and there also seems to be a lack of man pages (Ubuntu 14, GCC 4.8.4): $ man -k rdrand rdrand: nothing appropriate. How does one use the RDRAND intrinsics to generate, say, a block of 32 bytes? A related question is RDRAND and RDSEED intrinsics GCC and Intel C++. But it does…

A faster integer SSE unaligned load that's rarely used [duplicate]

Submitted by 一世执手 on 2019-12-19 08:59:43
Question: This question already has an answer here: what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256 (1 answer) Closed 2 years ago. I would like to know more about the _mm_lddqu_si128 intrinsic (the lddqu instruction, since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (the movdqu instruction, since SSE2). I only discovered _mm_lddqu_si128 today. The Intel intrinsics guide says this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line…

Branching on constexpr evaluation / overloading on constexpr

Submitted by 可紊 on 2019-12-19 08:49:30
Question: The setup: I have a function that uses SIMD intrinsics and would like to use it inside some constexpr functions. For that, I need to make it constexpr. However, the SIMD intrinsics are not marked constexpr, and the compiler's constant evaluator cannot handle them. I tried replacing the SIMD intrinsics with a constexpr C++ implementation that does the same thing. The function became 3.5x slower at run time, but I was able to use it at compile time (yay?). The problem: How can I use this…

What is meant by “fixing up” floats?

Submitted by 情到浓时终转凉″ on 2019-12-19 07:38:26
Question: I was looking through the AVX-512 instruction set and noticed a set of fixup instructions. Some examples: _mm512_fixupimm_pd, _mm512_mask_fixupimm_pd, _mm512_maskz_fixupimm_pd, _mm512_fixupimm_round_pd, _mm512_mask_fixupimm_round_pd, _mm512_maskz_fixupimm_round_pd. What is meant here by "fixing up"? Answer 1: That's a great question. Intel's answer (my bold) is here: "This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that…"

How to load a pixel struct into an SSE register?

Submitted by 和自甴很熟 on 2019-12-18 12:25:21
Question: I have a struct of 8-bit pixel data: struct __attribute__((aligned(4))) pixels { char r; char g; char b; char a; }; I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can I load these pixels into an SSE register as 32-bit unsigned integers? Answer 1: Unpacking unsigned pixels with SSE2. OK, using SSE2 integer intrinsics from <emmintrin.h>, first load the thing into the lower 32 bits of the register: __m128i xmm0 = _mm_cvtsi32_si128(*…