intrinsics

What is the fastest way to add vector elements horizontally in odd order?

纵然是瞬间 submitted on 2019-12-23 04:15:28
Question: Following up on this question, I implemented the horizontal addition, this time 5 by 5 and 7 by 7. It does the job correctly, but it is not fast enough. Can it be faster than it is? I tried to use hadd and other instructions, but the improvement was limited. For example, when I use _mm256_bsrli_epi128 it is slightly better, but it needs some extra permutation that ruins the benefit because of the lanes. So the question is how it should be implemented to gain more performance. The same story is…
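One approach worth sketching (a minimal example, not the asker's code; single-precision floats and AVX are assumed): zero the unused lanes with one AND, then reuse a standard full-width horizontal reduction, so the odd element count costs only a single extra instruction.

    #include <immintrin.h>

    // Sketch: sum the low 5 floats of a __m256. Masking lanes 5-7 to zero
    // lets the ordinary 8-lane reduction produce the 5-element sum.
    static inline float hsum5_ps(__m256 v) {
        const __m256 mask = _mm256_castsi256_ps(
            _mm256_setr_epi32(-1, -1, -1, -1, -1, 0, 0, 0));
        v = _mm256_and_ps(v, mask);
        __m128 lo = _mm256_castps256_ps128(v);          // low 128-bit lane
        __m128 hi = _mm256_extractf128_ps(v, 1);        // high 128-bit lane
        lo = _mm_add_ps(lo, hi);                        // 4 partial sums
        lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));     // 2 partial sums
        lo = _mm_add_ss(lo, _mm_shuffle_ps(lo, lo, 1)); // final scalar sum
        return _mm_cvtss_f32(lo);
    }

The cross-lane extract happens once, at the end of the reduction, which avoids the per-element permutations that made the _mm256_bsrli_epi128 variant disappointing.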

Why should you not access the __m128i fields directly?

风格不统一 submitted on 2019-12-22 06:48:50
Question: I was reading this on MSDN, and it says: You should not access the __m128i fields directly. You can, however, see these types in the debugger. A variable of type __m128i maps to the XMM[0-7] registers. However, it doesn't explain why. Why is it? For example, is the following "bad"?

    void func(unsigned short x, unsigned short y) {
        __m128i a;
        a.m128i_i64[0] = x;
        __m128i b;
        b.m128i_i64[0] = y;
        // Now do something with a and b ...
    }

Instead of doing the assignments like in the example above, should…
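For reference, a portable sketch of the same setup (assuming the values only need to land in lane 0): build the vectors with intrinsics instead of writing to the m128i_i64 fields, which exist only in MSVC's headers and typically force the value through memory.

    #include <immintrin.h>

    // Sketch: _mm_cvtsi32_si128 zero-extends its argument into lane 0
    // and clears the rest of the register.
    void func(unsigned short x, unsigned short y) {
        __m128i a = _mm_cvtsi32_si128(x);
        __m128i b = _mm_cvtsi32_si128(y);
        // Now do something with a and b ...
    }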

How can I access SHA intrinsics?

落花浮王杯 submitted on 2019-12-22 06:48:01
Question: Gprof tells me that my computationally heavy program spends most of its time (36%) hashing using AP-Hash. I can't reduce the call count, but I would still like to make it faster. Can I call the SHA intrinsics from a C program? Do I need the Intel compiler, or can I stick with GCC? Answer 1: SHA instructions are now available in the Goldmont architecture, released around September 2016. According to the Intel Intrinsics Guide, these are the intrinsics of interest: __m128i _mm_sha1msg1_epu32 (__m128i…
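GCC is enough: a minimal sketch (hypothetical file name; a recent GCC with -msha is assumed, and the binary will only run on a CPU with SHA-NI, e.g. Goldmont or AMD Zen):

    #include <immintrin.h>
    #include <stdio.h>

    // Build with:  gcc -msha sha_test.c
    // At runtime the CPU must report SHA-NI (the "sha_ni" flag in
    // /proc/cpuinfo, i.e. CPUID.(EAX=7,ECX=0):EBX bit 29).
    int main(void) {
        __m128i a = _mm_set1_epi32(0x12345678);
        __m128i b = _mm_set1_epi32(0x0abcdef0);
        __m128i r = _mm_sha1msg1_epu32(a, b);  // emits the SHA1MSG1 instruction
        printf("%08x\n", (unsigned)_mm_cvtsi128_si32(r));
        return 0;
    }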

-O3 in ICC messes up intrinsics, fine with -O1 or -O2 or with corresponding manual assembly

旧街凉风 submitted on 2019-12-22 06:12:03
Question: This is a followup to this question. The code below, for a 4x4 matrix multiplication C = AB, compiles fine with ICC at all optimization settings. It executes correctly at -O1 and -O2, but gives an incorrect result at -O3. The problem seems to come from the _mm256_storeu_pd operation, as substituting it (and only it) with the asm statement below gives correct results after execution. Any ideas?

    inline void RunIntrinsics_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C) {…
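For context, the usual shape of such a kernel (a minimal sketch under row-major assumptions, not the asker's exact function) broadcasts one element of A per FMA and stores each result row with an unaligned store:

    #include <immintrin.h>

    // Sketch: C = A*B for 4x4 row-major doubles, using FMA and unaligned
    // loads/stores so no 32-byte alignment is required.
    static inline void matmul4x4d(const double *A, const double *B, double *C) {
        for (int i = 0; i < 4; ++i) {
            __m256d acc = _mm256_setzero_pd();
            for (int k = 0; k < 4; ++k) {
                __m256d a = _mm256_set1_pd(A[i * 4 + k]); // broadcast A[i][k]
                __m256d b = _mm256_loadu_pd(&B[k * 4]);   // row k of B
                acc = _mm256_fmadd_pd(a, b, acc);         // acc += a * b
            }
            _mm256_storeu_pd(&C[i * 4], acc);             // row i of C
        }
    }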

How to check with Intel intrinsics whether AVX is supported by the CPU?

自闭症网瘾萝莉.ら submitted on 2019-12-22 05:13:34
Question: I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that allow me to distinguish whether AVX is supported, so that I can write something like this:

    #ifdef __IS_AVX_SUPPORTED__ // is there something like this defined?
    // use _mm_permute_pd
    #else
    // use _mm…
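There is such a macro: gcc, clang, and ICC predefine __AVX__ when AVX code generation is enabled (-mavx), and MSVC defines it under /arch:AVX. A sketch of the compile-time dispatch:

    #include <immintrin.h>

    // Sketch: swap the two doubles of a __m128d, picking the AVX encoding
    // when the compiler is allowed to emit AVX instructions.
    static inline __m128d swap_halves(__m128d v) {
    #ifdef __AVX__
        return _mm_permute_pd(v, 1);     // AVX: vpermilpd
    #else
        return _mm_shuffle_pd(v, v, 1);  // SSE2 fallback
    #endif
    }

Note that __AVX__ reflects the compiler flags, not the machine the binary eventually runs on; choosing a code path at run time requires a CPUID check instead.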

Slower SSE performance on large array sizes

删除回忆录丶 submitted on 2019-12-21 20:09:36
Question: I am new to SSE programming, so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.

    int ssum(const int *d, unsigned int len) {
        static const unsigned int BLOCKSIZE = 4;
        unsigned int i, remainder;
        int output;
        __m128i xmm0, accumulator;
        __m128i *src;
        remainder = len % BLOCKSIZE;
        src = (__m128i*)d;
        accumulator = _mm_loadu_si128(src);
        output = 0;
        for(i…
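A complete version of such a loop might look like the sketch below (my reconstruction, not the asker's full code). On arrays larger than the caches the loop is bound by memory bandwidth rather than by the additions, which is a common reason SSE shows little or no speedup at large sizes:

    #include <emmintrin.h>

    // Sketch: SSE2 sum of 32-bit ints, 4 per iteration, scalar tail.
    int ssum_sketch(const int *d, unsigned int len) {
        __m128i acc = _mm_setzero_si128();
        unsigned int i = 0;
        for (; i + 4 <= len; i += 4)
            acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(d + i)));
        acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8)); // fold 4 -> 2 sums
        acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4)); // fold 2 -> 1 sum
        int output = _mm_cvtsi128_si32(acc);
        for (; i < len; ++i)                              // remainder
            output += d[i];
        return output;
    }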

Make compiler copy characters using movsd

梦想与她 submitted on 2019-12-21 11:01:35
Question: I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time-critical function. The best code for this on the CPU side seems to be rep movsd. However, I somehow cannot make my compiler generate this code. I hoped (and I vaguely remember seeing so) that using memcpy would do this via compiler built-in intrinsics, but based on disassembly and debugging it seems the compiler emits a call to the memcpy/memmove library implementation instead. I also hoped the…
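If the compiler is MSVC (an assumption; the question doesn't name it), the instruction can be forced with the __movsd intrinsic from <intrin.h>, which emits rep movsd directly:

    #include <intrin.h>
    #include <string.h>

    // Sketch (MSVC-only): copy n bytes, 4 at a time via rep movsd, with the
    // 0-3 trailing bytes handled separately.
    void copy_fast(void *dst, const void *src, size_t n) {
        size_t dwords = n / 4;
        __movsd((unsigned long *)dst, (const unsigned long *)src, dwords);
        memcpy((char *)dst + dwords * 4, (const char *)src + dwords * 4, n % 4);
    }

Whether rep movsd actually beats an inlined SSE copy at 2-200 bytes varies by microarchitecture, so it is worth measuring both.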

Initializing an __m128 type from a 64-bit unsigned int

拈花ヽ惹草 submitted on 2019-12-21 09:14:07
Question: The _mm_set_epi64 and similar *_epi64 intrinsics seem to use and depend on __m64 types. I want to initialize a variable of type __m128 such that the upper 64 bits of it are 0 and the lower 64 bits are set to x, where x is of type uint64_t (or a similar unsigned 64-bit type). What's the "right" way of doing so? Preferably, this should be done in a compiler-independent manner. Answer 1: To answer your question about how to load a 64-bit value into the lower 64 bits of an XMM register while…
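Two common options, as a sketch (both zero the upper 64 bits; _mm_set_epi64x takes plain 64-bit integers, so no __m64 is involved):

    #include <immintrin.h>
    #include <stdint.h>

    __m128i from_u64_portable(uint64_t x) {
        return _mm_set_epi64x(0, (long long)x);  // works on modern MSVC/gcc/clang
    }

    __m128i from_u64_movq(uint64_t x) {
        return _mm_cvtsi64_si128((long long)x);  // movq; 64-bit targets only
    }

If the value is ultimately needed as a float vector __m128, _mm_castsi128_ps reinterprets the bits without generating any instruction.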

Why and when to use __noop?

早过忘川 submitted on 2019-12-21 07:12:12
Question: I was reading about __noop, and the MSDN example is

    #if DEBUG
    #define PRINT printf_s
    #else
    #define PRINT __noop
    #endif

    int main() {
        PRINT("\nhello\n");
    }

and I don't see the gain over just having an empty macro:

    #define PRINT

The generated code is the same. What's a valid example of using __noop that actually makes it useful? Answer 1: The __noop intrinsic specifies that a function should be ignored and the argument list be parsed but no code be generated for the arguments. It is intended for use…
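One concrete difference, sketched below (MSVC-specific, since __noop is an MSVC intrinsic): with the empty macro the parenthesized argument list survives as a comma expression, so side effects in the arguments still execute, whereas __noop parses the arguments but generates no code for them.

    #define PRINT_EMPTY
    #define PRINT_NOOP __noop

    int main() {
        int i = 0;
        PRINT_EMPTY("count=%d", ++i); // becomes ("count=%d", ++i); -> ++i runs
        PRINT_NOOP("count=%d", ++i);  // parsed only; ++i does not run
        return i;                     // 1, not 2
    }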