intrinsics

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

Submitted by 旧时模样 on 2019-12-05 08:46:42
The code I want to optimize is basically a simple but large arithmetic formula. It should be fairly simple to analyze the code automatically and schedule the independent multiplications/additions in parallel, but I have read that autovectorization only works for loops. I have also read multiple times that accessing single elements of a vector through a union (or any other way) should be avoided at all costs and replaced by _mm_shuffle_pd (I am working with doubles only). I can't figure out how to store the contents of a __m128d vector as doubles without accessing it as a union. Also,

-O3 in ICC messes up intrinsics, fine with -O1 or -O2 or with corresponding manual assembly

Submitted by 点点圈 on 2019-12-05 07:26:21
This is a follow-up to this question. The code below, for a 4x4 matrix multiplication C = AB, compiles fine under ICC at every optimization level. It executes correctly at -O1 and -O2, but gives an incorrect result at -O3. The problem seems to come from the _mm256_storeu_pd operation: substituting it (and only it) with the asm statement below yields correct results. Any ideas?

inline void RunIntrinsics_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C)
{
    size_t i;
    /* the registers you use */
    __m256d a0, a1, a2, a3, b0, b1, b2, b3, sum;
    // __m256d *C256 = (_

How do I perform an 8 x 8 matrix operation using SSE?

Submitted by 你说的曾经没有我的故事 on 2019-12-05 05:33:29
My initial attempt looked like this (suppose we want to multiply):

__m128 mat[n]; /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];
for (int row = 0; row < n; row++) {
    for (int k = 3; k < 8; k = k + 4) {
        __m128 mrow = mat[k];
        __m128 v = vec[row];
        __m128 sum = _mm_mul_ps(mrow, v);
        sum = _mm_hadd_ps(sum, sum); /* adds adjacent-two floats */
    }
    _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
}

But this clearly doesn't work. How do I approach this? I should load 4 at a time.... The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?

“Custom intrinsic” function for x64 instead of inline assembly possible?

Submitted by 痴心易碎 on 2019-12-05 04:58:11
I am currently experimenting with the creation of highly optimized, reusable functions for a library of mine. For instance, I write the "is power of two" function the following way:

template<class IntType>
inline bool is_power_of_two( const IntType x )
{
    return (x != 0) && ((x & (x - 1)) == 0);
}

This is a portable, low-maintenance implementation as an inline C++ template. VC++ 2008 compiles it to the following code with branches:

is_power_of_two PROC
    test rcx, rcx
    je SHORT $LN3@is_power_o
    lea rax, QWORD PTR [rcx-1]
    test rax, rcx
    jne SHORT $LN3@is_power_o
    mov al, 1
    ret 0
$LN3@is

How do I initialize a SIMD vector with a range from 0 to N?

Submitted by 自古美人都是妖i on 2019-12-05 04:34:54
Question: I have the following function I'm trying to write an AVX version for:

void hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length)
{
    size_t i, j, v, p;
    char temp;
    if (!salt_length) {
        return;
    }
    for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) {
        v %= salt_length;
        p += salt[v];
        j = (salt[v] + v + p) % i;
        temp = str[i];
        str[i] = str[j];
        str[j] = temp;
    }
}

I'm trying to vectorize v %= salt_length;. I want to initialize a vector that contains numbers from 0 to str

How to check with Intel intrinsics if AVX extensions are supported by the CPU?

Submitted by ぃ、小莉子 on 2019-12-05 04:32:09
I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that let me distinguish whether AVX is supported, so that I can write something like the following?

#ifdef __IS_AVX_SUPPORTED__ // is there something like this defined?
    // use _mm_permute_pd
#else
    // use _mm_shuffle_pd
#endif

I have found this tutorial, which shows how to perform a runtime check, but I need to do

Funnel shift - what is it?

Submitted by 百般思念 on 2019-12-04 18:05:07
Question: While reading through the CUDA 5.0 Programming Guide I stumbled on a feature called "funnel shift", which is present in compute-capability 3.5 devices but not 3.0. It carries an annotation "see reference manual", but when I search for the term "funnel shift" in the manual, I don't find anything. I tried googling for it, but only found a mention on http://www.cudahandbook.com, in chapter 8:

8.2.3 Funnel Shift (SM 3.5)
GK110 added a 64-bit "funnel shift" instruction that may be accessed with the

SIMD and difference between packed and scalar double precision

Submitted by ̄綄美尐妖づ on 2019-12-04 16:09:48
Question: I am reading Intel's intrinsics guide while implementing SIMD support. I have a few confusions, and my questions are as follows. __m128 _mm_cmpeq_ps (__m128 a, __m128 b): the documentation says it is used to compare packed single-precision floating-point values. What does "packed" mean? Do I need to pack my float values somehow before I can use them? For double precision there are intrinsics like _mm_cmpeq_sd, which means compare the "lower" double-precision floating-point elements. What does lower and

Slower SSE performance on large array sizes

Submitted by 馋奶兔 on 2019-12-04 15:23:01
I am new to SSE programming, so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.

int ssum(const int *d, unsigned int len)
{
    static const unsigned int BLOCKSIZE = 4;
    unsigned int i, remainder;
    int output;
    __m128i xmm0, accumulator;
    __m128i *src;
    remainder = len % BLOCKSIZE;
    src = (__m128i*)d;
    accumulator = _mm_loadu_si128(src);
    output = 0;
    for (i = BLOCKSIZE; i < len - remainder; i += BLOCKSIZE) {
        xmm0 = _mm_loadu_si128(++src);
        accumulator = _mm_add_epi32

gcc, simd intrinsics and fast-math concepts

Submitted by 心已入冬 on 2019-12-04 09:19:17
Question: Hi all :) I'm trying to get the hang of a few concepts regarding floating point, SIMD/math intrinsics, and the fast-math flag for gcc. More specifically, I'm using MinGW with gcc v4.5.0 on an x86 CPU. I've searched around for a while now, and this is what I (think I) understand at the moment: When I compile with no flags, any FP code will be standard x87, no SIMD intrinsics, and the math.h functions will be linked from msvcrt.dll. When I use -mfpmath, -msse and/or -march so that mmx/sse/avx code
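A sketch of the corresponding gcc invocations (flag spellings as of the gcc 4.5 era; prog.c is a placeholder):

```shell
# Plain x87 code; libm functions resolved by the C runtime (msvcrt.dll on MinGW):
gcc -O2 prog.c -o prog

# Generate SSE2 code and perform scalar FP math in SSE registers instead of x87:
gcc -O2 -msse2 -mfpmath=sse prog.c -o prog

# Additionally allow reassociation, reciprocal approximations, etc.
# (-ffast-math breaks strict IEEE 754 semantics):
gcc -O2 -msse2 -mfpmath=sse -ffast-math prog.c -o prog
```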