intrinsics

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

Submitted by 旧时模样 on 2019-12-05 08:46:42
The code I want to optimize is basically a simple but large arithmetic formula. It should be fairly simple to analyze the code automatically and schedule the independent multiplications/additions in parallel, but I have read that autovectorization only works for loops. I have also read multiple times that accessing single elements of a vector through a union (or any other way) should be avoided at all costs and replaced by _mm_shuffle_pd (I am working with doubles only). I can't figure out how to store the contents of a __m128d vector as doubles without accessing it as a union. Also,

-O3 in ICC messes up intrinsics, fine with -O1 or -O2 or with corresponding manual assembly

Submitted by 点点圈 on 2019-12-05 07:26:21
This is a follow-up to this question. The code below, for a 4x4 matrix multiplication C = AB, compiles fine under ICC at every optimization level. It executes correctly at -O1 and -O2, but gives an incorrect result at -O3. The problem seems to come from the _mm256_storeu_pd operation: substituting it (and only it) with the asm statement below yields correct results. Any ideas?

inline void RunIntrinsics_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C)
{
    size_t i;
    /* the registers you use */
    __m256d a0, a1, a2, a3, b0, b1, b2, b3, sum;
    // __m256d *C256 = (_

How do I perform an 8 x 8 matrix operation using SSE?

Submitted by 你说的曾经没有我的故事 on 2019-12-05 05:33:29
My initial attempt looked like this (suppose we want to multiply):

__m128 mat[n]; /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];
for (int row = 0; row < n; row++) {
    for (int k = 3; k < 8; k = k + 4) {
        __m128 mrow = mat[k];
        __m128 v = vec[row];
        __m128 sum = _mm_mul_ps(mrow, v);
        sum = _mm_hadd_ps(sum, sum); /* adds adjacent-two floats */
    }
    _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
}

But this clearly doesn't work. How do I approach this? I should load 4 at a time.... The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?

“Custom intrinsic” function for x64 instead of inline assembly possible?

Submitted by 痴心易碎 on 2019-12-05 04:58:11
I am currently experimenting with the creation of highly optimized, reusable functions for a library of mine. For instance, I write the "is power of two" function the following way:

template<class IntType>
inline bool is_power_of_two( const IntType x )
{
    return (x != 0) && ((x & (x - 1)) == 0);
}

This is a portable, low-maintenance implementation as an inline C++ template. VC++ 2008 compiles it to the following code with branches:

is_power_of_two PROC
    test rcx, rcx
    je SHORT $LN3@is_power_o
    lea rax, QWORD PTR [rcx-1]
    test rax, rcx
    jne SHORT $LN3@is_power_o
    mov al, 1
    ret 0
$LN3@is

How do I initialize a SIMD vector with a range from 0 to N?

Submitted by 自古美人都是妖i on 2019-12-05 04:34:54
Question: I have the following function I'm trying to write an AVX version for:

void hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length)
{
    size_t i, j, v, p;
    char temp;
    if (!salt_length) {
        return;
    }
    for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) {
        v %= salt_length;
        p += salt[v];
        j = (salt[v] + v + p) % i;
        temp = str[i];
        str[i] = str[j];
        str[j] = temp;
    }
}

I'm trying to vectorize v %= salt_length;. I want to initialize a vector that contains numbers from 0 to str

How to check with Intel intrinsics if AVX extensions are supported by the CPU?

Submitted by ぃ、小莉子 on 2019-12-05 04:32:09
I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that let me distinguish whether AVX is supported, so that I can write something like the following?

#ifdef __IS_AVX_SUPPORTED__ // is there something like this defined?
    // use _mm_permute_pd
#else
    // use _mm_shuffle_pd
#endif

I have found this tutorial, which shows how to perform a runtime check, but I need to do

Funnel shift - what is it?

Submitted by 百般思念 on 2019-12-04 18:05:07
Question: While reading through the CUDA 5.0 Programming Guide I stumbled on a feature called "funnel shift", which is present in compute-capability 3.5 devices but not 3.0. It carries an annotation "see reference manual", but when I search for the term "funnel shift" in the manual, I don't find anything. I tried googling for it, but only found a mention on http://www.cudahandbook.com, in chapter 8:

8.2.3 Funnel Shift (SM 3.5)
GK110 added a 64-bit "funnel shift" instruction that may be accessed with the

SIMD and difference between packed and scalar double precision

Submitted by ̄綄美尐妖づ on 2019-12-04 16:09:48
Question: I am reading Intel's intrinsics guide while implementing SIMD support. I have a few confusions, and my questions are as follows. __m128 _mm_cmpeq_ps (__m128 a, __m128 b): the documentation says it is used to compare packed single-precision floating-point values. What does "packed" mean? Do I need to pack my float values somehow before I can use them? For double precision there are intrinsics like _mm_cmpeq_sd, which means compare the "lower" double-precision floating-point elements. What does lower and

Slower SSE performance on large array sizes

Submitted by 馋奶兔 on 2019-12-04 15:23:01
I am new to SSE programming, so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.

int ssum(const int *d, unsigned int len)
{
    static const unsigned int BLOCKSIZE = 4;
    unsigned int i, remainder;
    int output;
    __m128i xmm0, accumulator;
    __m128i *src;
    remainder = len % BLOCKSIZE;
    src = (__m128i*)d;
    accumulator = _mm_loadu_si128(src);
    output = 0;
    for (i = BLOCKSIZE; i < len - remainder; i += BLOCKSIZE) {
        xmm0 = _mm_loadu_si128(++src);
        accumulator = _mm_add_epi32

gcc, simd intrinsics and fast-math concepts

Submitted by 心已入冬 on 2019-12-04 09:19:17
Question: Hi all :) I'm trying to get the hang of a few concepts regarding floating point, SIMD/math intrinsics, and the fast-math flag for gcc. More specifically, I'm using MinGW with gcc v4.5.0 on an x86 CPU. I've searched around for a while now, and this is what I (think I) understand at the moment: When I compile with no flags, any FP code will be standard x87, no SIMD intrinsics, and the math.h functions will be linked from msvcrt.dll. When I use -mfpmath, -msse and/or -march so that mmx/sse/avx code
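A sketch of the corresponding gcc invocations (flag spellings as of the gcc 4.5 era; prog.c is a placeholder):

```shell
# Plain x87 code; libm functions resolved by the C runtime (msvcrt.dll on MinGW):
gcc -O2 prog.c -o prog

# Generate SSE2 code and perform scalar FP math in SSE registers instead of x87:
gcc -O2 -msse2 -mfpmath=sse prog.c -o prog

# Additionally allow reassociation, reciprocal approximations, etc.
# (-ffast-math breaks strict IEEE 754 semantics):
gcc -O2 -msse2 -mfpmath=sse -ffast-math prog.c -o prog
```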