intrinsics

How to vblend for 32-bit integers? or: Why is there no _mm256_blendv_epi32?

对着背影说爱祢 submitted on 2019-12-12 18:08:57
Question: I'm using the AVX2 x86 256-bit SIMD extensions. I want to do a 32-bit integer component-wise if-then-else. In the Intel documentation such an instruction is called vblend. The Intel Intrinsics Guide contains the function _mm256_blendv_epi8, which does nearly what I need; the only problem is that it works with 8-bit integers. Unfortunately there is no _mm256_blendv_epi32 in the docs. My first question is: why does this function not exist? My second question is: how to emulate…

Issues with compiler-generated assembly for intrinsics

限于喜欢 submitted on 2019-12-12 12:22:22
Question: I'm using Intel SSE/AVX/FMA intrinsics to get SSE/AVX instructions perfectly inlined for some math functions. Given the following code: #include <cmath> #include <immintrin.h> auto std_fma(float x, float y, float z) { return std::fma(x, y, z); } float _fma(float x, float y, float z) { _mm_store_ss(&x, _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))); return x; } float _sqrt(float x) { _mm_store_ss(&x, _mm_sqrt_ss(_mm_load_ss(&x))); return x; } the clang 3.9 generated…

Can a C++ Compiler Eliminate a Volatile Local Var that is not Read?

核能气质少年 submitted on 2019-12-12 11:30:32
Question: Say I have this code: int f() { volatile int c; c=34; return abc(); } The volatile int c is never read, but it is marked volatile; can the compiler eliminate it altogether? My testing in Visual C++ 2010 shows contradictory results. In VC++, if I enable optimization (maximize speed), the above function still contains a local variable called c (judging by the generated assembly listing). But instead of using the assignment operator, I also tried to initialize the variable by a compiler…

Visual Studio 2017: _mm_load_ps often compiled to movups

余生颓废 submitted on 2019-12-12 11:14:35
Question: I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups. The data I'm using _mm_load_ps on is defined like this: struct alignas(16) Vector { float v[4]; }; // often embedded in other structs, like this: struct AABB { Vector min; Vector max; bool intersection(/* parameters */) const; }; Now when I use this construct, the following happens: // this code __m128 bb_min = _mm_load_ps(min.v); // generates…

Fastest way to multiply two vectors of 32-bit integers in C++, with SSE

牧云@^-^@ submitted on 2019-12-12 09:53:19
Question: I have two unsigned vectors, both of size 4: vector<unsigned> v1 = {2, 4, 6, 8}; vector<unsigned> v2 = {1, 10, 11, 13}; Now I want to multiply these two vectors and get a new one: vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13}; What SSE operation should I use? Is it cross-platform, or only available on some specific platforms? Addendum: if my goal were addition, not multiplication, I could do this super fast: __m128i a = _mm_set_epi32(1,2,3,4); __m128i b = _mm_set_epi32(1,2,3,4); __m128i c; c = _mm_add_epi32…

How can I get an intrinsic for the exp() function in x64 code?

纵然是瞬间 submitted on 2019-12-12 08:29:44
Question: I have the following code and am expecting the intrinsic version of the exp() function to be used. Unfortunately, it is not used in an x64 build, making it slower than a comparable Win32 (i.e., 32-bit) build: #include "stdafx.h" #include <cmath> #include <intrin.h> #include <iostream> int main() { const int NUM_ITERATIONS=10000000; double expNum=0.00001; double result=0.0; for (double i=0;i<NUM_ITERATIONS;++i) { result+=exp(expNum); // <-- The code of interest is here expNum+=0.00001; } // To prevent…

Unexpected _mm256_shuffle_epi8 with __m256i vectors

走远了吗. submitted on 2019-12-12 04:42:46
Question: I saw this great answer on image conversions using __m128i and thought I'd try AVX2 to see if I could get it any faster. The task is taking an input RGB image and converting it to RGBA (note the other question is BGRA, but that's not really a big difference...). I can include more code if desired, but this stuff gets quite verbose and I'm stuck on something seemingly very simple. Suppose for this code that everything is 32-byte aligned, compiled with -mavx2, etc. Given an…

SSE byte and half word swapping

拥有回忆 submitted on 2019-12-12 01:22:50
Question: I would like to translate this code using SSE intrinsics: for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4) { uint32_t value = *(uint32_t*)src; *(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16); } Is anyone aware of an intrinsic to perform the 16-bit word swap? Answer 1: pshufb (SSSE3) should be faster than two shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion instead of just a word swap. Stealing Paul R's function…

BitScanForward64 issue in Visual Studio 11 Developer Preview

拟墨画扇 submitted on 2019-12-12 01:13:52
Question: I am totally new to writing anything in C. I am writing a helper DLL (to be called from C#) that performs binary manipulation, and I get an 'identifier "BitScanForward64" is undefined' error. The 32-bit version is available. I figure this is because I created a Win32 DLL. It then dawned on me that the 64-bit version may only be available to a specific 64-bit DLL (I assume "General" in the new-project wizard), and that I may need separate 32-bit and 64-bit DLLs. Is this the case, or can I have a…

VC2012: where to find _mm256_pow_pd?

邮差的信 submitted on 2019-12-12 00:57:52
Question: Using Visual Studio 2012/C++: I need to apply gamma correction in my resampler code. From Intel's docs I learned that there should be an intrinsic _mm256_pow_pd(), but I can't find it. Planned use: _mm256_storeu_pd(&destinationData[y*dst4+x], _mm256_pow_pd(akku, _mm256_broadcast_sd(&gamma))); Any ideas where Microsoft has hidden this intrinsic? Answer 1: Obviously I didn't google enough. This SO answer implicitly answers my question: https://stackoverflow.com/a/31515534/2896592 Content: _mm256…