avx

When using a mask register with AVX-512 loads and stores, is a fault raised for invalid accesses to masked-out elements?

大兔子大兔子 submitted on 2019-12-01 15:53:11
Question: When I do a write-masked AVX-512 store, like so: vmovdqu8 [rsi] {k1}, zmm0 — will the instruction fault if some portion of the memory accessed at [rsi, rsi + 63] is not mapped, but the writemask is zero for all those locations (i.e., the data is not actually modified because of the mask)? Another way of asking: do these AVX-512 masked stores have a fault-suppression ability similar to the vmaskmov instructions introduced in AVX? Answer 1: No fault is raised if masked-out elements touch invalid memory.

How to get data out of AVX registers?

核能气质少年 submitted on 2019-12-01 15:36:10
Using MSVC 2013 and AVX 1, I've got 8 floats in a register: __m256 foo = _mm256_fmadd_ps(a,b,c); Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrinsics would make this rather complicated: print(_castu32_f32(_mm256_extract_epi32(foo, 0))); print(_castu32_f32(_mm256_extract_epi32(foo, 1))); print(_castu32_f32(_mm256_extract_epi32(foo, 2))); // ... but MSVC doesn't even have either of these two intrinsics. Sure, I could write the values back to memory and load from there, but I suspect that at the assembly level there's no need to spill a register.

Automatically generate FMA instructions in MSVC

我们两清 submitted on 2019-12-01 15:34:33
Question: MSVC has supported AVX/AVX2 instructions for years now and, according to this MSDN blog post, it can automatically generate fused multiply-add (FMA) instructions. Yet neither of the following functions compiles to an FMA instruction: float func1(float x, float y, float z) { return x * y + z; } float func2(float x, float y, float z) { return std::fma(x,y,z); } Even worse, std::fma is not implemented as a single FMA instruction; it performs terribly, much slower than a plain x * y + z.

Using SIMD on amd64, when is it better to use more instructions vs. loading from memory?

≡放荡痞女 submitted on 2019-12-01 15:21:24
I have some highly performance-sensitive code. A SIMD implementation using SSEn and AVX uses about 30 instructions, while a version that uses a 4096-byte lookup table uses about 8 instructions. In a microbenchmark, the lookup table is faster by 40%. If I microbenchmark while trying to invalidate the cache every 100 iterations, they appear about the same. In my real program, it appears that the non-loading version is faster, but it's really hard to get a provably good measurement, and I've had measurements go both ways. I'm just wondering if there are some good ways to think about which one would be better.

What is the minimum version of OS X for use with AVX/AVX2?

坚强是说给别人听的谎言 submitted on 2019-12-01 11:26:16
I have an image-drawing routine which is compiled multiple times for SSE, SSE2, SSE3, SSE4.1, SSE4.2, AVX and AVX2. My program dynamically dispatches to one of these binary variations by checking CPUID flags. On Windows, I check the version of Windows and disable AVX/AVX2 dispatch if the OS doesn't support them. (For example, only Windows 7 SP1 or later supports AVX/AVX2.) I want to do the same thing on Mac OS X, but I'm not sure what version of OS X supports AVX/AVX2. Note that what I want to know is the minimum version of OS X for use with AVX/AVX2, not which machine models are capable of AVX.

SSE: loading ints into __m128

不羁的心 submitted on 2019-12-01 09:24:53
What are the GCC intrinsics for loading 4 ints into an __m128 and 8 ints into an __m256 (aligned/unaligned)? What about unsigned ints? Mysticial: Using Intel's SSE intrinsics, the ones you're looking for are: _mm_load_si128() _mm_loadu_si128() _mm256_load_si256() _mm256_loadu_si256() Documentation: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_load_si128 https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_load_si256 There's no distinction between signed and unsigned. You'll need to cast the pointer to __m128i* or __m256i*.

set individual bit in AVX register (__m256i), need “random access” operator

亡梦爱人 submitted on 2019-12-01 06:21:57
Question: So, I want to set an individual bit of a __m256i register. Say my __m256i contains: [ 1 0 1 0 | 1 0 1 0 | ... | 1 0 1 0 ] — how do I set and unset the n-th bit? Answer 1: This is an implementation of a function which can set an individual bit inside a vector: #include <immintrin.h> #include <assert.h> #include <stdint.h> void SetBit(__m256i & vector, size_t position, bool value) { assert(position <= 255); uint8_t lut[32] = { 0 }; lut[position >> 3] = 1 << (position & 7); __m256i mask = _mm256_loadu_si256((__m256i*)lut); if (value) vector = _mm256_or_si256(vector, mask); else vector = _mm256_andnot_si256(mask, vector); }

Issue with __m256 type of Intel intrinsics

▼魔方 西西 submitted on 2019-12-01 05:43:24
I'm trying to test some of the Intel intrinsics to see how they work, so I created a function to do that for me. This is the code: void test_intel_256() { __m256 res,vec1,vec2; __M256_MM_SET_PS(vec1, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0); __M256_MM_SET_PS(vec1, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0); __M256_MM_ADD_PS(res,vec1,vec2); if (res[0] ==9 && res[1] ==9 && res[2] ==9 && res[3] ==9 && res[4] ==9 && res[5] ==9 && res[6] ==9 && res[7] ==9 ) printf("Addition : OK!\n"); else printf("Addition : FAILED!\n"); } But then I'm getting these errors: error: unknown type name ‘__m256’

implicit SIMD (SSE/AVX) broadcasts with GCC

江枫思渺然 submitted on 2019-12-01 05:12:21
I have managed to convert most of my SIMD code to use the vector extensions of GCC. However, I have not found a good solution for doing a broadcast as follows: __m256 areg0 = _mm256_broadcast_ss(&a[i]); I want to do: __m256 areg0 = a[i]; If you see my answer at "Mutiplying vector by constant using SSE", I managed to get broadcasts working with another SIMD register. The following works: __m256 x,y; y = x + 3.14159f; // broadcast x + 3.14159 y = 3.14159f*x; // broadcast 3.14159*x but this won't work: __m256 x; x = 3.14159f; //should broadcast 3.14159 but does not work How can I do this with GCC?