avx

Intel AVX intrinsics: any compatibility library out?

大憨熊 submitted on 2019-12-01 04:11:12
Is there an Intel AVX intrinsics compatibility library available? I'm looking for something similar to the 'sse2mmx.h' header, which falls back to MMX intrinsics if SSE2 integer intrinsics are not available at compile time. With a similar library for AVX, I could write optimized code for new hardware that would still run at almost optimal speed when the AVX extension isn't available. Googling hasn't helped much so far :( Intel provides an AVX emulation header. I haven't tried it, but quoting the linked article "The AVX emulation header file uses intrinsics for the prior Intel instruction set extensions up to Intel SSE4
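The compile-time fallback idea described above can be sketched as follows. This is not Intel's emulation header; it is a minimal illustration, and the helper name add8 is hypothetical. It assumes an x86 target where the compiler defines __AVX__ when AVX code generation is enabled (for example with -mavx) and where SSE is available otherwise.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper (not part of any Intel header): adds eight floats.
// Uses one 256-bit AVX add when the compiler targets AVX, and falls back
// to two 128-bit SSE adds otherwise. The choice is made at compile time.
inline void add8(const float* a, const float* b, float* out) {
#if defined(__AVX__)
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
#else
    for (std::size_t i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
}
```

Callers only see add8, so the same source builds with or without AVX enabled. Unaligned loads and stores are used so the sketch makes no assumption about how the arrays are aligned.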

Issue with __m256 type of intel intrinsics

独自空忆成欢 submitted on 2019-12-01 04:06:42
Question: I'm trying to test some of the Intel intrinsics to see how they work. So I created a function to do that for me, and this is the code: void test_intel_256() { __m256 res,vec1,vec2; __M256_MM_SET_PS(vec1, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0); __M256_MM_SET_PS(vec1, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0); __M256_MM_ADD_PS(res,vec1,vec2); if (res[0] ==9 && res[1] ==9 && res[2] ==9 && res[3] ==9 && res[4] ==9 && res[5] ==9 && res[6] ==9 && res[7] ==9 ) printf("Addition : OK!\n"); else printf(

SIMD minmag and maxmag

旧城冷巷雨未停 submitted on 2019-12-01 03:59:51
I want to implement SIMD minmag and maxmag functions. As far as I understand, these functions are: minmag(a,b) = |a|<|b| ? a : b and maxmag(a,b) = |a|>|b| ? a : b. I want these for float and double, and my target hardware is Haswell. What I really need is code which calculates both. Here is what I have for SSE4.1 for double (the AVX code is almost identical): static inline void maxminmag(__m128d & a, __m128d & b) { __m128d mask = _mm_castsi128_pd(_mm_setr_epi32(-1,0x7FFFFFFF,-1,0x7FFFFFFF)); __m128d aa = _mm_and_pd(a,mask); __m128d ab = _mm_and_pd(b,mask); __m128d cmp = _mm_cmple_pd(ab,aa); __m128d
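Since the excerpt above is cut off, here is a separate minimal sketch that follows the two definitions literally. It is not the asker's code: the name maxminmag_sse2 is hypothetical, and it deliberately uses only SSE2 (and/andnot/or selection) rather than SSE4.1 blends so it builds on any x86-64 compiler. On ties (|a| == |b|) both definitions pick b, so both outputs equal b in that lane.

```cpp
#include <emmintrin.h>  // SSE2

// Hypothetical helper: on return, a holds maxmag(a, b) and b holds
// minmag(a, b), lane by lane, using
//   maxmag(a, b) = |a| > |b| ? a : b
//   minmag(a, b) = |a| < |b| ? a : b
static inline void maxminmag_sse2(__m128d& a, __m128d& b) {
    const __m128d sign = _mm_set1_pd(-0.0);      // only the sign bit set
    const __m128d abs_a = _mm_andnot_pd(sign, a); // clear sign bit: |a|
    const __m128d abs_b = _mm_andnot_pd(sign, b); // clear sign bit: |b|

    const __m128d a_gt = _mm_cmpgt_pd(abs_a, abs_b); // all-ones where |a| > |b|
    const __m128d a_lt = _mm_cmplt_pd(abs_a, abs_b); // all-ones where |a| < |b|

    // Bitwise select: (mask & a) | (~mask & b).
    const __m128d maxv = _mm_or_pd(_mm_and_pd(a_gt, a), _mm_andnot_pd(a_gt, b));
    const __m128d minv = _mm_or_pd(_mm_and_pd(a_lt, a), _mm_andnot_pd(a_lt, b));

    a = maxv;
    b = minv;
}
```

For example, with a = (1, -3) and b = (-5, 2) in lanes (low, high), the function leaves a = (-5, -3) and b = (1, 2). On SSE4.1 or AVX hardware the and/andnot/or triple could be replaced by a blend instruction; that variant is not shown here.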

Intel AVX intrinsics: any compatibility library out?

对着背影说爱祢 submitted on 2019-12-01 02:22:01
Question: Is there an Intel AVX intrinsics compatibility library available? I'm looking for something similar to the 'sse2mmx.h' header, which falls back to MMX intrinsics if SSE2 integer intrinsics are not available at compile time. With a similar library for AVX, I could write optimized code for new hardware that would still run at almost optimal speed when the AVX extension isn't available. Googling hasn't helped much so far :( Answer 1: Intel provides an AVX emulation header. I haven't tried it, but quoting the linked article "The

Does Xcode 4 have support for AVX?

爷,独闯天下 submitted on 2019-12-01 02:01:09
Before I spend time and money downloading Xcode 4, can anyone tell me whether it comes with a version of gcc (or any other compiler, e.g. LLVM) that supports the AVX instruction set on Sandy Bridge CPUs (i.e. gcc -mavx on mainstream gcc builds)? I don't see any public release notes anywhere, so it's not easy to check, and I don't really need Xcode 4 yet unless it has AVX support. I eventually cracked and downloaded Xcode 4. It looks like clang is the only compiler that may support AVX currently, although I haven't tested it properly: $ clang -dM -E -mavx - < /dev/null | grep -i avx #define

SIMD minmag and maxmag

徘徊边缘 submitted on 2019-12-01 01:49:07
Question: I want to implement SIMD minmag and maxmag functions. As far as I understand, these functions are: minmag(a,b) = |a|<|b| ? a : b and maxmag(a,b) = |a|>|b| ? a : b. I want these for float and double, and my target hardware is Haswell. What I really need is code which calculates both. Here is what I have for SSE4.1 for double (the AVX code is almost identical): static inline void maxminmag(__m128d & a, __m128d & b) { __m128d mask = _mm_castsi128_pd(_mm_setr_epi32(-1,0x7FFFFFFF,-1,0x7FFFFFFF)); __m128d

Finding lists of prime numbers with SIMD - SSE/AVX

自作多情 submitted on 2019-11-30 23:19:46
I'm curious whether anyone has advice on how to use SIMD to find lists of prime numbers. In particular, I'm interested in how to do this with SSE/AVX. The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to do division which works well for a vector divided by a scalar: "Division by Invariant Integers Using Multiplication" http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556 Each time I find a prime I form the results to do a fast division and save them. Then the next time I do the

How to align stack at 32 byte boundary in GCC?

不打扰是莪最后的温柔 submitted on 2019-11-30 19:37:22
I'm using a MinGW64 build based on GCC 4.6.1 for a 64-bit Windows target. I'm playing around with Intel's new AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx. But I started running into segmentation faults when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d values around, and these instructions require 32-byte alignment. However, the stack on 64-bit Windows has only 16-byte alignment. How can I change GCC's stack alignment to 32 bytes? I have tried using -mstackrealign but to no
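One way to check whether a given build suffers from the alignment problem described above is to ask for a 32-byte-aligned local and inspect its address at run time. The sketch below is only a diagnostic, not a fix, and the function name local_is_32_byte_aligned is hypothetical. It assumes a C++11 compiler. The address is routed through a volatile variable so the compiler is less likely to fold the check to a constant based on the alignment it assumes.

```cpp
#include <cstdint>

// Hypothetical diagnostic: returns true if a local declared with
// alignas(32) really lands on a 32-byte boundary in this build.
inline bool local_is_32_byte_aligned() {
    alignas(32) float buffer[8] = {};
    // Store the address in a volatile so the check uses the actual
    // run-time value rather than the alignment the compiler assumes.
    volatile std::uintptr_t address =
        reinterpret_cast<std::uintptr_t>(&buffer[0]);
    return address % 32 == 0;
}
```

If this returns false, aligned 256-bit moves on such locals can fault. In that case, unaligned loads and stores (_mm256_loadu_ps, _mm256_storeu_ps) on explicitly managed buffers are a possible way to sidestep the requirement, at some cost in generated code.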

Integer dot product using SSE/AVX?

我只是一个虾纸丫 submitted on 2019-11-30 18:44:30
Question: I am looking at the Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and whilst it has _mm_dp_ps and _mm_dp_pd for calculating the dot product of floats and doubles, I cannot see anything for calculating an integer dot product. I have two unsigned int[8] arrays and I would like to compute (a[0] * b[0]) + (a[1] * b[1]) + ... + (a[num_elements_in_array-1] * b[num_elements_in_array-1]) (in batches of four) and sum the products. Answer 1: Every time someone does this:
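As an illustration of the kind of computation the question asks for (this is not the truncated answer above), here is a minimal SSE2 sketch. The helper name dot_u32 is hypothetical. It assumes the products should be accumulated in 64 bits so that individual 32-bit by 32-bit products cannot overflow; the running sum itself still wraps modulo 2^64.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Hypothetical helper: dot product of two uint32_t arrays, processed in
// batches of four elements, with 64-bit products and a 64-bit sum.
inline std::uint64_t dot_u32(const std::uint32_t* a,
                             const std::uint32_t* b,
                             std::size_t n) {
    __m128i acc = _mm_setzero_si128();  // two 64-bit partial sums
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // _mm_mul_epu32 multiplies 32-bit elements 0 and 2, giving two
        // full 64-bit products.
        __m128i even = _mm_mul_epu32(va, vb);
        // Move elements 1 and 3 into positions 0 and 2, then multiply again.
        __m128i odd = _mm_mul_epu32(_mm_srli_epi64(va, 32),
                                    _mm_srli_epi64(vb, 32));
        acc = _mm_add_epi64(acc, _mm_add_epi64(even, odd));
    }
    // Horizontal sum of the two 64-bit lanes.
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    std::uint64_t sum = lanes[0] + lanes[1];
    // Scalar tail for lengths that are not a multiple of four.
    for (; i < n; ++i) {
        sum += static_cast<std::uint64_t>(a[i]) * b[i];
    }
    return sum;
}
```

If the products are known to fit in 32 bits, a variant built on _mm_mullo_epi32 (SSE4.1) would need fewer instructions per batch; this sketch stays with SSE2 so it builds on any x86-64 target.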

8 bit shift operation in AVX2 with shifting in zeros

爱⌒轻易说出口 submitted on 2019-11-30 18:20:59
Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift an __m256i register by x bytes? _mm256_slli_si256 seems to just execute two _mm_slli_si128 operations, one on a[127:0] and one on a[255:128]. The left shift should work on a __m256i like this: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0] I saw in a thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit elements. But I need a more generic solution to shift by x bytes. Does anybody already have a solution for this problem? Okay, I implemented a function that can shift left by up to 16 bytes.
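The poster's function is not included in the excerpt, so the following is an independent sketch of one way a full-width byte shift of up to 16 bytes could be built, not a reconstruction of it. The name shift_left_bytes_256 is hypothetical. It assumes the _mm_slli_si128 convention: bytes move toward higher indices and zeros enter at byte 0, which may be the mirror image of the element listing in the question depending on how that listing orders the bytes. The AVX2 path is used only when the compiler defines __AVX2__ (for example with -mavx2); otherwise a scalar fallback with the same result is compiled.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: shifts a 32-byte block by N bytes (0 <= N <= 16)
// across the 128-bit lane boundary, so dst[i + N] = src[i] and
// dst[0 .. N-1] = 0. N is a template parameter because the underlying
// instruction needs a compile-time immediate.
template <int N>
inline void shift_left_bytes_256(const std::uint8_t* src, std::uint8_t* dst) {
    static_assert(N >= 0 && N <= 16, "this sketch handles shifts of 0..16 bytes");
#if defined(__AVX2__)
    __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
    // t: low lane is zero, high lane is a copy of the low lane of a.
    __m256i t = _mm256_permute2x128_si256(a, a, 0x08);
    // Per lane, concatenate (a : t) and take 16 bytes starting at offset
    // 16 - N, which pulls the top N bytes of t in below the bytes of a.
    __m256i r = _mm256_alignr_epi8(a, t, 16 - N);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), r);
#else
    // Scalar fallback with the same result.
    std::uint8_t tmp[32] = {};
    std::memcpy(tmp + N, src, 32 - N);
    std::memcpy(dst, tmp, 32);
#endif
}
```

A call such as shift_left_bytes_256<3>(in, out) moves in[0] to out[3] and carries in[13..15] across the lane boundary into out[16..18], which is exactly the part that _mm256_slli_si256 alone does not do.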