avx | 易学教程

Intel AVX intrinsics: any compatibility library out?

Question: Is there an Intel AVX intrinsics compatibility library available? I'm looking for something similar to the 'sse2mmx.h' header, which falls back to MMX intrinsics if SSE2 integer intrinsics are not available at compile time. With a similar library for AVX, I could write optimized code for new hardware that would still run at almost optimal speed when the AVX extension isn't available. Googling hasn't helped much so far :(

Answer 1: Intel provides an AVX emulation header. I haven't tried it, but quoting the linked article: "The AVX emulation header file uses intrinsics for the prior Intel instruction set extensions up to Intel SSE4
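
The compile-time fallback idea the question describes can be sketched in a few lines. This is only an illustration, not Intel's emulation header: the helper name add_f32x8 is hypothetical, and the sketch assumes a GCC- or Clang-style compiler that defines __AVX__ and __SSE__ when those instruction sets are enabled.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>   /* 256-bit AVX intrinsics */
#elif defined(__SSE__)
#include <xmmintrin.h>   /* 128-bit SSE intrinsics */
#endif

/* Hypothetical helper (not part of any Intel header): adds eight floats,
 * using the widest instruction set that was enabled at compile time. */
void add_f32x8(const float *a, const float *b, float *out)
{
#if defined(__AVX__)
    /* One 256-bit add covers all eight lanes. */
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
#elif defined(__SSE__)
    /* Two 128-bit adds emulate the 256-bit operation. */
    _mm_storeu_ps(out,     _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b)));
    _mm_storeu_ps(out + 4, _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4)));
#else
    /* Plain scalar fallback when no SIMD extension is enabled. */
    for (size_t i = 0; i < 8; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Callers see the same function either way; only the build flags (for example -mavx) decide which branch is compiled.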

Issue with __m256 type of intel intrinsics

Question: I'm trying to test some of the Intel intrinsics to see how they work. So I created a function to do that for me, and this is the code:

    void test_intel_256() {
        __m256 res, vec1, vec2;
        __M256_MM_SET_PS(vec1, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0);
        __M256_MM_SET_PS(vec1, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0);
        __M256_MM_ADD_PS(res, vec1, vec2);
        if (res[0] == 9 && res[1] == 9 && res[2] == 9 && res[3] == 9 &&
            res[4] == 9 && res[5] == 9 && res[6] == 9 && res[7] == 9)
            printf("Addition : OK!\n");
        else
            printf(
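
Two things stand out in the excerpt: vec1 is assigned twice while vec2 is never set, and the lanes are read with res[i], which not every compiler accepts on a __m256. Below is a hedged sketch of the same test. It swaps the __M256_MM_* macros (whose definitions are not shown) for the standard intrinsics _mm256_set1_ps, _mm256_add_ps and _mm256_storeu_ps, and copies the result into a plain array before comparing. The function name test_add_256 and the scalar branch for builds without AVX are my additions.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Returns 1 when every lane of 7.0f + 2.0f equals 9.0f, otherwise 0. */
int test_add_256(void)
{
    float lanes[8];
#if defined(__AVX__)
    __m256 vec1 = _mm256_set1_ps(7.0f);   /* all eight lanes = 7 */
    __m256 vec2 = _mm256_set1_ps(2.0f);   /* all eight lanes = 2 */
    __m256 res  = _mm256_add_ps(vec1, vec2);
    /* Store to memory instead of indexing the vector type directly. */
    _mm256_storeu_ps(lanes, res);
#else
    /* Scalar stand-in so the sketch still builds without -mavx. */
    for (size_t i = 0; i < 8; i++)
        lanes[i] = 7.0f + 2.0f;
#endif
    for (size_t i = 0; i < 8; i++)
        if (lanes[i] != 9.0f)
            return 0;
    return 1;
}
```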

SIMD minmag and maxmag

I want to implement SIMD minmag and maxmag functions. As far as I understand, these functions are

    minmag(a,b) = |a|<|b| ? a : b
    maxmag(a,b) = |a|>|b| ? a : b

I want these for float and double, and my target hardware is Haswell. What I really need is code which calculates both. Here is what I have for SSE4.1 for double (the AVX code is almost identical):

    static inline void maxminmag(__m128d & a, __m128d & b) {
        __m128d mask = _mm_castsi128_pd(_mm_setr_epi32(-1,0x7FFFFFFF,-1,0x7FFFFFFF));
        __m128d aa = _mm_and_pd(a,mask);
        __m128d ab = _mm_and_pd(b,mask);
        __m128d cmp = _mm_cmple_pd(ab,aa);
        __m128d
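
A scalar reference is handy for checking any vectorized version against the two definitions above. The sketch below is not the asker's SIMD code. It clears the sign bit with the same 0x7FFF... mask idea, applied to one double at a time. The names ending in _ref are hypothetical, and because the excerpt is cut off, the output order of maxminmag_ref (a receives maxmag, b receives minmag) is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Clears the IEEE-754 sign bit, the scalar analogue of and-ing with the mask. */
static double clear_sign(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* minmag(a,b) = |a| < |b| ? a : b */
double minmag_ref(double a, double b)
{
    return clear_sign(a) < clear_sign(b) ? a : b;
}

/* maxmag(a,b) = |a| > |b| ? a : b */
double maxmag_ref(double a, double b)
{
    return clear_sign(a) > clear_sign(b) ? a : b;
}

/* Assumed convention: on return, *a holds maxmag and *b holds minmag. */
void maxminmag_ref(double *a, double *b)
{
    double hi = maxmag_ref(*a, *b);
    double lo = minmag_ref(*a, *b);
    *a = hi;
    *b = lo;
}
```

With these definitions, equal magnitudes make both functions return b, which is worth keeping in mind when comparing against a compare-and-blend implementation.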

Does Xcode 4 have support for AVX?

Question: Before I spend time and money downloading Xcode 4, can anyone tell me whether it comes with a version of gcc (or any other compiler, e.g. LLVM) which supports the AVX instruction set on Sandy Bridge CPUs (i.e. gcc -mavx on mainstream gcc builds)? I don't see any public release notes anywhere, so it's not easy to check, and I don't really need Xcode 4 yet unless it has AVX support.

Answer 1: I eventually cracked and downloaded Xcode 4. It looks like clang is the only compiler that may support AVX currently, although I haven't tested it properly:

    $ clang -dM -E -mavx - < /dev/null | grep -i avx
    #define

Finding lists of prime numbers with SIMD - SSE/AVX

I'm curious if anyone has advice on how to use SIMD to find lists of prime numbers. I'm particularly interested in how to do this with SSE/AVX. The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to do division that works well for vector/scalar operands, described in "Division by Invariant Integers Using Multiplication" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556). Each time I find a prime I form the results to do a fast division and save them. Then the next time I do the
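
The "precompute a constant per prime, then test with a multiply" idea can be sketched in scalar C. Note the swap: this is not the Granlund-Montgomery division from the linked paper but a related multiply-and-compare divisibility test (the one used in Lemire's fastmod), which needs only 64-bit arithmetic. All names here are hypothetical, and a SIMD version would apply the same test to several candidates at once.

```c
#include <stddef.h>
#include <stdint.h>

/* A prime together with its precomputed divisibility constant. */
typedef struct {
    uint32_t p;
    uint64_t magic;   /* floor((2^64 - 1) / p) + 1 */
} prime_entry;

static uint64_t divisor_magic(uint32_t d)
{
    return UINT64_MAX / d + 1;
}

/* True when d divides n, where magic == divisor_magic(d).
 * The product intentionally wraps modulo 2^64. */
static int divides(uint32_t n, uint64_t magic)
{
    return (uint64_t)n * magic <= magic - 1;
}

/* Trial division over the primes found so far; stores at most cap primes
 * that are <= limit in out and returns how many were stored. */
size_t primes_by_trial_division(uint32_t limit, prime_entry *out, size_t cap)
{
    size_t count = 0;
    for (uint64_t n = 2; n <= limit && count < cap; n++) {
        int is_prime = 1;
        for (size_t i = 0; i < count; i++) {
            uint64_t p = out[i].p;
            if (p * p > n)
                break;                      /* no divisor up to sqrt(n) */
            if (divides((uint32_t)n, out[i].magic)) {
                is_prime = 0;
                break;
            }
        }
        if (is_prime) {
            out[count].p = (uint32_t)n;
            out[count].magic = divisor_magic((uint32_t)n);
            count++;
        }
    }
    return count;
}
```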

How to align stack at 32 byte boundary in GCC?

I'm using a MinGW64 build based on GCC 4.6.1 for a 64-bit Windows target. I'm playing around with Intel's new AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx. But I started running into segmentation faults when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for 64-bit Windows has only 16-byte alignment. How can I change GCC's stack alignment to 32 bytes? I have tried using -mstackrealign but to no
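
One way to sidestep the compiler's stack alignment, independent of any flag, is to over-allocate a buffer and round the pointer up to a 32-byte boundary by hand. This is only a sketch of that workaround, not an answer to how GCC itself should be configured. The helper name align_up is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds p up to the next multiple of alignment.
 * Assumption: alignment is a power of two (32 for __m256 data). */
void *align_up(void *p, size_t alignment)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)alignment - 1u;
    return (void *)((addr + mask) & ~mask);
}
```

A typical use would be to request size + 31 bytes from malloc, pass the result through align_up(raw, 32), and keep the original pointer around for free. Unaligned load and store intrinsics such as _mm256_loadu_ps are another option when the data cannot be realigned.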

Integer dot product using SSE/AVX?

Question: I am looking at the Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and whilst it has _mm_dp_ps and _mm_dp_pd for calculating the dot product of floats and doubles, I cannot see anything for calculating an integer dot product. I have two unsigned int[8] arrays and I would like to compute (a[0] * b[0]) + (a[1] * b[1]) + ... + (a[num_elements_in_array-1] * b[num_elements_in_array-1]), multiplying in batches of four and summing the products.

Answer 1: Every time someone does this:
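
Since the answer is cut off, here is one possible approach rather than the answerer's: a sketch that multiplies in batches of four with the SSE2 intrinsic _mm_mul_epu32, which produces full 64-bit products from the even-indexed 32-bit lanes. The function name dot_u32 is hypothetical, the sum wraps modulo 2^64 if it overflows, and a scalar loop handles both the tail and builds without SSE2.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Dot product of two uint32_t arrays, accumulated in 64 bits. */
uint64_t dot_u32(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();          /* two 64-bit partial sums */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Products of elements 0 and 2, each widened to 64 bits. */
        __m128i even = _mm_mul_epu32(va, vb);
        /* Shift elements 1 and 3 into the even slots, then multiply. */
        __m128i odd = _mm_mul_epu32(_mm_srli_si128(va, 4),
                                    _mm_srli_si128(vb, 4));
        acc = _mm_add_epi64(acc, _mm_add_epi64(even, odd));
    }
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total = lanes[0] + lanes[1];
#endif
    /* Scalar tail (and the whole loop when SSE2 is not enabled). */
    for (; i < n; i++)
        total += (uint64_t)a[i] * b[i];
    return total;
}
```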

8 bit shift operation in AVX2 with shifting in zeros

Question: Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift an __m256i register by x bytes? _mm256_slli_si256 seems to just execute two _mm_slli_si128 operations, on a[127:0] and a[255:128]. The left shift should work on an __m256i like this: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0]. I saw in a thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit elements, but I need a more generic solution to shift by x bytes. Does anybody already have a solution for this problem?

Answer 1: Okay, I implemented a function that can shift left by up to 16 bytes.
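
The difference between the per-lane shift and the full-width shift the question asks for is easy to see in a scalar model on 32-byte arrays. The sketch below is not an AVX2 implementation and not the answerer's function. Both function names are hypothetical, and index 0 is assumed to be the least significant byte, so a "left" shift moves bytes toward higher indices and fills the low end with zeros.

```c
#include <stdint.h>
#include <string.h>

/* Models the per-lane behaviour described for _mm256_slli_si256: each
 * 16-byte half is shifted on its own, and bytes never cross the lane
 * boundary. in and out must not overlap. */
void shl_bytes_per_lane(const uint8_t in[32], uint8_t out[32], unsigned n)
{
    memset(out, 0, 32);
    if (n > 15)
        return;                             /* everything shifted out */
    for (unsigned lane = 0; lane < 2; lane++)
        for (unsigned j = n; j < 16; j++)
            out[lane * 16 + j] = in[lane * 16 + j - n];
}

/* Models the desired full-width shift: bytes cross from the low half
 * into the high half. in and out must not overlap. */
void shl_bytes_full(const uint8_t in[32], uint8_t out[32], unsigned n)
{
    memset(out, 0, 32);
    for (unsigned i = n; i < 32; i++)
        out[i] = in[i - n];
}
```

With in[i] = i + 1 and n = 1, the per-lane model leaves out[16] at zero, while the full-width model carries in[15] into out[16]. That carried byte is exactly what a generic AVX2 solution has to move across the 128-bit boundary.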