avx | 易学教程

L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes


I want to achieve the maximum bandwidth for the following operation on Intel processors:

    for(int i=0; i<n; i++) z[i] = x[i] + y[i]; //n=2048

where x, y, and z are float arrays. I am doing this on Haswell, Ivy Bridge, and Westmere systems. I originally allocated the memory like this:

    char *a = (char*)_mm_malloc(sizeof(float)*n, 64);
    char *b = (char*)_mm_malloc(sizeof(float)*n, 64);
    char *c = (char*)_mm_malloc(sizeof(float)*n, 64);
    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

When I did this I got about 50% of the peak bandwidth I expected for each system. The peak values
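As a minimal sketch of the loop the question is timing, the version below processes four floats per iteration with SSE. The function name `add_arrays_sse` is hypothetical, and the sketch assumes that `n` is a multiple of 4 and that all three arrays are at least 16-byte aligned (the 64-byte `_mm_malloc` allocations in the question satisfy this). SSE is used only so the sketch builds without extra compiler flags; an AVX variant would use `_mm256_load_ps`, `_mm256_add_ps`, and `_mm256_store_ps` with 32-byte alignment.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: z[i] = x[i] + y[i] for i in [0, n).
 * Assumes n is a multiple of 4 and x, y, z are 16-byte aligned. */
void add_arrays_sse(const float *x, const float *y, float *z, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);         /* aligned load of 4 floats */
        __m128 vy = _mm_load_ps(y + i);
        _mm_store_ps(z + i, _mm_add_ps(vx, vy)); /* aligned store of the sums */
    }
}
```

Each iteration issues two loads and one store, which is why the relative placement of `x`, `y`, and `z` in memory can matter for a loop like this.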

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros


Question: I want to shift SSE/AVX registers by multiples of 32 bits left or right while shifting in zeros. Let me be more precise about the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats:

    shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
    shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2]

For AVX I want to do the following shifts:

    shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7]
    shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6]
    shift3_AVX: [1, 2, 3,
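For the two SSE cases, one possible approach (a sketch, not necessarily the answer given in the original thread) is the SSE2 whole-register byte shift `_mm_slli_si128`, which shifts in zeros. The function names `shift1_sse` and `shift2_sse` are hypothetical, and the sketch assumes the bracket notation lists elements in memory order, lowest element first.

```c
#include <emmintrin.h>

/* Hypothetical helper: [a, b, c, d] -> [0, a, b, c] (elements in memory order). */
void shift1_sse(const float in[4], float out[4])
{
    __m128i v = _mm_castps_si128(_mm_loadu_ps(in));
    v = _mm_slli_si128(v, 4);                 /* move every float up one slot */
    _mm_storeu_ps(out, _mm_castsi128_ps(v));
}

/* Hypothetical helper: [a, b, c, d] -> [0, 0, a, b]. */
void shift2_sse(const float in[4], float out[4])
{
    __m128i v = _mm_castps_si128(_mm_loadu_ps(in));
    v = _mm_slli_si128(v, 8);                 /* move every float up two slots */
    _mm_storeu_ps(out, _mm_castsi128_ps(v));
}
```

The casts between `__m128` and `__m128i` only reinterpret the bits and generate no instructions. The 256-bit case is harder because the AVX2 counterpart `_mm256_slli_si256` shifts each 128-bit lane separately, so it does not carry elements across the lane boundary.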

practical BigNum AVX/SSE possible?


SSE/AVX registers could be viewed as integer or floating point BigNums. That is, one could neglect that there exist lanes at all. Does there exist an easy way to exploit this point of view and use these registers as BigNums either singly or combined? I ask because from what little I've seen of BigNum libraries, they almost universally store and do arithmetic on arrays, not on SSE/AVX registers. Portability? Example: Say you store the contents of an SSE register as a key in a std::set; you could compare these contents as a BigNum. I think it may be possible to implement BigNum with SIMD

SIMD math libraries for SSE and AVX


Question: I am looking for SIMD math libraries (preferably open source) for SSE and AVX. For example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once. AMD has a proprietary library, LibM http://developer.amd.com/tools/cpu-development/libm/ which has some SIMD math functions, but LibM only uses AVX if it detects FMA4, which Intel CPUs don't have. Also I'm not sure it fully uses AVX, as all the function names end in s4 (d2) and not s8 (d4). It

Loading 8 chars from memory into an __m256 variable as packed single precision floats


I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?

    // unsigned char *new_image is loaded with data
    ...
    float buffer[8];
    buffer[x] = new_image[x];
    buffer[x + 1] = new_image[x + 1];
    buffer[x + 2] = new_image[x + 2];
    buffer[x + 3] = new_image[x + 3];
    buffer[x + 4] = new_image[x + 4];
    buffer[x + 5] = new_image[x + 5];
    buffer[x + 6] = new_image[x + 6];
    buffer[x + 7] = new_image[x + 7];
    // buffer is then used for further
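One way to do this on AVX2 hardware, sketched here under assumptions rather than taken from the thread, is to load the 8 bytes into the low half of an XMM register, zero-extend them to eight 32-bit integers with `_mm256_cvtepu8_epi32`, and convert with `_mm256_cvtepi32_ps`. The function names are hypothetical. The `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, used so the sketch compiles without `-mavx2` and falls back to scalar code on CPUs without AVX2.

```c
#include <immintrin.h>

/* AVX2 path (hypothetical helper): 8 unsigned chars -> 8 floats. */
__attribute__((target("avx2")))
static void u8x8_to_float_avx2(const unsigned char *src, float *dst)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src); /* 8 bytes into the low 64 bits */
    __m256i ints  = _mm256_cvtepu8_epi32(bytes);           /* zero-extend to 8 x int32 */
    _mm256_storeu_ps(dst, _mm256_cvtepi32_ps(ints));       /* int32 -> float */
}

/* Scalar fallback, equivalent to the buffer[] assignments in the question. */
static void u8x8_to_float_scalar(const unsigned char *src, float *dst)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (float)src[i];
}

/* Hypothetical entry point: picks the AVX2 path when the CPU supports it. */
void u8x8_to_float(const unsigned char *src, float *dst)
{
    if (__builtin_cpu_supports("avx2"))
        u8x8_to_float_avx2(src, dst);
    else
        u8x8_to_float_scalar(src, dst);
}
```

In a real blur kernel the `__m256` result would stay in a register instead of being stored to `dst`; the store is here only so the result can be inspected.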

How to efficiently perform double/int64 conversions with SSE/AVX?


SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers:

    _mm_cvtps_epi32()
    _mm_cvtepi32_ps()

But there are no equivalents for double-precision and 64-bit integers. In other words, these are missing:

    _mm_cvtpd_epi64()
    _mm_cvtepi64_pd()

It seems that AVX doesn't have them either. What is the most efficient way to simulate these intrinsics? If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions: If you don't care about infinity or NaN. For double <-> int64_t, you only care about values in the range [-2^51, 2^51
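A widely used two-instruction trick for this restricted range adds the constant 2^52 + 2^51 so that the integer value lands in the low mantissa bits, then subtracts the constant's bit pattern. The sketch below assumes that is the technique the excerpt refers to; the function and macro names are hypothetical, inputs are assumed to lie in [-2^51, 2^51], and non-integer doubles are rounded according to the current rounding mode.

```c
#include <emmintrin.h>
#include <stdint.h>

#define CVT_MAGIC 6755399441055744.0 /* 2^52 + 2^51, exactly representable */

/* Hypothetical helper: two doubles in [-2^51, 2^51] -> two int64_t. */
void double2_to_int64(const double in[2], int64_t out[2])
{
    __m128d magic = _mm_set1_pd(CVT_MAGIC);
    __m128d x = _mm_add_pd(_mm_loadu_pd(in), magic);   /* value now sits in the mantissa */
    __m128i r = _mm_sub_epi64(_mm_castpd_si128(x),     /* remove the constant's bits */
                              _mm_castpd_si128(magic));
    _mm_storeu_si128((__m128i *)out, r);
}

/* Hypothetical helper: two int64_t in [-2^51, 2^51] -> two doubles. */
void int64x2_to_double(const int64_t in[2], double out[2])
{
    __m128d magic = _mm_set1_pd(CVT_MAGIC);
    __m128i x = _mm_add_epi64(_mm_loadu_si128((const __m128i *)in),
                              _mm_castpd_si128(magic)); /* build the bits of magic + n */
    _mm_storeu_pd(out, _mm_sub_pd(_mm_castsi128_pd(x), magic));
}
```

The loads and stores are only there to make the helpers callable from plain C; the conversion itself is the add/subtract pair in each function.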

Code alignment in one object file is affecting the performance of a function in another object file


Question: I'm familiar with data alignment and performance, but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing performance using code alignment. As far as I can tell, NASM inserts nop instructions to achieve code alignment. Here is a function I have been trying this with on an Ivy Bridge system:

    void triad(float *x, float *y, float *z, int n, int repeat) {
        float k = 3.14159f;
        for(int r=0; r<repeat; r++) {
            for(int i=0; i<n; i++) {
                z[i] = x[i]

How to sum __m256 horizontally?


Question: I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use _mm_hadd_ps(xmm,xmm); _mm_hadd_ps(xmm,xmm); to get the result in the first component of the vector, but this does not scale to the 256-bit version of the function (_mm256_hadd_ps). What is the best way to compute the horizontal sum of a __m256 vector?

Answer 1: This version should be optimal for both Intel Sandy/Ivy Bridge and AMD Bulldozer:

    // x = ( x7, x6, x5, x4, x3, x2, x1, x0 )
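The answer's code is cut off in this excerpt, so the following is only a sketch of one common way to do the reduction, not a reconstruction of that answer: add the upper 128-bit half to the lower half, then halve the vector twice more. The function names are hypothetical. The `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, used so the sketch compiles without `-mavx` and falls back to scalar code on CPUs without AVX.

```c
#include <immintrin.h>

/* AVX path (hypothetical helper): sum of the 8 floats at p. */
__attribute__((target("avx")))
static float hsum8_avx(const float *p)
{
    __m256 x  = _mm256_loadu_ps(p);
    __m128 hi = _mm256_extractf128_ps(x, 1);               /* x7 x6 x5 x4 */
    __m128 lo = _mm256_castps256_ps128(x);                 /* x3 x2 x1 x0 */
    __m128 q  = _mm_add_ps(lo, hi);                        /* 4 partial sums */
    __m128 d  = _mm_add_ps(q, _mm_movehl_ps(q, q));        /* 2 partial sums in lanes 0 and 1 */
    __m128 s  = _mm_add_ss(d, _mm_shuffle_ps(d, d, 0x1));  /* lane 0 holds the total */
    return _mm_cvtss_f32(s);
}

/* Scalar fallback for CPUs without AVX. */
static float hsum8_scalar(const float *p)
{
    float s = 0.0f;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}

/* Hypothetical entry point: picks the AVX path when the CPU supports it. */
float hsum8(const float *p)
{
    return __builtin_cpu_supports("avx") ? hsum8_avx(p) : hsum8_scalar(p);
}
```

The two paths add the elements in a different order, so for general inputs their results can differ in the last bits because floating-point addition is not associative.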

How to use AVX/pclmulqdq on Mac OS X


Question: I am trying to compile a program that uses the pclmulqdq instruction present in new Intel processors. I've installed GCC 4.6 using macports, but when I compile my program (which uses the intrinsic _mm_clmulepi64_si128), I get

    /var/folders/ps/sfjmtgx5771_qbqnh4c9xclr0000gn/T//ccEAWWhd.s:16:no such instruction: `pclmulqdq $0, %xmm0,%xmm1'

It seems that GCC is able to generate the correct assembly code from the intrinsic, but the assembler does not recognize the instruction. I've installed

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX


I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I would like to know how best to do this in code, and I also want to know how it's done internally in the CPU, I mean with the super-scalar architecture. Let's say I want to do a long sum such as the following in SSE:

    //sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
    sum = _mm_set1_ps(0.0f);
    a1 = _mm_set1_ps(a[0]);
    b1 = _mm_load_ps(&b[0]);
    sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));
    a2 = _mm
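The pattern in the question can be written as a loop. The sketch below is an assumption-laden illustration: the function name `scaled_sum_sse` is hypothetical, `b` is assumed to hold `count` consecutive groups of four floats, and unaligned loads are used so the caller does not need aligned storage. It uses a separate multiply and add, as in the question; on CPUs with FMA3, and when compiling with `-mfma`, the multiply/add pair could be replaced by `_mm_fmadd_ps(ak, bk, sum)`.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: out[j] = sum over k of a[k] * b[4*k + j], for j = 0..3. */
void scaled_sum_sse(const float *a, const float *b, size_t count, float out[4])
{
    __m128 sum = _mm_setzero_ps();
    for (size_t k = 0; k < count; k++) {
        __m128 ak = _mm_set1_ps(a[k]);              /* broadcast the scalar a[k] */
        __m128 bk = _mm_loadu_ps(b + 4 * k);        /* k-th group of four floats */
        sum = _mm_add_ps(sum, _mm_mul_ps(ak, bk));  /* separate multiply, then add */
    }
    _mm_storeu_ps(out, sum);
}
```

Every iteration depends on the previous value of `sum`, so the loop is bound by the latency of the add (or of the fused operation); using several independent accumulators is a common way to expose more parallelism to the superscalar core.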