avx

L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes

Submitted by ↘锁芯ラ on 2019-11-27 04:26:33
I want to achieve the maximum bandwidth of the following operation on Intel processors:

    for(int i=0; i<n; i++) z[i] = x[i] + y[i]; // n = 2048

where x, y, and z are float arrays. I am doing this on Haswell, Ivy Bridge, and Westmere systems. I originally allocated the memory like this:

    char *a = (char*)_mm_malloc(sizeof(float)*n, 64);
    char *b = (char*)_mm_malloc(sizeof(float)*n, 64);
    char *c = (char*)_mm_malloc(sizeof(float)*n, 64);
    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

When I did this I got about 50% of the peak bandwidth I expected for each system. The peak values
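The excerpt is cut off, but a workaround commonly discussed for this kind of 4K-aliasing slowdown is to stagger the three buffers so that x[i], y[i], and z[i] never map to the same address modulo 4096. The sketch below is an assumption about one possible fix (the over-allocation size and the 64/128-byte offsets are illustrative), not the asker's measured solution:

    #include <immintrin.h>

    /* Sketch: over-allocate each buffer and offset y and z by one and two
       cache lines so that corresponding elements differ in address modulo 4096. */
    int n = 2048;
    char *a = (char*)_mm_malloc(sizeof(float)*n + 192, 64);
    char *b = (char*)_mm_malloc(sizeof(float)*n + 192, 64);
    char *c = (char*)_mm_malloc(sizeof(float)*n + 192, 64);
    float *x = (float*)a;          /* no offset      */
    float *y = (float*)(b + 64);   /* +1 cache line  */
    float *z = (float*)(c + 128);  /* +2 cache lines */
    /* ... use x, y, z, then _mm_free(a); _mm_free(b); _mm_free(c); */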

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros

Submitted by ♀尐吖头ヾ on 2019-11-27 03:26:56
Question: I want to shift SSE/AVX registers by multiples of 32 bits left or right while shifting in zeros. Let me be more precise about the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats:

    shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
    shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2]

For AVX I want to do the following shifts:

    shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7]
    shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6]
    shift3_AVX: [1, 2, 3,
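For the two SSE cases, the whole-register byte shift can be reused by casting between float and integer vectors. The sketch below assumes the notation above lists element 0 first (so [1, 2, 3, 4] means e0=1 ... e3=4); the function names are illustrative:

    #include <immintrin.h>

    /* [1,2,3,4] -> [0,1,2,3]: shift every float up one lane, zero-fill. */
    static inline __m128 shift1_sse(__m128 x)
    {
        return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4));
    }

    /* [1,2,3,4] -> [0,0,1,2]: shift up two lanes, zero-fill. */
    static inline __m128 shift2_sse(__m128 x)
    {
        return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));
    }

The 256-bit cases are harder because the AVX2 byte shift (_mm256_slli_si256) operates within each 128-bit lane, so the AVX shifts need additional cross-lane permutes or blends.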

practical BigNum AVX/SSE possible?

Submitted by 房东的猫 on 2019-11-27 02:14:41
SSE/AVX registers could be viewed as integer or floating-point BigNums. That is, one could ignore the fact that lanes exist at all. Is there an easy way to exploit this point of view and use these registers as BigNums, either singly or combined? I ask because, from what little I've seen of BigNum libraries, they almost universally store and do arithmetic on arrays, not on SSE/AVX registers. Portability? Example: say you store the contents of an SSE register as a key in a std::set; you could compare these contents as a BigNum. I think it may be possible to implement BigNum with SIMD
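As a small illustration of the "register as one wide integer" view, here is a sketch of a comparator that orders two __m128i values as unsigned 128-bit numbers (the helper name and the low-64-bits-in-lane-0 layout are assumptions); an ordering like this is what a std::set key comparison would need:

    #include <immintrin.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Compare two __m128i values as unsigned 128-bit integers,
       low 64 bits in lane 0.  Requires SSE4.1 for _mm_extract_epi64. */
    static bool u128_less(__m128i a, __m128i b)
    {
        uint64_t a_lo = (uint64_t)_mm_cvtsi128_si64(a);
        uint64_t a_hi = (uint64_t)_mm_extract_epi64(a, 1);
        uint64_t b_lo = (uint64_t)_mm_cvtsi128_si64(b);
        uint64_t b_hi = (uint64_t)_mm_extract_epi64(b, 1);
        return (a_hi < b_hi) || (a_hi == b_hi && a_lo < b_lo);
    }

Comparison is the easy part; carry propagation across lanes is what SIMD does not provide directly, which is a large part of why BigNum libraries work on scalar arrays instead.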

SIMD math libraries for SSE and AVX

Submitted by 邮差的信 on 2019-11-27 02:13:07
Question: I am looking for SIMD math libraries (preferably open source) for SSE and AVX. I mean, for example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once. AMD has a proprietary library, LibM (http://developer.amd.com/tools/cpu-development/libm/), which has some SIMD math functions, but LibM only uses AVX if it detects FMA4, which Intel CPUs don't have. Also, I'm not sure it fully uses AVX, as all the function names end in s4 (d2) and not s8 (d4). It

Loading 8 chars from memory into an __m256 variable as packed single precision floats

Submitted by 扶醉桌前 on 2019-11-27 02:02:23
I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?

    // unsigned char *new_image is loaded with data ...
    float buffer[8];
    buffer[x    ] = new_image[x];
    buffer[x + 1] = new_image[x + 1];
    buffer[x + 2] = new_image[x + 2];
    buffer[x + 3] = new_image[x + 3];
    buffer[x + 4] = new_image[x + 4];
    buffer[x + 5] = new_image[x + 5];
    buffer[x + 6] = new_image[x + 6];
    buffer[x + 7] = new_image[x + 7];
    // buffer is then used for further
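With AVX2 available, one common sequence for this kind of widening load (a sketch; the function name is illustrative) is a 64-bit load of the eight bytes, a zero-extension to 32-bit integers, and a conversion to float:

    #include <immintrin.h>

    /* Sketch assuming AVX2: 8 bytes -> 8 x int32 -> 8 x float. */
    static inline __m256 load8_uchar_to_ps(const unsigned char *p)
    {
        __m128i bytes = _mm_loadl_epi64((const __m128i*)p);   /* load 8 bytes         */
        __m256i ints  = _mm256_cvtepu8_epi32(bytes);          /* zero-extend to int32 */
        return _mm256_cvtepi32_ps(ints);                      /* convert to float     */
    }

The buffer-filling code above would then reduce to something like __m256 v = load8_uchar_to_ps(&new_image[x]);.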

How to efficiently perform double/int64 conversions with SSE/AVX?

Submitted by 橙三吉。 on 2019-11-27 01:36:25
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers: _mm_cvtps_epi32() and _mm_cvtepi32_ps(). But there are no equivalents for double precision and 64-bit integers. In other words, these are missing: _mm_cvtpd_epi64() and _mm_cvtepi64_pd(). It seems that AVX doesn't have them either. What is the most efficient way to simulate these intrinsics? If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions: if you don't care about infinity or NaN, and if for double <-> int64_t you only care about values in the range [-2^51, 2^51
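The two-instruction trick the excerpt alludes to adds a magic constant so that the integer value lands in the mantissa of a double. A sketch (function names are mine; valid only for values in roughly [-2^51, 2^51] and with the default round-to-nearest mode):

    #include <immintrin.h>

    /* 0x0018000000000000 == 2^52 + 2^51 == 1.5 * 2^52 when converted to double. */
    static inline __m128i double_to_int64(__m128d x)
    {
        x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
        return _mm_sub_epi64(_mm_castpd_si128(x),
                             _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
    }

    static inline __m128d int64_to_double(__m128i x)
    {
        x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
        return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
    }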

Code alignment in one object file is affecting the performance of a function in another object file

Submitted by 青春壹個敷衍的年華 on 2019-11-26 23:37:57
Question: I'm familiar with data alignment and performance but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing performance using code alignment. As far as I can tell, NASM inserts nop instructions to achieve code alignment. Here is a function I have been trying this on, on an Ivy Bridge system:

    void triad(float *x, float *y, float *z, int n, int repeat) {
        float k = 3.14159f;
        for(int r=0; r<repeat; r++) {
            for(int i=0; i<n; i++) {
                z[i] = x[i]

How to sum __m256 horizontally?

Submitted by …衆ロ難τιáo~ on 2019-11-26 23:09:06
Question: I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use

    _mm_hadd_ps(xmm, xmm);
    _mm_hadd_ps(xmm, xmm);

to get the result in the first component of the vector, but this does not scale to the 256-bit version of the function (_mm256_hadd_ps). What is the best way to compute the horizontal sum of a __m256 vector?

Answer 1: This version should be optimal for both Intel Sandy/Ivy Bridge and AMD Bulldozer:

    // x = ( x7, x6, x5, x4, x3, x2, x1, x0 )
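The truncated answer continues by splitting the 256-bit vector into two 128-bit halves and reducing with SSE adds; a sketch along those lines (reconstructed, not the verbatim answer):

    #include <immintrin.h>

    /* x = ( x7, x6, x5, x4, x3, x2, x1, x0 ) */
    static inline float hsum256_ps(__m256 x)
    {
        __m128 hi   = _mm256_extractf128_ps(x, 1);    /* ( x7, x6, x5, x4 )      */
        __m128 lo   = _mm256_castps256_ps128(x);      /* ( x3, x2, x1, x0 )      */
        __m128 sum4 = _mm_add_ps(lo, hi);             /* four partial sums       */
        __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));       /* two  */
        __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 0x1)); /* one  */
        return _mm_cvtss_f32(sum1);
    }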

How to use AVX/pclmulqdq on Mac OS X

Submitted by 心已入冬 on 2019-11-26 22:23:08
Question: I am trying to compile a program that uses the pclmulqdq instruction present in new Intel processors. I've installed GCC 4.6 using MacPorts, but when I compile my program (which uses the intrinsic _mm_clmulepi64_si128), I get

    /var/folders/ps/sfjmtgx5771_qbqnh4c9xclr0000gn/T//ccEAWWhd.s:16:no such instruction: `pclmulqdq $0, %xmm0,%xmm1'

It seems that GCC is able to generate the correct assembly code from the intrinsic, but the assembler does not recognize the instruction. I've installed

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

Submitted by 扶醉桌前 on 2019-11-26 21:56:21
I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I would like to know how to do this best in code, and I also want to know how it's done internally in the CPU, i.e. with the super-scalar architecture. Let's say I want to compute a long sum such as the following in SSE:

    // sum = a1*b1 + a2*b2 + a3*b3 + ...
    // where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
    sum = _mm_set1_ps(0.0f);
    a1  = _mm_set1_ps(a[0]);
    b1  = _mm_load_ps(&b[0]);
    sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));
    a2  = _mm
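On FMA-capable hardware, each multiply-plus-add step in the excerpt collapses into a single fused intrinsic. A sketch with FMA3 (the data layout for the second step is an assumption, since the excerpt is cut off):

    #include <immintrin.h>

    /* Sketch assuming FMA3: sum += a[k] * b[4k..4k+3] in one instruction per step. */
    static inline __m128 sum_two_terms_fma(const float *a, const float *b)
    {
        __m128 sum = _mm_set1_ps(0.0f);
        __m128 a1  = _mm_set1_ps(a[0]);
        __m128 b1  = _mm_load_ps(&b[0]);
        sum = _mm_fmadd_ps(a1, b1, sum);   /* fused multiply-add */
        __m128 a2  = _mm_set1_ps(a[1]);
        __m128 b2  = _mm_load_ps(&b[4]);
        sum = _mm_fmadd_ps(a2, b2, sum);
        return sum;
    }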