SSE

SSE2: Double precision log function

Submitted by 半腔热情 on 2019-12-03 11:59:45
I need an open source (no restriction on license) implementation of a log function, something with the signature __m128d _mm_log_pd(__m128d); It is available in the Intel Short Vector Math Library (part of ICC), but ICC is neither free nor open source. I am looking for an implementation using intrinsics only. It should use special rational function approximations. I need something almost as accurate as cmath log, say 9-10 decimal digits, but faster. Take a look at AMD LibM. It isn't open source, but it is free. AFAIK, it works on Intel CPUs. On the same web page you find a link to ACML, another free math lib from …
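
While searching for a proper rational-approximation implementation, a trivially correct (but slow) fallback can simply wrap the scalar libm log lane by lane. This is only a sketch for testing; the name mm_log_pd_fallback is made up here, and it delivers full double accuracy but none of the speed the question asks for:

    #include <math.h>
    #include <emmintrin.h>  /* SSE2 */

    /* Hypothetical fallback: scalar libm log applied to each lane. */
    static inline __m128d mm_log_pd_fallback(__m128d x)
    {
        double tmp[2];
        _mm_storeu_pd(tmp, x);        /* spill both lanes to memory */
        tmp[0] = log(tmp[0]);         /* scalar log on each lane    */
        tmp[1] = log(tmp[1]);
        return _mm_loadu_pd(tmp);     /* reload as a vector         */
    }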

Are older SIMD-versions available when using newer ones?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-03 11:58:59
When I can use SSE3 or AVX, are the older SSE versions such as SSE2 or MMX then also available, or do I still need to check for them separately? In general these have been additive, but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 as well. Remember that to use AVX you also need to validate that the OSXSAVE CPUID bit is set, to ensure the OS you are using actually supports saving the AVX registers as well. You should still explicitly check for all the CPUID support you use in your …
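
On GCC and Clang a convenient way to do such runtime checks is __builtin_cpu_supports. A minimal sketch, assuming a reasonably recent compiler; the feature strings probed below are illustrative, not exhaustive, and for AVX the builtin on current GCC also accounts for the XGETBV/OSXSAVE state, but verify that for your toolchain:

    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();                        /* initialise the feature cache (GCC) */
        if (__builtin_cpu_supports("avx"))           /* implies SSE..SSE4.2 on the CPU side */
            puts("AVX usable");
        if (__builtin_cpu_supports("sse2"))
            puts("SSE2 usable");
        return 0;
    }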

Is accessing bytes of a __m128 variable via union legal?

Submitted by 只愿长相守 on 2019-12-03 11:31:30
Consider this variable declaration: union { struct { float x, y, z, padding; } components; __m128 sse; } _data; My idea is to assign the values through the x, y, z fields, perform SSE2 computations, and read the result back through x, y, z. I have slight doubts as to whether it is legal, though. My concern is alignment: MSDN says that __m128 variables are automatically aligned to a 16-byte boundary, and I wonder if my union can break this behavior. Are there any other pitfalls to consider here? The union's alignment should be fine, but in the case of Windows you may be able to access the 32 bit …
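
For reference, a sketch of the union with the member names from the question. The alignas(16) is strictly redundant (the __m128 member already forces 16-byte alignment of the whole union) but documents the intent; the helper function is illustrative. Note that reading a member other than the one last written is well defined in C and tolerated by MSVC/GCC/Clang, but is formally undefined behaviour in standard C++:

    #include <xmmintrin.h>

    union alignas(16) Vec4
    {
        struct { float x, y, z, padding; } components;
        __m128 sse;
    };

    static inline Vec4 scale(Vec4 v, float s)        // usage sketch
    {
        v.sse = _mm_mul_ps(v.sse, _mm_set1_ps(s));   // compute through the vector view
        return v;                                    // read back through components.x/y/z
    }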

SIMD the following code

Submitted by 青春壹個敷衍的年華 on 2019-12-03 11:17:19
How do I vectorize the following code in C (using SIMD intrinsics, of course)? I am having trouble understanding SIMD intrinsics and this would help a lot: int sum_naive( int n, int *a ) { int sum = 0; for( int i = 0; i < n; i++ ) sum += a[i]; return sum; } Here's a fairly straightforward implementation (warning: untested code): int32_t sum_array(const int32_t a[], const int n) { __m128i vsum = _mm_set1_epi32(0); // initialise vector of four partial 32 bit sums int32_t sum; int i; for (i = 0; i < n; i += 4) { __m128i v = _mm_load_si128((const __m128i *)&a[i]); // load vector of 4 x 32 bit values vsum = _mm_add …
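
To round out the quoted answer, here is a full version of the same idea: four 32-bit partial sums in one SSE2 register, a horizontal reduction at the end, and a scalar tail. It differs from the snippet in using unaligned loads, so neither 16-byte alignment of a nor n being a multiple of 4 has to be assumed:

    #include <emmintrin.h>   // SSE2
    #include <stdint.h>

    int32_t sum_array(const int32_t a[], int n)
    {
        __m128i vsum = _mm_setzero_si128();                       // four partial 32-bit sums
        int i = 0;
        for (; i + 4 <= n; i += 4)
        {
            __m128i v = _mm_loadu_si128((const __m128i *)&a[i]);  // load 4 x 32-bit values
            vsum = _mm_add_epi32(vsum, v);                        // accumulate per lane
        }
        vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));      // horizontal sum of the
        vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));      // four lanes
        int32_t sum = _mm_cvtsi128_si32(vsum);
        for (; i < n; ++i)                                        // scalar tail for n % 4 != 0
            sum += a[i];
        return sum;
    }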

SSE: Difference between _mm_load/store vs. using direct pointer access

Submitted by 我只是一个虾纸丫 on 2019-12-03 11:14:40
Suppose I want to add two buffers and store the result. Both buffers are already allocated 16-byte aligned. I found two examples of how to do that. The first one uses _mm_load to read the data from the buffer into an SSE register, does the add operation, and stores the result back. Until now I would have done it like that. void _add( uint16_t * dst, uint16_t const * src, size_t n ) { for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 ) { __m128i _s = _mm_load_si128( (__m128i*) src ); __m128i _d = _mm_load_si128( (__m128i*) dst ); _d = _mm_add_epi16( _d, _s ); _mm_store …
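
For comparison, a sketch of the second style from the question: dereferencing __m128i pointers directly instead of calling the load/store intrinsics. On mainstream compilers both forms compile to the same aligned loads and stores, so the choice is mostly stylistic. Assumptions in this sketch: both pointers are 16-byte aligned and n is a multiple of 8:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    void add_direct(uint16_t *dst, const uint16_t *src, size_t n)
    {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 8; ++i)
            d[i] = _mm_add_epi16(d[i], s[i]);   // the compiler emits aligned load/store here
    }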

How to compare two vectors using SIMD and get a single boolean result?

Submitted by 邮差的信 on 2019-12-03 09:14:47
Question: I have two vectors of 4 integers each and I'd like to use a SIMD instruction to compare them (say, generate a result vector where each entry is 0 or 1 according to the result of the comparison). Then I'd like to compare the result vector to a vector of 4 zeros, and only if they're equal do something. Do you know what SIMD instructions I can use to do this? Answer 1: To compare two SIMD vectors: #include <stdint.h> #include <xmmintrin.h> int32_t __attribute__ ((aligned(16))) vector1[4] = { 1, 2, 3, 4 }; int32 …
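
A minimal SSE2 sketch of the "compare, then collapse to a single boolean" idea. Note that _mm_cmpeq_epi32 produces all-ones (0xFFFFFFFF) per matching lane rather than 1, which is why the byte mask is checked against 0xFFFF; the helper name is illustrative:

    #include <emmintrin.h>   // SSE2

    // Returns nonzero iff all four 32-bit lanes of a and b are equal.
    static inline int all_equal_epi32(__m128i a, __m128i b)
    {
        __m128i eq = _mm_cmpeq_epi32(a, b);       // 0xFFFFFFFF per equal lane, 0 otherwise
        return _mm_movemask_epi8(eq) == 0xFFFF;   // one bit per byte -> all 16 bits must be set
    }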

How do I convert __m128i to an unsigned int with SSE?

Submitted by 浪尽此生 on 2019-12-03 09:14:00
I have made a function for posterizing images. // =( #define ARGB_COLOR(a, r, g, b) (((a) << 24) | ((r) << 16) | ((g) << 8) | (b)) inline UINT PosterizeColor(const UINT &color, const float &nColors) { __m128 clr = _mm_cvtepi32_ps( _mm_cvtepu8_epi32((__m128i&)color) ); clr = _mm_mul_ps(clr, _mm_set_ps1(nColors / 255.0f) ); clr = _mm_round_ps(clr, _MM_FROUND_TO_NEAREST_INT); clr = _mm_mul_ps(clr, _mm_set_ps1(255.0f / nColors) ); __m128i iClr = _mm_cvttps_epi32(clr); return ARGB_COLOR(iClr.m128i_u8[12], iClr.m128i_u8[8], iClr.m128i_u8[4], iClr.m128i_u8[0]); } In the first line, I unpack the color …
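
The .m128i_u8 member used in the return statement is MSVC-specific. A portable sketch of the same final step, assuming the four 32-bit lanes hold {b, g, r, a} in the 0..255 range as in the question, is to pack the lanes down to bytes and extract the low dword; the helper name pack_argb is made up:

    #include <emmintrin.h>   // SSE2

    // iClr holds {b, g, r, a} as 32-bit ints in lanes 0..3, each in 0..255.
    static inline unsigned int pack_argb(__m128i iClr)
    {
        __m128i w = _mm_packs_epi32(iClr, iClr);     // 32 -> 16 bit, signed saturation (safe for 0..255)
        __m128i b = _mm_packus_epi16(w, w);          // 16 -> 8 bit, unsigned saturation
        return (unsigned int)_mm_cvtsi128_si32(b);   // low dword = b | g<<8 | r<<16 | a<<24
    }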

Disable AVX2 functions on non-Haswell processors

Submitted by 微笑、不失礼 on 2019-12-03 09:02:39
I have written some AVX2 code to run on a Haswell i7 processor. The same codebase is also used on non-Haswell processors, where the same code should be replaced with its SSE equivalents. I was wondering whether there is a way for the compiler to ignore AVX2 instructions on non-Haswell processors. I need something like: public void useSSEorAVX(...){ IF (compiler directive detected AVX2) AVX2 code (this part is ready) ELSE SSE code (this part is also ready) } Right now I am commenting out the related code before compiling, but there must be some more efficient way to do this. I am using Ubuntu and gcc.
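
Since the questioner is on gcc, one option is runtime dispatch: give each variant its own target attribute and pick one with __builtin_cpu_supports, so a single binary runs on both kinds of machine. The kernel bodies below are placeholders for the real AVX2 and SSE code, and the function names are made up:

    #include <immintrin.h>

    __attribute__((target("avx2")))
    static void kernel_avx2(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; ++i) dst[i] = src[i];   /* placeholder for the AVX2 body */
    }

    __attribute__((target("sse2")))
    static void kernel_sse2(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; ++i) dst[i] = src[i];   /* placeholder for the SSE body */
    }

    void useSSEorAVX(float *dst, const float *src, int n)
    {
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2(dst, src, n);   /* only reached on CPUs reporting AVX2 */
        else
            kernel_sse2(dst, src, n);
    }

The alternative is compile-time selection with #ifdef __AVX2__, but that only helps if a separate binary is built per target (e.g. with -mavx2).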

SSE, row major vs column major performance issue

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-03 08:50:49
For a personal, just-for-fun project, I'm coding a geometry lib using SSE (4.1). I spent the last 12 hours trying to understand a performance issue when dealing with row-major vs column-major stored matrices. I know DirectX/OpenGL matrices are stored row major, so it would be better for me to keep my matrices stored in row-major order, so I will have no conversion when storing/loading matrices to/from GPU/shaders. But I did some profiling, and I get faster results with column major. To transform a point with a transform matrix in row major, it's P' = P * M, and in column major it's P' = M * P. So in column major it …
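
For reference, the usual way to keep row-major storage fast is to splat each component of P and accumulate against the rows of M, which needs no transpose and no horizontal adds. A minimal sketch, assuming the matrix is stored as four __m128 rows and P = (x, y, z, w); the function name is illustrative:

    #include <xmmintrin.h>   // SSE

    // P' = P * M, with M stored row-major as four __m128 rows.
    static inline __m128 transform_row_major(__m128 p, const __m128 row[4])
    {
        __m128 x = _mm_shuffle_ps(p, p, _MM_SHUFFLE(0, 0, 0, 0));   // splat p.x
        __m128 y = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));   // splat p.y
        __m128 z = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2));   // splat p.z
        __m128 w = _mm_shuffle_ps(p, p, _MM_SHUFFLE(3, 3, 3, 3));   // splat p.w
        __m128 r = _mm_mul_ps(x, row[0]);
        r = _mm_add_ps(r, _mm_mul_ps(y, row[1]));
        r = _mm_add_ps(r, _mm_mul_ps(z, row[2]));
        r = _mm_add_ps(r, _mm_mul_ps(w, row[3]));
        return r;
    }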

Are static / static local SSE / AVX variables blocking an xmm / ymm register?

Submitted by 烈酒焚心 on 2019-12-03 08:42:33
When using SSE intrinsics, zero vectors are often required. One way to avoid creating a zero variable inside a function every time it is called (each call effectively executing a vector xor instruction) would be to use a static local variable, as in static inline __m128i negate(__m128i a) { static __m128i zero = _mm_setzero_si128(); return _mm_sub_epi16(zero, a); } It seems the variable is only initialized when the function is called for the first time. (I checked this by calling a true function instead of the _mm_setzero_si128() intrinsic. It only seems to be possible in C++, not in …
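
For the zero vector in particular the static is unlikely to buy anything: _mm_setzero_si128() compiles to a single pxor of a register with itself, which modern x86 CPUs recognise as a cheap zeroing idiom, whereas a function-local static in C++ typically adds a guard-variable check plus a load from memory. A minimal sketch of the simpler form:

    #include <emmintrin.h>

    static inline __m128i negate(__m128i a)
    {
        // the zero is materialised by a register-zeroing pxor; no static storage needed
        return _mm_sub_epi16(_mm_setzero_si128(), a);
    }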