sse

Compact a hex number

核能气质少年 posted on 2019-12-04 23:19:41
Is there a clever (i.e., branchless) way to "compact" a hex number — basically move all the 0 nibbles to one side? E.g. 0x10302040 -> 0x13240000 or 0x10302040 -> 0x00001324. I looked on Bit Twiddling Hacks but didn't see anything. It's for an SSE numerical pivoting algorithm: I need to remove any pivots that become 0. I can use _mm_cmpgt_ps to find good pivots, _mm_movemask_ps to convert that into a mask, and then bit hacks to get something like the above. The hex value gets munged into a mask for a _mm_shuffle_ps instruction to perform a permutation on the SSE 128-bit register. To compute mask for
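As a starting point, the compaction the question describes can be written as a scalar reference loop (the name `compact_nibbles` is hypothetical). A branchless variant could build a per-nibble mask and feed it to BMI2's `pext`, but the plain loop below pins down the intended behavior:

```cpp
#include <cstdint>

// Scalar reference for nibble compaction: non-zero hex digits of x keep
// their order and slide to the low end, e.g. 0x10302040 -> 0x00001324.
uint32_t compact_nibbles(uint32_t x)
{
    uint32_t out = 0;
    for (int i = 7; i >= 0; --i) {           // walk from the high nibble down
        uint32_t nib = (x >> (4 * i)) & 0xFu;
        if (nib != 0)
            out = (out << 4) | nib;          // append the non-zero digit
    }
    return out;
}
```

Compacting toward the high end instead (0x13240000) is the same loop with the shift direction reversed.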

The correct way to sum two arrays with SSE2 SIMD in C++

末鹿安然 posted on 2019-12-04 19:27:20
Let's start by including the following: #include <vector> #include <random> using namespace std; Now, suppose that one has the following three std::vector<float>: const int N = 1048576; vector<float> a(N); vector<float> b(N); vector<float> c(N); default_random_engine randomGenerator(time(0)); uniform_real_distribution<float> diceroll(0.0f, 1.0f); for(int i=0; i<N; i++) { a[i] = diceroll(randomGenerator); b[i] = diceroll(randomGenerator); } Now, assume that one needs to sum a and b element-wise and store the result in c, which in scalar form looks like the following: for(int i=0; i<N; i++) { c[i] = a[i]
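The usual SSE answer to this is a loop that processes four floats per iteration with `_mm_add_ps`, plus a scalar tail. A minimal sketch (the name `add_sse` is made up; unaligned loads are used so std::vector's default allocator suffices):

```cpp
#include <vector>
#include <emmintrin.h>

// Element-wise c = a + b, four floats per iteration.
void add_sse(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& c)
{
    size_t i = 0;
    size_t n = a.size();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)        // scalar tail for n not divisible by 4
        c[i] = a[i] + b[i];
}
```

With an aligned allocator (or over-aligned storage), `_mm_loadu_ps`/`_mm_storeu_ps` could be swapped for their aligned counterparts.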

Is accessing bytes of a __m128 variable via union legal?

梦想的初衷 posted on 2019-12-04 19:02:38
Question: Consider this variable declaration: union { struct { float x, y, z, padding; } components; __m128 sse; } _data; My idea is to assign the value through the x, y, z fields, perform SSE2 computations, and read the result through x, y, z. I have slight doubts as to whether it is legal, though. My concern is alignment: MSDN says that __m128 variables are automatically aligned to a 16-byte boundary, and I wonder if my union can break this behavior. Are there any other pitfalls to consider here? Answer 1:
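On the alignment point specifically: a union's alignment is the strictest of its members', so the `__m128` member keeps its 16-byte requirement. A sketch that checks this at compile time (reading fields written through the other member is type punning, which GCC and MSVC document as supported, though strictly it is outside standard C++):

```cpp
#include <emmintrin.h>

// The union from the question: the struct of four floats overlays the
// same 16 bytes as the __m128 member.
union Vec4 {
    struct { float x, y, z, padding; } components;
    __m128 sse;
};

static_assert(alignof(Vec4) == 16, "union preserves __m128 alignment");

// Write via fields, compute with SSE, read the result back via a field.
float double_x(float x, float y, float z)
{
    Vec4 v;
    v.components = {x, y, z, 0.0f};
    v.sse = _mm_add_ps(v.sse, v.sse);   // doubles all four lanes
    return v.components.x;
}
```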

pthreads v. SSE weak memory ordering

☆樱花仙子☆ posted on 2019-12-04 18:50:59
Question: Do the Linux glibc pthread functions on x86_64 act as fences for weakly-ordered memory accesses? (pthread_mutex_lock/unlock are the exact functions I'm interested in.) SSE2 provides some instructions with weak memory ordering (non-temporal stores such as movntps in particular). If you are using these instructions and want to guarantee that another thread/core sees an ordering, then I understand you need an explicit fence for this, e.g. an sfence instruction. Normally you do expect the pthread
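The explicit-fence pattern the question alludes to looks like this (names are illustrative; whether the pthread unlock path alone suffices for NT stores is exactly what the question is asking):

```cpp
#include <emmintrin.h>

// movntps (_mm_stream_ps) stores are weakly ordered, so they must be
// drained with sfence before another thread is told the data is ready.
void publish(float* dst16 /* must be 16-byte aligned */, __m128 v,
             int* ready)
{
    _mm_stream_ps(dst16, v);  // non-temporal store, bypasses the cache
    _mm_sfence();             // order the NT store before the flag write
    *ready = 1;               // consumer may now read dst16
}
```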

SSE2: Double precision log function

若如初见. posted on 2019-12-04 18:43:41
Question: I need an open-source (no restriction on license) implementation of a log function, something with the signature __m128d _mm_log_pd(__m128d);. It is available in the Intel Short Vector Math Library (part of ICC), but ICC is neither free nor open source. I am looking for an implementation using intrinsics only. It should use special rational function approximations. I need something almost as accurate as cmath's log, say 9-10 decimal digits, but faster. Answer 1: Take a look at AMD LibM. It isn't open source, but
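For orientation, vector log implementations are typically structured around an exponent/mantissa split; a scalar sketch of that skeleton (the `std::log(m)` call below is a placeholder where a real `_mm_log_pd` would use its rational approximation over the reduced range, and would do the split with integer bit tricks on the exponent field rather than frexp):

```cpp
#include <cmath>

// Split x = m * 2^e with m in [0.5, 1), then log(x) = e*ln2 + log(m).
double log_sketch(double x)
{
    int e;
    double m = std::frexp(x, &e);
    const double ln2 = 0.6931471805599453;
    return e * ln2 + std::log(m);   // approximation would replace std::log
}
```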

Does gcc use Intel's SSE 4.2 instructions for text processing if available?

非 Y 不嫁゛ posted on 2019-12-04 18:15:09
Question: I read here that Intel introduced SSE 4.2 instructions for accelerating string processing. Quote from the article: The SSE 4.2 instruction set, first implemented in Intel's Core i7, provides string and text processing instructions (STTNI) that utilize SIMD operations for processing character data. Though originally conceived for accelerating string, text, and XML processing, the powerful new capabilities of these instructions are useful outside of these domains, and it is worth revisiting the

Atomic operators, SSE/AVX, and OpenMP

僤鯓⒐⒋嵵緔 posted on 2019-12-04 17:29:22
I'm wondering if SSE/AVX operations such as addition and multiplication can be atomic operations. The reason I ask is that in OpenMP the atomic construct only works on a limited set of operators; it does not work on, for example, SSE/AVX additions. Let's assume I had a datatype float4 that corresponds to an SSE register and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code: float4 sum4 = 0.0f; //sets all four values to zero #pragma omp parallel { float4 sum_private = 0.0f; #pragma omp for nowait
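A hedged sketch of the standard workaround — since an SSE add is not atomic, each thread keeps a private `__m128` partial sum and the merge happens under `#pragma omp critical` (the function name is made up; without `-fopenmp` the pragmas are ignored and the code simply runs serially with the same result):

```cpp
#include <emmintrin.h>
#include <cstddef>

// Parallel sum of n floats (n assumed to be a multiple of 4).
float sum_sse_omp(const float* a, size_t n)
{
    __m128 total = _mm_setzero_ps();
    #pragma omp parallel
    {
        __m128 priv = _mm_setzero_ps();    // per-thread partial sum
        #pragma omp for nowait
        for (long i = 0; i < (long)n; i += 4)
            priv = _mm_add_ps(priv, _mm_loadu_ps(a + i));
        #pragma omp critical
        total = _mm_add_ps(total, priv);   // merge: SSE add is not atomic
    }
    float lanes[4];                        // horizontal sum of the 4 lanes
    _mm_storeu_ps(lanes, total);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```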

What's a good place to start learning assembly?

风格不统一 posted on 2019-12-04 16:33:22
I need to learn assembly using SSE instructions and need gcc to link the ASM code with C code. I have no idea where to start, and Google hasn't helped. You might want to start by looking through Intel's chip documentation, the Intel Processor Software Developer Manuals. Assembly language coding isn't a whole lot of fun, and it's usually unnecessary except in a few cases where code is performance-critical. Given you are looking at SSE, I would hazard that your effort may be better spent looking into CUDA, using your graphics card to perform vector computations via custom shaders. That way you don
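As a first taste of mixing assembly with C/C++ under gcc, GCC's extended inline asm lets you drop an SSE instruction into a C function without a separate .s file. A minimal sketch, assuming an x86-64 target (`"x"` constrains operands to XMM registers, `"+x"` makes `a` both input and output):

```cpp
// addss is the scalar single-precision SSE add (AT&T operand order).
float add_asm(float a, float b)
{
    asm("addss %1, %0" : "+x"(a) : "x"(b));
    return a;
}
```

The alternative is writing a standalone .s file, assembling it, and linking it alongside the C objects; inline asm is usually the gentler on-ramp.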

SSE: Difference between _mm_load/store vs. using direct pointer access

三世轮回 posted on 2019-12-04 16:16:13
Question: Suppose I want to add two buffers and store the result. Both buffers are already allocated 16-byte aligned. I found two examples of how to do that. The first one uses _mm_load to read the data from the buffer into an SSE register, does the add operation, and stores back to the result buffer. Until now I would have done it like that. void _add( uint16_t * dst, uint16_t const * src, size_t n ) { for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 ) { __m128i _s = _mm_load_si128( (
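A completed sketch of both styles the question contrasts (function names are illustrative). With gcc and clang they typically compile to the same aligned movdqa loads; `_mm_load_si128` is the documented, portable spelling, while dereferencing a `__m128i*` relies on `__m128i` being a may-alias type, a GCC extension:

```cpp
#include <emmintrin.h>
#include <cstdint>
#include <cstddef>

// Style 1: explicit load/store intrinsics. n is a multiple of 8;
// dst and src must be 16-byte aligned.
void add_intrinsics(uint16_t* dst, const uint16_t* src, size_t n)
{
    for (uint16_t* end = dst + n; dst != end; dst += 8, src += 8) {
        __m128i s = _mm_load_si128((const __m128i*)src);
        __m128i d = _mm_load_si128((const __m128i*)dst);
        _mm_store_si128((__m128i*)dst, _mm_add_epi16(d, s));
    }
}

// Style 2: direct pointer dereference (GCC/Clang extension).
void add_deref(uint16_t* dst, const uint16_t* src, size_t n)
{
    for (uint16_t* end = dst + n; dst != end; dst += 8, src += 8)
        *(__m128i*)dst = _mm_add_epi16(*(__m128i*)dst,
                                       *(const __m128i*)src);
}
```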

SIMD and difference between packed and scalar double precision

 ̄綄美尐妖づ posted on 2019-12-04 16:09:48
Question: I am reading Intel's intrinsics guide while implementing SIMD support. I have a few confusions, and my questions are below. __m128 _mm_cmpeq_ps (__m128 a, __m128 b): the documentation says it is used to compare packed single-precision floating-point values. What does "packed" mean? Do I need to pack my float values somehow before I can use them? For double precision there are intrinsics like _mm_cmpeq_sd, which means compare the "lower" double-precision floating-point elements. What does lower and
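In short: "packed" (_ps/_pd) operations act on every lane of the register, while "scalar" (_ss/_sd) operations act on the lowest lane only and pass the upper lane(s) through from the first operand; there is no separate packing step, since loading values into a `__m128d` already makes them "packed". A sketch (the helper name is made up) that makes the scalar pass-through visible by returning both lanes' bit patterns after `_mm_cmpeq_sd`:

```cpp
#include <emmintrin.h>
#include <cstdint>
#include <cstring>

// After a scalar compare, lane 0 holds the all-ones/all-zeros mask and
// lane 1 is a1, copied unchanged from the first operand.
void cmpeq_sd_lanes(double a0, double a1, double b0, uint64_t out[2])
{
    __m128d a = _mm_set_pd(a1, a0);   // _mm_set_pd takes (high, low)
    __m128d b = _mm_set_pd(0.0, b0);
    __m128d r = _mm_cmpeq_sd(a, b);
    std::memcpy(out, &r, 16);         // dump both 64-bit lanes
}
```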