sse | 易学教程

C style cast versus intrinsic cast

Question: Let's assume I have defined __m256d x and that I want to extract the lower 128 bits. I would do: __m128d xlow = _mm256_castpd256_pd128(x); However, I recently saw someone do: __m128d xlow = (__m128d) x; Is there a preferred method to use for the cast? Why use the first method? Source: https://stackoverflow.com/questions/20401413/c-style-cast-versus-intrinsic-cast

SSE alpha blending for pre-multiplied ARGB

Question: I'm trying to write an SSE-enabled alpha compositor, and this is what I've come up with. First, the code to blend two vectors of 4 pixels each: // alpha blend two 128-bit (16 byte) SSE vectors containing 4 pre-multiplied ARGB values each // __attribute__((always_inline)) static inline __m128i blend4(__m128i under, __m128i over) { // shuffle masks for alpha and 255 vector for 255-alpha // // NOTE: storing static __m128i here with _mm_set_si128 was _very_ slow, compiler doesn't seem // to know it
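For reference while reading the excerpt above, here is a minimal scalar sketch of the "over" operator for pre-multiplied ARGB, processing one pixel at a time. It is not the asker's SSE code, which is cut off in the excerpt. The function name blend_premul_argb, the rounding choice and the saturation step are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical scalar reference: out = over + under * (255 - alpha_over) / 255,
   applied to each 8-bit channel of a pre-multiplied ARGB pixel. */
uint32_t blend_premul_argb(uint32_t under, uint32_t over)
{
    unsigned inv = 255u - (over >> 24);            /* 255 - alpha of the top pixel */
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned u = (under >> shift) & 0xFFu;
        unsigned o = (over >> shift) & 0xFFu;
        unsigned c = o + (u * inv + 127u) / 255u;  /* rounded division by 255 */
        if (c > 255u)
            c = 255u;                              /* saturate, in case the input is not properly pre-multiplied */
        out |= (uint32_t)c << shift;
    }
    return out;
}
```

A vectorized version would typically compute the same expression on 16-bit lanes after unpacking the bytes, which is where the shuffle masks and the 255 vector mentioned in the question come in.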

C: x86 Intel Intrinsics usage of _mm_log2_ps() -> error: incompatible type 'int'?

Question: I'm trying to apply log2 to a __m128 variable, like this: #include <immintrin.h> int main (void) { __m128 two_v = {2.0, 2.0, 2.0, 2.0}; __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2) return 0; } Trying to compile this returns this error: error: initializing '__m128' with an expression of incompatible type 'int' __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2) ^ ~~~~~~~~~~~~~~~~~~ How can I fix it? Answer 1: The immintrin.h you look into and immintrin.h used for compilation are

Compact a hex number

Question: Is there a clever (i.e., branchless) way to "compact" a hex number? Basically, move all the 0s to one side, e.g. 0x10302040 -> 0x13240000 or 0x10302040 -> 0x00001324. I looked on Bit Twiddling Hacks but didn't see anything. It's for an SSE numerical pivoting algorithm. I need to remove any pivots that become 0. I can use _mm_cmpgt_ps to find good pivots, _mm_movemask_ps to convert that into a mask, and then bit hacks to get something like the above. The hex value gets munged into a mask for a
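As an illustration of the compaction described in the question above, here is a small sketch that packs the non-zero nibbles toward the low end (0x10302040 -> 0x00001324). The function name compact_nibbles_right is hypothetical. The loop has a fixed trip count and no data-dependent branch in its body, but whether a given compiler emits branch-free machine code for it is an assumption, not a guarantee.

```c
#include <stdint.h>

/* Hypothetical helper: walk the 8 nibbles from most to least significant
   and append each non-zero nibble to the result. */
uint32_t compact_nibbles_right(uint32_t x)
{
    uint32_t out = 0;
    for (int shift = 28; shift >= 0; shift -= 4) {
        uint32_t nib = (x >> shift) & 0xFu;
        unsigned keep = (nib != 0);        /* 1 if the nibble is kept, else 0 */
        out = (out << (4 * keep)) | nib;   /* shift by 0 and OR with 0 when the nibble is zero */
    }
    return out;
}
```

The left-aligned form (0x13240000) can be obtained by also counting the kept nibbles and shifting the result left afterwards, taking care not to shift a 32-bit value by 32 when the input is zero.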

Fast byte-wise replace if

Question: I have a function that copies binary data from one area to another, but only if the bytes are different from a specific value. Here is a code sample: void copy_if(char* src, char* dest, size_t size, char ignore) { for (size_t i = 0; i < size; ++i) { if (src[i] != ignore) dest[i] = src[i]; } } The problem is that this is too slow for my current needs. Is there a way to obtain the same result in a faster way? Update: Based on answers I tried two new implementations: void copy_if_vectorized(const
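One common way to vectorize the copy_if loop above is a compare-and-select over 16 bytes at a time. The sketch below uses only SSE2 and is not the asker's copy_if_vectorized, which is truncated in the excerpt. The name copy_if_sse2 is hypothetical. Unlike the scalar loop, it reads dest and writes every byte of each full 16-byte block back (ignored positions get their old value), so it assumes dest is readable and that nothing else modifies dest concurrently.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Hypothetical SSE2 variant of copy_if: select src where src != ignore, else keep dest. */
void copy_if_sse2(const char *src, char *dest, size_t size, char ignore)
{
    const __m128i vignore = _mm_set1_epi8(ignore);
    size_t i = 0;

    for (; i + 16 <= size; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dest + i));
        __m128i skip = _mm_cmpeq_epi8(s, vignore);        /* 0xFF where src == ignore */
        __m128i out = _mm_or_si128(_mm_and_si128(skip, d),     /* keep dest in skipped bytes */
                                   _mm_andnot_si128(skip, s)); /* take src everywhere else */
        _mm_storeu_si128((__m128i *)(dest + i), out);
    }

    /* Scalar tail for the remaining size % 16 bytes. */
    for (; i < size; ++i) {
        if (src[i] != ignore)
            dest[i] = src[i];
    }
}
```

Whether this is faster than the scalar loop depends on the data and the compiler, so it is worth measuring before adopting it.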

Dereference pointers in XMM register (gather)

Question: If I have some pointer or pointer-like values packed into an SSE or AVX register, is there any particularly efficient way to dereference them into another such register? ("Particularly efficient" meaning "more efficient than just using memory for the values".) Is there any way to dereference them all without writing an intermediate copy of the register out to memory? Edit for clarification: that means, assuming 32-bit pointers and SSE, to index into four arbitrary memory areas at once with
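To make the operation in the question above concrete, here is a sketch of an emulated four-element gather using only SSE2. It assumes 32-bit indices into a single base array rather than raw pointers, and the name gather4_epi32 is hypothetical. Each index is moved to a general-purpose register with _mm_cvtsi128_si32, so the index vector itself is not spilled to memory. The element loads are still ordinary scalar loads.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: returns { base[idx0], base[idx1], base[idx2], base[idx3] }.
   Assumes every index is non-negative and within the bounds of base. */
__m128i gather4_epi32(const int32_t *base, __m128i idx)
{
    int i0 = _mm_cvtsi128_si32(idx);                      /* lane 0 */
    int i1 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 4));   /* lane 1 */
    int i2 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 8));   /* lane 2 */
    int i3 = _mm_cvtsi128_si32(_mm_srli_si128(idx, 12));  /* lane 3 */
    return _mm_set_epi32(base[i3], base[i2], base[i1], base[i0]);
}
```

On CPUs with AVX2, _mm_i32gather_epi32 performs this kind of indexed load as a single intrinsic.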

SSE4.1 automatically put in string comparison on newer gcc

Question: I searched the gcc 4.8.1 documentation but couldn't find an answer to this: I have some SSE4.1 code and fallback code. At runtime I detect whether the system supports SSE4.1 and, in case it doesn't, I use the fallback code. So far so good, but with the latest gcc versions this is what happens: - my application crashes because SSE4.1 instructions are being spread throughout the code every time a string comparison is performed. Since I'm compiling all my files with -msse4.1 this sounds reasonable but

Shift right every DW in a __m128i by a different amount

Question: I want to shift right every element of a __m128i register by a different amount. I know this is possible by multiplication if we want to shift left, like below: __m128i mul_constant = _mm_set_epi32(8, 4, 2, 1); __m128i left_vshift = _mm_mullo_epi32(R, mul_constant); But what is the solution if we want to shift it right? Answer 1: I finally did it like below: Shifting every byte by a different amount to the left and then a 32-bit right shift by 3 gave me what I wanted. R = _mm_mullo_epi32(R, _mm_set
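The idea in the answer above can be modelled per lane in plain C: a right shift by n is replaced by a left shift by (max_shift - n), done as a multiplication, followed by a uniform right shift by max_shift. This is a scalar sketch of the arithmetic, not the answer's truncated SSE code, and the name shr_via_mul and the max_shift parameter are hypothetical. The multiplication wraps modulo 2^32, just as _mm_mullo_epi32 keeps only the low 32 bits, so the result is only correct when the top max_shift - n bits of x are zero.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane.
   Requires 0 <= n <= max_shift < 32. */
uint32_t shr_via_mul(uint32_t x, unsigned n, unsigned max_shift)
{
    uint32_t scaled = x * (UINT32_C(1) << (max_shift - n));  /* left shift by (max_shift - n), wraps mod 2^32 */
    return scaled >> max_shift;                               /* uniform right shift */
}
```

With max_shift = 3, the per-lane multipliers are 8, 4, 2 and 1 for shift amounts 0, 1, 2 and 3.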

Is possible to address the output SIMD register by using an input register

Question: Is it possible to use the scalar values of an input vector to index the output vector? I am trying to implement the following function in SIMD but I cannot find any solution. void shuffle(unsigned char * a, // input a unsigned char * r){ // output r for (int i=0; i < 16; i++) r[i] = 0; for (int i=0; i < 16; i++) r[a[i] % 16] = 1; } An example input / output vector would look like this: unsigned char a[16] = {0, 0, 0, 10, 0, 0, 0, 2, 0, 0, 0, 0, 3, 1, 0, 0 }; ... do SIMD magic // 0 1 2 3 4 5 6 7 8 9 10 11 12
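One possible way to approach the function in the question above is to collect the indices into a 16-bit mask first and then expand that mask to 16 bytes with SSE2. The sketch below is only partly vectorized (building the mask is still a scalar loop) and is not taken from any answer. The name mark_indices is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: sets r[j] = 1 if some a[i] % 16 == j, else r[j] = 0. */
void mark_indices(const unsigned char *a, unsigned char *r)
{
    /* Step 1 (scalar): one bit per output position. */
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= 1u << (a[i] % 16);

    /* Step 2 (SSE2): bytes 0-7 hold the low mask byte, bytes 8-15 the high mask byte. */
    __m128i lo = _mm_set1_epi8((char)(mask & 0xFF));
    __m128i hi = _mm_set1_epi8((char)(mask >> 8));
    __m128i bits = _mm_unpacklo_epi64(lo, hi);

    /* Byte j tests bit (j % 8) of its mask byte. */
    const __m128i sel = _mm_set_epi8((char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
                                     (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);
    __m128i hit = _mm_cmpeq_epi8(_mm_and_si128(bits, sel), sel);  /* 0xFF where the bit is set */
    _mm_storeu_si128((__m128i *)r, _mm_and_si128(hit, _mm_set1_epi8(1)));
}
```

For the example input in the question, the indices that occur are 0, 1, 2, 3 and 10, so r holds 1 at exactly those positions.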

Saving the XMM register before function call

Question: Is it required to save/push any XMM registers to the stack before an assembly function call? I am observing a crash in my code in release mode for 64-bit development (using AVX2). In debug mode it's working fine. When I save the contents of the XMM8 register and restore them at the end of the function call, it works fine. Any ideas or references? Answer 1: Yes, on Microsoft Windows you are required to preserve the XMM6-XMM15 registers. See http://msdn.microsoft.com/en-us