sse

C style cast versus intrinsic cast

北城以北 submitted on 2019-12-13 14:19:39
Question: Let's assume I have defined __m256d x and that I want to extract the lower 128 bits. I would do: __m128d xlow = _mm256_castpd256_pd128(x); However, I recently saw someone do: __m128d xlow = (__m128d) x Is there a preferred method to use for the cast? Why use the first method? Source: https://stackoverflow.com/questions/20401413/c-style-cast-versus-intrinsic-cast
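For reference, a minimal sketch of the intrinsic form, assuming GCC or Clang on x86-64. The helper name `low_half` is hypothetical, and the `target("avx")` attribute is only there so the snippet builds without `-mavx`. `_mm256_castpd256_pd128` is a compile-time reinterpretation and is documented as generating no instructions.

```c
#include <immintrin.h>

/* Hypothetical helper (not from the question): copies the lower two
   doubles of a 4-double vector into out[0..1]. The target attribute
   (GCC/Clang) enables AVX for this one function only. */
__attribute__((target("avx")))
static inline void low_half(const double in[4], double out[2])
{
    __m256d x = _mm256_loadu_pd(in);
    __m128d xlow = _mm256_castpd256_pd128(x);  /* lower 128 bits of x */
    _mm_storeu_pd(out, xlow);
}
```

Calling `low_half` on {1.0, 2.0, 3.0, 4.0} leaves 1.0 and 2.0 in `out`; the CPU must support AVX at run time.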

SSE alpha blending for pre-multiplied ARGB

两盒软妹~` submitted on 2019-12-13 13:39:45
Question: I'm trying to write an SSE-enabled alpha compositor; this is what I've come up with. First, the code to blend two vectors of 4 pixels each:
    // alpha blend two 128-bit (16 byte) SSE vectors containing 4 pre-multiplied ARGB values each
    //
    __attribute__((always_inline))
    static inline __m128i blend4(__m128i under, __m128i over)
    {
        // shuffle masks for alpha and 255 vector for 255-alpha
        //
        // NOTE: storing static __m128i here with _mm_set_si128 was _very_ slow, compiler doesn't seem
        // to know it
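Since the asker's blend4 is cut off above, here is an independent SSE2-only sketch of the pre-multiplied "over" operator, out = over + under * (255 - alpha_over) / 255. It is not the asker's implementation; the names `div255_epu16`, `blend4_sketch`, and `blend4_u32` are hypothetical, and alpha is assumed to be the top byte of each 32-bit pixel.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Rounded division by 255 of eight unsigned 16-bit lanes,
   valid for inputs in [0, 255*255]. */
static inline __m128i div255_epu16(__m128i x)
{
    __m128i t = _mm_add_epi16(x, _mm_set1_epi16(128));
    return _mm_srli_epi16(_mm_add_epi16(t, _mm_srli_epi16(t, 8)), 8);
}

/* "over" for 4 pre-multiplied ARGB pixels:
   out = over + under * (255 - over.alpha) / 255, per channel. */
static inline __m128i blend4_sketch(__m128i under, __m128i over)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i c255 = _mm_set1_epi16(255);

    /* Alpha is the top byte of each pixel; copy it into both 16-bit
       halves of the pixel: lanes A0,A0,A1,A1,A2,A2,A3,A3. */
    __m128i a = _mm_srli_epi32(over, 24);
    a = _mm_or_si128(a, _mm_slli_epi32(a, 16));

    /* One 16-bit (255 - alpha) value per channel. */
    __m128i inv_lo = _mm_sub_epi16(c255, _mm_unpacklo_epi32(a, a));
    __m128i inv_hi = _mm_sub_epi16(c255, _mm_unpackhi_epi32(a, a));

    /* Widen the under pixels to 16 bits per channel and scale them. */
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(under, zero), inv_lo);
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(under, zero), inv_hi);
    __m128i scaled = _mm_packus_epi16(div255_epu16(lo), div255_epu16(hi));

    /* Saturating add guards against slightly invalid pre-multiplied input. */
    return _mm_adds_epu8(over, scaled);
}

/* Convenience wrapper for arrays of 4 pixels. */
static inline void blend4_u32(const uint32_t under[4], const uint32_t over[4],
                              uint32_t out[4])
{
    __m128i u = _mm_loadu_si128((const __m128i *)under);
    __m128i o = _mm_loadu_si128((const __m128i *)over);
    _mm_storeu_si128((__m128i *)out, blend4_sketch(u, o));
}
```

The constants are built with `_mm_set1_epi16` inside the function, which compilers typically hoist or fold; whether that avoids the slowdown the asker mentions is not something this sketch measures.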

C: x86 Intel Intrinsics usage of _mm_log2_ps() -> error: incompatible type 'int'?

给你一囗甜甜゛ submitted on 2019-12-13 13:25:37
Question: I'm trying to apply log2 to a __m128 variable, like this:
    #include <immintrin.h>
    int main (void) {
        __m128 two_v = {2.0, 2.0, 2.0, 2.0};
        __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2)
        return 0;
    }
Trying to compile this returns this error:
    error: initializing '__m128' with an expression of incompatible type 'int'
    __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2)
           ^        ~~~~~~~~~~~~~~~~~~
How can I fix it? Answer 1: The immintrin.h you look into and immintrin.h used for compilation are

Compact a hex number

时光总嘲笑我的痴心妄想 submitted on 2019-12-13 12:25:30
Question: Is there a clever (i.e., branchless) way to "compact" a hex number? Basically, move all the 0s to one side, e.g. 0x10302040 -> 0x13240000 or 0x10302040 -> 0x00001324. I looked on Bit Twiddling Hacks but didn't see anything. It's for an SSE numerical pivoting algorithm. I need to remove any pivots that become 0. I can use _mm_cmpgt_ps to find good pivots, _mm_movemask_ps to convert that into a mask, and then bit hacks to get something like the above. The hex value gets munged into a mask for a
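To make the requested transformation concrete, here is a portable scalar sketch in which the data-dependent part uses no branches: each hex digit is ORed into the output, and the write position advances only when the digit is non-zero. The function names are hypothetical, and the fixed 8-iteration loop is assumed to be unrolled by the compiler. This only illustrates the semantics and is not a tuned SSE solution.

```c
#include <stdint.h>

/* Packs the non-zero hex digits of x toward the low end,
   preserving their order: 0x10302040 -> 0x00001324. */
static inline uint32_t compact_right(uint32_t x)
{
    uint32_t out = 0;
    unsigned pos = 0;                 /* bit position of the next output digit */
    for (int i = 0; i < 8; ++i) {
        uint32_t nib = (x >> (4 * i)) & 0xFu;
        out |= nib << pos;            /* no effect when nib == 0 */
        pos += 4u * (nib != 0);       /* advance only past kept digits */
    }
    return out;
}

/* Packs the non-zero hex digits of x toward the high end:
   0x10302040 -> 0x13240000. */
static inline uint32_t compact_left(uint32_t x)
{
    uint32_t out = 0;
    int pos = 28;                     /* bit position of the next output digit */
    for (int i = 7; i >= 0; --i) {
        uint32_t nib = (x >> (4 * i)) & 0xFu;
        out |= nib << pos;            /* pos is never negative when used */
        pos -= 4 * (nib != 0);
    }
    return out;
}
```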

Fast byte-wise replace if

倖福魔咒の submitted on 2019-12-13 12:03:45
Question: I have a function that copies binary data from one area to another, but only if the bytes are different from a specific value. Here is a code sample:
    void copy_if(char* src, char* dest, size_t size, char ignore)
    {
        for (size_t i = 0; i < size; ++i)
        {
            if (src[i] != ignore)
                dest[i] = src[i];
        }
    }
The problem is that this is too slow for my current needs. Is there a way to obtain the same result faster? Update: Based on the answers, I tried two new implementations: void copy_if_vectorized(const
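One possible vectorization, sketched with SSE2 only: compare 16 source bytes against `ignore`, then blend source and destination through the resulting mask. The name `copy_if_sse2` is hypothetical and this is not the asker's truncated `copy_if_vectorized`. Unlike the scalar loop, it rewrites every destination byte, including those it leaves unchanged, which matters if another thread touches `dest` concurrently.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

static inline void copy_if_sse2(const char *src, char *dest, size_t size, char ignore)
{
    const __m128i vignore = _mm_set1_epi8(ignore);
    size_t i = 0;

    /* Process 16 bytes per iteration. */
    for (; i + 16 <= size; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dest + i));
        __m128i skip = _mm_cmpeq_epi8(s, vignore);          /* 0xFF where src == ignore */
        __m128i r = _mm_or_si128(_mm_and_si128(skip, d),    /* keep dest where skipped */
                                 _mm_andnot_si128(skip, s)); /* take src elsewhere */
        _mm_storeu_si128((__m128i *)(dest + i), r);
    }

    /* Scalar tail for the remaining 0..15 bytes. */
    for (; i < size; ++i) {
        if (src[i] != ignore)
            dest[i] = src[i];
    }
}
```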

Dereference pointers in XMM register (gather)

*爱你&永不变心* submitted on 2019-12-13 08:07:27
Question: If I have some pointer or pointer-like values packed into an SSE or AVX register, is there any particularly efficient way to dereference them into another such register? ("Particularly efficient" meaning "more efficient than just using memory for the values".) Is there any way to dereference them all without writing an intermediate copy of the register out to memory? Edit for clarification: that means, assuming 32-bit pointers and SSE, to index into four arbitrary memory areas at once with

SSE4.1 automatically put in string comparison on newer gcc

走远了吗. submitted on 2019-12-13 06:06:29
Question: I searched the gcc 4.8.1 documents but couldn't find an answer to this: I have some SSE4.1 code and fallback code. At runtime I detect whether the system supports SSE4.1 and, in case it doesn't, I use the fallback code. So far so good, but with the latest gcc versions this is what happens: my application crashes because SSE4.1 instructions are being emitted throughout the code every time a string comparison is performed. Since I'm compiling all my files with -msse41 this sounds reasonable but
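One common way to keep SSE4.1 instructions out of the rest of the program is to enable them per function instead of per file. The sketch below assumes a GCC or Clang recent enough to support the `target` function attribute together with the intrinsic headers, and it assumes the file is built without any SSE4.1 flag. The `max4_*` functions are hypothetical stand-ins for the asker's SSE4.1 and fallback code.

```c
#include <immintrin.h>

/* SSE4.1 path: only this function is compiled with SSE4.1 enabled. */
__attribute__((target("sse4.1")))
static inline int max4_sse41(const int *v)
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    x = _mm_max_epi32(x, _mm_shuffle_epi32(x, 0x4E));  /* swap 64-bit halves */
    x = _mm_max_epi32(x, _mm_shuffle_epi32(x, 0xB1));  /* swap adjacent lanes */
    return _mm_cvtsi128_si32(x);
}

/* Fallback path: plain C, safe on any x86 CPU. */
static inline int max4_scalar(const int *v)
{
    int m = v[0];
    for (int i = 1; i < 4; ++i)
        if (v[i] > m)
            m = v[i];
    return m;
}

/* Runtime dispatch: the compiler builtin queries CPUID results. */
static inline int max4(const int *v)
{
    return __builtin_cpu_supports("sse4.1") ? max4_sse41(v) : max4_scalar(v);
}
```

With this layout the compiler has no licence to use SSE4.1 in unrelated code such as string comparisons, because the baseline instruction set for the translation unit stays unchanged.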

Shift right every DW in a __m128i by a different amount

浪尽此生 submitted on 2019-12-13 03:31:49
Question: I want to shift right every element of a __m128i register by a different amount. I know this is possible by multiplication if we want to shift left, like below:
    __m128i mul_constant = _mm_set_epi32(8, 4, 2, 1);
    __m128i left_vshift = _mm_mullo_epi32(R, mul_constant);
But what is the solution if we want to shift right? Answer 1: I finally did it like below: Shifting every byte by a different amount to left and then a 32-bit right shift by 3 gave me what I wanted. R = _mm_mullo_epi32(R, _mm_set
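The answer's code is cut off, so the following is a different, SSE2-only variant of the same multiply idea rather than a reconstruction of it. For a shift count s in 1..31, x >> s equals the high 32 bits of the 64-bit product x * 2^(32 - s), and `_mm_mul_epu32` produces exactly such 64-bit products. The function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Per-lane logical right shift. pow2 must hold 2^(32 - s_i) in lane i,
   with every shift count s_i in 1..31. */
static inline __m128i srl_var_epi32_sse2(__m128i x, __m128i pow2)
{
    /* 64-bit products of lanes 0 and 2. */
    __m128i even = _mm_mul_epu32(x, pow2);
    /* 64-bit products of lanes 1 and 3, moved into the even slots first. */
    __m128i odd = _mm_mul_epu32(_mm_srli_epi64(x, 32), _mm_srli_epi64(pow2, 32));

    /* The high 32 bits of each product are the shifted values. */
    even = _mm_srli_epi64(even, 32);                         /* results in lanes 0, 2 */
    odd = _mm_and_si128(odd, _mm_set_epi32(-1, 0, -1, 0));   /* results in lanes 1, 3 */
    return _mm_or_si128(even, odd);
}

/* Array wrapper: builds the power-of-two multipliers from the counts. */
static inline void srl_var_u32(const uint32_t x[4], const uint32_t count[4],
                               uint32_t out[4])
{
    uint32_t pow2[4];
    for (int i = 0; i < 4; ++i)
        pow2[i] = 1u << (32 - count[i]);   /* counts must be in 1..31 */

    __m128i r = srl_var_epi32_sse2(_mm_loadu_si128((const __m128i *)x),
                                   _mm_loadu_si128((const __m128i *)pow2));
    _mm_storeu_si128((__m128i *)out, r);
}
```

Because the full 64-bit product is kept, no high bits are lost before the shift; the price is two multiplies and a shift count of 0 is not representable.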

Is it possible to address the output SIMD register by using an input register?

有些话、适合烂在心里 submitted on 2019-12-13 02:06:58
Question: Is it possible to use the scalar values of an input vector to index the output vector? I am trying to implement the following function in SIMD but I cannot find any solution.
    void shuffle(unsigned char * a,   // input a
                 unsigned char * r) { // output r
        for (int i = 0; i < 16; i++) r[i] = 0;
        for (int i = 0; i < 16; i++) r[a[i] % 16] = 1;
    }
An example input / output vector would look like this:
    unsigned char a[16] = {0, 0, 0, 10, 0, 0, 0, 2, 0, 0, 0, 0, 3, 1, 0, 0 };
    ... do SIMD magic
    // 0 1 2 3 4 5 6 7 8 9 10 11 12
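One way to express this scatter without indexed stores, sketched with SSE2: broadcast each input value (mod 16), compare it against the constant vector 0..15, and OR the matches together. The name `mark_indices_sse2` is hypothetical. The loop still reads the 16 input bytes one at a time, so this shows the idea and makes no claim about being faster than the scalar code.

```c
#include <emmintrin.h>  /* SSE2 */

/* Sets r[k] = 1 if some a[i] % 16 == k, else r[k] = 0 (k = 0..15). */
static inline void mark_indices_sse2(const unsigned char *a, unsigned char *r)
{
    /* Byte k of iota holds the value k. */
    const __m128i iota = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    __m128i hit = _mm_setzero_si128();

    for (int i = 0; i < 16; ++i) {
        __m128i idx = _mm_set1_epi8((char)(a[i] % 16));
        hit = _mm_or_si128(hit, _mm_cmpeq_epi8(idx, iota));  /* 0xFF at byte a[i] % 16 */
    }

    /* Convert 0xFF / 0x00 into 1 / 0 and store the 16 result bytes. */
    _mm_storeu_si128((__m128i *)r, _mm_and_si128(hit, _mm_set1_epi8(1)));
}
```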

Saving the XMM register before function call

倖福魔咒の submitted on 2019-12-13 01:35:06
Question: Is it required to save/push any XMM registers to the stack before an assembly function call? I am observing a crash in my code in release mode for 64-bit development (using AVX2). In debug mode it works fine. When I tried saving the contents of the XMM8 register and restoring them at the end of the function call, it worked fine. Any ideas or references? Answer 1: Yes, on Microsoft Windows you are required to preserve the XMM6-XMM15 registers. See http://msdn.microsoft.com/en-us