SSE

How to load a pixel struct into an SSE register?

Submitted by 和自甴很熟 on 2019-12-18 12:25:21
Question: I have a struct of 8-bit pixel data:

    struct __attribute__((aligned(4))) pixels {
        char r;
        char g;
        char b;
        char a;
    };

I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can I load these pixels into an SSE register as 32-bit unsigned integers?

Answer 1: Unpacking unsigned pixels with SSE2. OK, using SSE2 integer intrinsics from <emmintrin.h>, first load the thing into the lower 32 bits of the register:

    __m128i xmm0 = _mm_cvtsi32_si128(*…
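A minimal SSE2 sketch continuing the quoted answer's idea past the point where it cuts off (the helper name and the use of uint8_t in place of char are my own additions):

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    struct __attribute__((aligned(4))) pixels { uint8_t r, g, b, a; };

    /* Load one pixel and zero-extend its four channels to 32-bit lanes. */
    static __m128i load_pixel_epi32(const struct pixels *p)
    {
        __m128i v = _mm_cvtsi32_si128(*(const int *)p); /* r,g,b,a in low 32 bits */
        __m128i zero = _mm_setzero_si128();
        v = _mm_unpacklo_epi8(v, zero);     /* bytes  -> 16-bit lanes */
        return _mm_unpacklo_epi16(v, zero); /* 16-bit -> 32-bit lanes */
    }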

How to take the absolute value of 2 doubles or 4 floats using the SSE instruction set? (Up to SSE4)

Submitted by 这一生的挚爱 on 2019-12-18 11:55:27
Question: Here's the sample C code that I am trying to accelerate using SSE; the two arrays are 3072 elements long and hold doubles. I may drop down to float if I don't need the precision of doubles.

    double sum = 0.0;
    for (k = 0; k < 3072; k++) {
        sum += fabs(sima[k] - simb[k]);
    }
    double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));

Anyway, my current problem is how to do the fabs step in an SSE register for doubles or floats, so that I can keep the whole calculation in SSE registers and it stays fast…
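For reference, the standard trick is to clear the sign bit with a bitwise AND, since IEEE-754 keeps the sign in the top bit. A sketch of the double version of the loop above (the helper name and the unaligned loads are assumptions; sima/simb are the arrays from the question):

    #include <emmintrin.h>  /* SSE2 */

    static double sum_abs_diff(const double *sima, const double *simb)
    {
        const __m128d signmask = _mm_set1_pd(-0.0);   /* only the sign bit set */
        __m128d acc = _mm_setzero_pd();
        for (int k = 0; k < 3072; k += 2) {
            __m128d d = _mm_sub_pd(_mm_loadu_pd(sima + k), _mm_loadu_pd(simb + k));
            acc = _mm_add_pd(acc, _mm_andnot_pd(signmask, d)); /* fabs via ANDN */
        }
        __m128d hi = _mm_unpackhi_pd(acc, acc);       /* horizontal sum */
        return _mm_cvtsd_f64(_mm_add_sd(acc, hi));
    }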

Write x86 asm functions portably (win/linux/osx), without a build-depend on yasm/nasm?

Submitted by 限于喜欢 on 2019-12-18 09:13:59
Question: par2 has a small and fairly clean C++ codebase, which I think builds fine on GNU/Linux, OS X, and Windows (with MSVC++). I'd like to incorporate an x86-64 asm version of the one function that takes nearly all the CPU time. (Mailing list posts with more details; my implementation/benchmark here.) Intrinsics would be the obvious solution, but gcc doesn't generate good enough code for getting one byte at a time from a 64-bit register for use as an index into a LUT. I might also take the time to…
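One route that avoids a yasm/nasm build dependency is GNU extended inline asm, which the ordinary compiler assembles itself. A sketch of the byte-extraction pattern the question mentions (the function is hypothetical, not par2 code; this works with gcc/clang on all three OSes, while MSVC would still need a separate path):

    #include <stdint.h>

    /* Extract the low byte of a 64-bit value for use as a LUT index.
       %b1 names the byte subregister of operand 1 and %k0 the 32-bit
       form of operand 0, so this emits e.g. "movzbl %dil, %eax". */
    static inline uint32_t low_byte_index(uint64_t word)
    {
        uint32_t idx;
        __asm__ ("movzbl %b1, %k0" : "=r"(idx) : "r"(word));
        return idx;
    }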

How to use bits in a byte to set dwords in a ymm register without AVX2? (Inverse of vmovmskps)

Submitted by 不羁的心 on 2019-12-18 09:12:57
Question: What I'm trying to achieve is, based on each bit in a byte, to set each dword in a ymm register (or memory location) to all-ones or all-zeros, e.g.

    al   = 0110 0001
    ymm0 = 0x00000000 FFFFFFFF FFFFFFFF 00000000 00000000 00000000 00000000 FFFFFFFF

i.e. an inverse of vmovmskps eax, ymm0 / _mm256_movemask_ps, turning a bitmap into a vector mask. I'm thinking there are a handful of SSE/AVX instructions that can do this relatively simply, but I haven't been able to work it out. Preferably Sandy Bridge compatible, so…
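A Sandy Bridge-compatible sketch: AVX1 has no 256-bit integer compare, so one workable approach does the integer work in two SSE2 halves and joins them with vinsertf128 (the helper name is mine):

    #include <immintrin.h>

    /* Turn the low 8 bits of `bits` into 8 all-ones/all-zeros dword lanes. */
    static __m256 mask_from_bits(unsigned bits)
    {
        __m128i v      = _mm_set1_epi32((int)bits);
        __m128i sel_lo = _mm_setr_epi32(1, 2, 4, 8);
        __m128i sel_hi = _mm_setr_epi32(16, 32, 64, 128);
        /* Lane k is all-ones iff bit k survived the AND. */
        __m128i lo = _mm_cmpeq_epi32(_mm_and_si128(v, sel_lo), sel_lo);
        __m128i hi = _mm_cmpeq_epi32(_mm_and_si128(v, sel_hi), sel_hi);
        return _mm256_castsi256_ps(
            _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1));
    }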

Reverse an AVX register containing doubles using a single AVX intrinsic

Submitted by 大憨熊 on 2019-12-18 09:08:15
Question: If I have an AVX register with 4 doubles in it and I want to store its reverse in another register, is it possible to do this with a single intrinsic? For example, with 4 floats in an SSE register I could use:

    _mm_shuffle_ps(A, A, _MM_SHUFFLE(0, 1, 2, 3));

Can I do this using, maybe, _mm256_permute2f128_pd()? I don't think you can address each individual double using the above intrinsic.

Answer 1: You actually need 2 permutes to do this: _mm256_permute2f128_pd() only permutes in…
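To make the answer's point concrete, a sketch of both variants (function names are mine): two permutes on AVX1, or one cross-lane permute if AVX2 is available:

    #include <immintrin.h>

    /* AVX1: swap the two 128-bit halves, then swap within each half. */
    static __m256d reverse_pd(__m256d x)
    {
        __m256d h = _mm256_permute2f128_pd(x, x, 0x01);
        return _mm256_permute_pd(h, 0x5);        /* 0b0101: swap pairs */
    }

    /* AVX2: a single cross-lane permute does it. */
    static __m256d reverse_pd_avx2(__m256d x)
    {
        return _mm256_permute4x64_pd(x, _MM_SHUFFLE(0, 1, 2, 3));
    }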

How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU?

Submitted by 不想你离开。 on 2019-12-18 08:26:48
Question: How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU? I mean:

    pow(x, y) = exp(y * log(x))

i.e., do both the exp() and log() AVX x86_64 operations require a certain known number of cycles?

    exp(): _mm256_exp_ps()
    log(): _mm256_log_ps()

Or can the number of cycles vary with the exponent; is there a maximum number of cycles exponentiation can cost?

Answer 1: The x86 SIMD instruction set (i.e. not x87), at least up to AVX2, does not include SIMD exp, log, or…
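Note that _mm256_exp_ps and _mm256_log_ps are SVML library functions (provided by Intel's compiler and newer MSVC, not by plain GCC), not instructions, so their cost is a function call's cost rather than a fixed latency. Assuming SVML is available, the pow identity from the question looks like:

    #include <immintrin.h>

    /* pow(x, y) = exp(y * log(x)), eight floats at a time (requires SVML). */
    static __m256 pow_ps(__m256 x, __m256 y)
    {
        return _mm256_exp_ps(_mm256_mul_ps(y, _mm256_log_ps(x)));
    }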

Shifting 4 integers right by different values with SIMD

Submitted by 半城伤御伤魂 on 2019-12-18 06:50:14
Question: SSE does not provide a way of shifting packed integers by a variable amount (I can use any instructions AVX and older); you can only do uniform shifts. The result I'm trying to achieve for each integer in the vector is this:

    i[0] = i[0] & 0b111111;
    i[1] = (i[1] >> 6) & 0b111111;
    i[2] = (i[2] >> 12) & 0b111111;
    i[3] = (i[3] >> 18) & 0b111111;

Essentially I'm trying to isolate a different group of 6 bits in each integer. So what is the optimal solution? Things I thought about: you can simulate a variable…
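Because the four shift counts are compile-time constants, one SSE4.1 trick is to left-shift each lane into the same bit position with a multiply, then do a single uniform right shift; a sketch (helper name mine):

    #include <smmintrin.h>  /* SSE4.1 for _mm_mullo_epi32 */

    static __m128i isolate_6bit_groups(__m128i v)
    {
        /* Lane k is multiplied by 1 << (18 - 6k), i.e. left-shifted so the
           wanted 6-bit group lands at bits 18..23 in every lane. */
        const __m128i mul = _mm_setr_epi32(1 << 18, 1 << 12, 1 << 6, 1);
        v = _mm_mullo_epi32(v, mul);
        v = _mm_srli_epi32(v, 18);              /* one uniform right shift */
        return _mm_and_si128(v, _mm_set1_epi32(0x3F));
    }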

How to enable SSE3 autovectorization in gcc

Submitted by ぃ、小莉子 on 2019-12-18 06:48:33
Question: I have a simple loop which takes the product of n complex numbers. As I perform this loop millions of times, I want it to be as fast as possible. I understand that it's possible to do this quickly using SSE3 and gcc intrinsics, but I am interested in whether it is possible to get gcc to auto-vectorize the code. Here is some sample code:

    #include <complex.h>

    complex float f(complex float x[], int n)
    {
        complex float p = 1.0;
        for (int i = 0; i < n; i++)
            p *= x[i];
        return p;
    }

The assembly you get…
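For what it's worth, GCC generally refuses to vectorize this loop unless -ffast-math relaxes the strict complex-multiply semantics, e.g. gcc -O3 -msse3 -ffast-math. A variant that hands the compiler the reassociation explicitly (my own rewrite, under the assumption that reordering the products is acceptable):

    #include <complex.h>

    complex float f4(const complex float x[], int n)
    {
        /* Four independent partial products break the loop-carried chain. */
        complex float p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            p0 *= x[i];     p1 *= x[i + 1];
            p2 *= x[i + 2]; p3 *= x[i + 3];
        }
        for (; i < n; i++)   /* leftover elements */
            p0 *= x[i];
        return (p0 * p1) * (p2 * p3);
    }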

Compilation of a simple C++ program using SSE intrinsics

Submitted by 前提是你 on 2019-12-18 04:19:10
Question: I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming I am using the GCC compiler on Ubuntu 10.10 with an Intel Core i7 960 CPU. Here is an attempt based on the article: for two arrays of length ARRAY_SIZE it calculates

    fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5

Here is the code:

    #include <iostream>
    #include <iomanip>
    #include <ctime>
    #include …
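The snippet cuts off at the includes; a minimal sketch of the SSE kernel the formula describes (assuming ARRAY_SIZE is a multiple of 4 and the buffers are 16-byte aligned; compile with g++ -msse or any -march that implies SSE):

    #include <xmmintrin.h>  // SSE

    void compute(const float *fSource1, const float *fSource2,
                 float *fResult, int n)
    {
        const __m128 half = _mm_set1_ps(0.5f);
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_load_ps(fSource1 + i);
            __m128 b = _mm_load_ps(fSource2 + i);
            __m128 s = _mm_add_ps(_mm_mul_ps(a, a), _mm_mul_ps(b, b));
            _mm_store_ps(fResult + i, _mm_add_ps(_mm_sqrt_ps(s), half));
        }
    }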

Fast vectorized conversion from RGB to BGRA

Submitted by 允我心安 on 2019-12-18 04:18:11
Question: In a follow-up to some previous questions on converting RGB to RGBA, and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine, and that I would like to use intrinsics. I'm having difficulty aligning both source and destination buffers to work with 128-bit registers, and am looking for other savvy vectorization solutions. The routine to be vectorized is as follows:

    void RGB8ToBGRX8(int w, const void *in, void *out)
    {
        int i;
        int width = w;
        const unsigned char …
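The routine is cut off above, but for the conversion itself one common approach (SSSE3, if that's acceptable alongside plain SSE) is a pshufb byte shuffle; a hedged sketch with unaligned loads to sidestep the alignment difficulty:

    #include <tmmintrin.h>  /* SSSE3 */

    void RGB8ToBGRX8_ssse3(int w, const void *in, void *out)
    {
        const unsigned char *src = (const unsigned char *)in;
        unsigned char *dst = (unsigned char *)out;
        /* Four RGB triplets (12 bytes) -> four BGRX dwords; -1 zeroes X,
           which the OR below then fills with 0xFF. */
        const __m128i shuf = _mm_setr_epi8(2, 1, 0, -1,  5,  4, 3, -1,
                                           8, 7, 6, -1, 11, 10, 9, -1);
        const __m128i alpha = _mm_set1_epi32((int)0xFF000000u);
        int i = 0;
        for (; i + 5 < w; i += 4) {  /* keep the 16-byte load in bounds */
            __m128i v = _mm_loadu_si128((const __m128i *)(src + 3 * i));
            v = _mm_or_si128(_mm_shuffle_epi8(v, shuf), alpha);
            _mm_storeu_si128((__m128i *)(dst + 4 * i), v);
        }
        for (; i < w; i++) {         /* scalar tail */
            dst[4 * i + 0] = src[3 * i + 2];
            dst[4 * i + 1] = src[3 * i + 1];
            dst[4 * i + 2] = src[3 * i + 0];
            dst[4 * i + 3] = 0xFF;
        }
    }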