sse

Speedup a short to float cast?

Submitted by 亡梦爱人 on 2019-11-29 04:42:24
I have a short-to-float cast in C++ that is bottlenecking my code. The code translates from a hardware device buffer which is natively shorts; this represents the input from a fancy photon counter.

    float factor = 1.0f / value;
    for (int i = 0; i < W * H; i++)  // 25% of time is spent doing this
    {
        int value = source[i];            // ushort -> int
        destination[i] = value * factor;  // int * float -> float
    }

A few details: value should go from 0 to 2^16-1; it represents the pixel values of a highly sensitive camera. I'm on a multicore x86 machine with an i7 processor (i7 960, which has SSE 4.2 and 4.1). Source is aligned to an 8
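For reference, with SSE4.1 (which the i7 960 supports) the loop can be widened to eight pixels per iteration: zero-extend the shorts with _mm_cvtepu16_epi32, convert to float, and multiply by the scale factor. This is only a sketch, assuming W*H is a multiple of 8 and 16-byte aligned buffers; the function name convert_u16_to_f32 is illustrative, not from the question.

    #include <smmintrin.h>  /* SSE4.1: _mm_cvtepu16_epi32 */

    /* Convert n unsigned shorts to floats scaled by `factor`.
       Assumes n is a multiple of 8 and 16-byte aligned buffers. */
    void convert_u16_to_f32(const unsigned short *src, float *dst, int n, float factor)
    {
        const __m128 vfactor = _mm_set1_ps(factor);
        for (int i = 0; i < n; i += 8) {
            __m128i v  = _mm_load_si128((const __m128i *)(src + i));  /* 8 ushorts     */
            __m128i lo = _mm_cvtepu16_epi32(v);                       /* low 4 -> i32  */
            __m128i hi = _mm_cvtepu16_epi32(_mm_srli_si128(v, 8));    /* high 4 -> i32 */
            _mm_store_ps(dst + i,     _mm_mul_ps(_mm_cvtepi32_ps(lo), vfactor));
            _mm_store_ps(dst + i + 4, _mm_mul_ps(_mm_cvtepi32_ps(hi), vfactor));
        }
    }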

indexing into an array with SSE

Submitted by 旧城冷巷雨未停 on 2019-11-29 03:46:05
Suppose I have an array:

    uint8_t arr[256];

and an element __m128i x containing 16 bytes, x_1, x_2, ..., x_16. I would like to efficiently fill a new __m128i element y with values from arr depending on the values in x, such that:

    y_1  = arr[x_1]
    y_2  = arr[x_2]
    ...
    y_16 = arr[x_16]

A command to achieve this would essentially be loading a register from a non-contiguous set of memory locations. I have a painfully vague memory of having seen documentation of such a command, but can't find it now. Does it exist? Thanks in advance for your help. This kind of capability in SIMD architectures
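SSE itself has no byte-granularity gather instruction (PSHUFB only indexes a 16-byte table held in a register), so a common fallback is to spill the indices, do 16 scalar table lookups, and reload the result. A minimal sketch, assuming nothing beyond SSE2; the helper name lookup_bytes is illustrative:

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    __m128i lookup_bytes(const uint8_t arr[256], __m128i x)
    {
        uint8_t idx[16], out[16];
        _mm_storeu_si128((__m128i *)idx, x);   /* spill the 16 indices   */
        for (int i = 0; i < 16; ++i)
            out[i] = arr[idx[i]];              /* scalar gather from arr */
        return _mm_loadu_si128((const __m128i *)out);
    }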

What's the proper way to use different versions of SSE intrinsics in GCC?

Submitted by 微笑、不失礼 on 2019-11-29 03:06:05
Question: I will ask my question by giving an example. I have a function called do_something(). It has three versions: do_something(), do_something_sse3(), and do_something_sse4(). When my program runs, it detects the CPU features (whether it supports SSE3 or SSE4) and calls one of the three versions accordingly. The problem is: when I build my program with GCC, I have to set -msse4 for do_something_sse4() to compile (e.g. for the header file <smmintrin.h> to be included). However, if I set
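One pattern that sidesteps the file-wide -msse4 flag, sketched here under the assumption of a reasonably recent GCC (roughly 4.9+ for per-function target attributes, 4.8+ for __builtin_cpu_supports): mark each variant with its own target attribute and dispatch at run time. The function names follow the question except do_something_generic, which is illustrative; the bodies are placeholders.

    #include <immintrin.h>

    __attribute__((target("sse4.1")))
    static void do_something_sse4(void)
    {
        /* SSE4.1 intrinsics (<smmintrin.h>) are legal inside this function,
           even without -msse4.1 on the whole file. */
    }

    __attribute__((target("sse3")))
    static void do_something_sse3(void)
    {
        /* SSE3 intrinsics (<pmmintrin.h>) are legal inside this function. */
    }

    static void do_something_generic(void)
    {
        /* portable fallback */
    }

    void do_something(void)
    {
        if (__builtin_cpu_supports("sse4.1"))
            do_something_sse4();
        else if (__builtin_cpu_supports("sse3"))
            do_something_sse3();
        else
            do_something_generic();
    }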

How to multiply two quaternions with minimal instructions?

Submitted by 核能气质少年 on 2019-11-29 03:05:41
Question: After some thought, I came up with the following code for multiplying two quaternions using SSE:

    #include <pmmintrin.h> /* SSE3 intrinsics */

    /* multiplication of two quaternions (x, y, z, w) x (a, b, c, d) */
    __m128 _mm_cross4_ps(__m128 xyzw, __m128 abcd)
    {
        /* The product of two quaternions is: */
        /* (X,Y,Z,W) = (xd+yc-zb+wa, -xc+yd+za+wb, xb-ya+zd+wc, -xa-yb-zc+wd) */
        __m128 wzyx = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(0,1,2,3));
        __m128 baba = _mm_shuffle_ps(abcd, abcd, _MM_SHUFFLE(0,1,0,1
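The excerpt is cut off, so as a grounding point here is a plain scalar version that implements exactly the product formula quoted in the comment above; it makes a handy reference when checking a shuffle-based implementation. The name quat_mul_reference is illustrative, and the component order (x, y, z, w) x (a, b, c, d) is taken from the question as written.

    #include <xmmintrin.h>

    __m128 quat_mul_reference(__m128 xyzw, __m128 abcd)
    {
        float q1[4], q2[4], r[4];
        _mm_storeu_ps(q1, xyzw);  /* q1 = {x, y, z, w} */
        _mm_storeu_ps(q2, abcd);  /* q2 = {a, b, c, d} */
        float x = q1[0], y = q1[1], z = q1[2], w = q1[3];
        float a = q2[0], b = q2[1], c = q2[2], d = q2[3];
        r[0] =  x*d + y*c - z*b + w*a;
        r[1] = -x*c + y*d + z*a + w*b;
        r[2] =  x*b - y*a + z*d + w*c;
        r[3] = -x*a - y*b - z*c + w*d;
        return _mm_loadu_ps(r);
    }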

SIMD programming languages

Submitted by 删除回忆录丶 on 2019-11-29 02:54:12
Question: In the last couple of years, I've been doing a lot of SIMD programming, and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on assembly to get to the really nifty stuff. However, up until now I've hardly been able to find any programming language with built-in support for SIMD. Now obviously there are the shader languages such as HLSL, Cg, and GLSL that have native support for this kind of stuff; however, I'm looking for

reduction with OpenMP with SSE/AVX

Submitted by ▼魔方 西西 on 2019-11-29 02:33:36
I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to:

    inline float sum_scalar_openmp2(const float a[], const size_t N)
    {
        float sum = 0.0f;
        #pragma omp parallel
        {
            float sum_private = 0.0f;
            #pragma omp for nowait
            for (int i = 0; i < N; i++) {
                sum_private += a[i];
            }
            #pragma omp atomic
            sum += sum_private;
        }
        return sum;
    }

I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause But atomic also does not support complex operators. What I did was replace atomic with critical and implemented the
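One way to combine the two levels explicitly is to let each thread accumulate its chunk in an SSE register and have OpenMP's reduction clause merge the per-thread partial sums, which avoids atomic/critical altogether. A rough sketch, assuming N is a multiple of 4 (otherwise a scalar tail is needed); the function name is illustrative.

    #include <emmintrin.h>
    #include <stddef.h>

    float sum_sse_openmp(const float a[], size_t N)
    {
        float sum = 0.0f;
        #pragma omp parallel reduction(+:sum)
        {
            __m128 vsum = _mm_setzero_ps();
            #pragma omp for nowait
            for (ptrdiff_t i = 0; i < (ptrdiff_t)N; i += 4)
                vsum = _mm_add_ps(vsum, _mm_loadu_ps(a + i));
            /* horizontal sum of the four lanes */
            __m128 t = _mm_add_ps(vsum, _mm_movehl_ps(vsum, vsum));
            t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
            sum += _mm_cvtss_f32(t);   /* reduction clause combines the partials */
        }
        return sum;
    }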

How can I check if my installed numpy is compiled with SSE/SSE2 instruction set?

Submitted by 我的梦境 on 2019-11-29 01:34:14
Question: How can I check if my installed version of numpy is compiled with the SSE/SSE2 instruction set? I know that some parts of numpy use BLAS; how do I check BLAS too?

Answer 1: Take a look at:

    import numpy.distutils.system_info as sysinfo
    sysinfo.show_all()

This will print out all of the information about what numpy was compiled against.

Answer 2: I think that one way is to use objdump on a numpy .so file if you are under Linux, and grep for instructions that are specific to SSE. For SSE3 (http://en.wikipedia

Fast counting the number of equal bytes between two arrays [duplicate]

Submitted by ≯℡__Kan透↙ on 2019-11-28 23:41:39
This question already has an answer here: Can counting byte matches between two strings be optimized using SIMD? (3 answers) I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) in order to compare two 16-byte numbers using SSE instructions: this function returns how many bytes are equal after performing the comparison. Now I would like to use the above function in order to compare two byte arrays of arbitrary length: the length may not be a multiple of 16 bytes, so I need to deal with this problem. How could I complete the implementation of the function below? How could I improve the
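A sketch of how the pieces could fit together: compare_16bytes can be written as a byte-wise compare-for-equality plus a movemask and a population count, and an arbitrary-length wrapper processes 16 bytes at a time with a scalar loop for the leftover tail. Here the population count uses SSE4.2's _mm_popcnt_u32 (an assumption about the target; __builtin_popcount works elsewhere), and count_equal_bytes is an illustrative name rather than the asker's function.

    #include <emmintrin.h>   /* SSE2 */
    #include <nmmintrin.h>   /* SSE4.2: _mm_popcnt_u32 */
    #include <stddef.h>
    #include <stdint.h>

    int compare_16bytes(__m128i lhs, __m128i rhs)
    {
        __m128i eq = _mm_cmpeq_epi8(lhs, rhs);              /* 0xFF where bytes match */
        return _mm_popcnt_u32((unsigned)_mm_movemask_epi8(eq));
    }

    size_t count_equal_bytes(const uint8_t *a, const uint8_t *b, size_t n)
    {
        size_t count = 0, i = 0;
        for (; i + 16 <= n; i += 16) {                      /* full 16-byte blocks */
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            count += (size_t)compare_16bytes(va, vb);
        }
        for (; i < n; ++i)                                  /* leftover bytes */
            count += (a[i] == b[i]);
        return count;
    }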

What does ordered / unordered comparison mean?

Submitted by 老子叫甜甜 on 2019-11-28 22:34:32
Looking at the SSE operators:

    CMPORDPS   - ordered compare packed singles
    CMPUNORDPS - unordered compare packed singles

What do ordered and unordered mean? I looked for equivalent instructions in the x86 instruction set, and it only seems to have unordered (FUCOM).

Answer (Mysticial): An ordered comparison checks if neither operand is NaN. Conversely, an unordered comparison checks if either operand is a NaN. This page gives some more information on this: http://csapp.cs.cmu.edu/public/waside/waside-sse.pdf (section 5). The idea here is that comparisons with NaN are indeterminate. (can't decide the result
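A small self-contained demo of the difference using the corresponding intrinsics: with a NaN in either lane, the ordered compare yields false for that lane and the unordered compare yields true (the expected masks in the comments follow the lane layout set up below).

    #include <xmmintrin.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* lanes 0..3:  a = {4.0, NaN, 2.0, 1.0},  b = {4.0, 3.0, NaN, 1.0} */
        __m128 a = _mm_set_ps(1.0f, 2.0f, NAN, 4.0f);
        __m128 b = _mm_set_ps(1.0f, NAN, 3.0f, 4.0f);

        __m128 ord   = _mm_cmpord_ps(a, b);    /* lane true iff neither is NaN */
        __m128 unord = _mm_cmpunord_ps(a, b);  /* lane true iff either is NaN  */

        printf("ordered mask:   0x%x\n", _mm_movemask_ps(ord));    /* 0x9 */
        printf("unordered mask: 0x%x\n", _mm_movemask_ps(unord));  /* 0x6 */
        return 0;
    }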

Speed up matrix multiplication by SSE (C++)

Submitted by 送分小仙女□ on 2019-11-28 21:32:51
I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.

1) I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
2) Do I need the Intel compiler to do it?
3) Can you point out some references?

Thanks everybody! The Eigen C++ template library for vectors, matrices, ... has both
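For reference, and without needing the Intel compiler, one straightforward way to vectorize the 5x5 case with plain SSE intrinsics is to pad the constant matrix to 8 rows, store it column-major, and accumulate broadcast-multiply-adds; only the first 5 outputs are meaningful. This is a rough sketch with illustrative names, not a tuned implementation.

    #include <xmmintrin.h>

    struct Mat5x5Padded {
        __m128 col_lo[5];  /* rows 0..3 of each column          */
        __m128 col_hi[5];  /* rows 4..7 (rows 5..7 padded to 0) */
    };

    void matvec5(const Mat5x5Padded &m, const float x[5], float y[8])
    {
        __m128 acc_lo = _mm_setzero_ps();
        __m128 acc_hi = _mm_setzero_ps();
        for (int j = 0; j < 5; ++j) {
            __m128 xj = _mm_set1_ps(x[j]);  /* broadcast x[j] */
            acc_lo = _mm_add_ps(acc_lo, _mm_mul_ps(m.col_lo[j], xj));
            acc_hi = _mm_add_ps(acc_hi, _mm_mul_ps(m.col_hi[j], xj));
        }
        _mm_storeu_ps(y,     acc_lo);  /* y[0..3]                        */
        _mm_storeu_ps(y + 4, acc_hi);  /* y[4..7]; y[5..7] are padding   */
    }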