SIMD

Porting MMX/SSE instructions to AltiVec

Submitted by 天涯浪子 on 2019-12-07 13:03:02

Question: Let me preface this with.. I have extremely limited experience with ASM, and even less with SIMD. But it happens that I have the following MMX/SSE-optimised code that I would like to port across to AltiVec instructions for use on PPC/Cell processors. This is probably a big ask.. Even though it's only a few lines of code, I've had no end of trouble trying to work out what's going on here. The original function: static inline int convolve(const short *a, const short *b, int n) { int out = 0; …
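The excerpt cuts off after the accumulator declaration, but a function with this signature is almost certainly a 16-bit dot product. A minimal sketch of that shape, with a scalar reference and an SSE2 body built on _mm_madd_epi16 (whose AltiVec counterpart is vec_msum), assuming n is a multiple of 8 — the function names are mine, not the asker's:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Scalar reference: dot product of n 16-bit samples. */
static inline int convolve_scalar(const short *a, const short *b, int n) {
    int out = 0;
    for (int i = 0; i < n; i++)
        out += a[i] * b[i];
    return out;
}

/* SSE2: _mm_madd_epi16 multiplies eight 16-bit pairs and adds adjacent
 * products into four 32-bit sums; AltiVec's vec_msum does the multiply
 * and fold into a 32-bit accumulator in one step. */
static inline int convolve_sse2(const short *a, const short *b, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* Horizontal sum of the four 32-bit lanes. */
    int32_t tmp[4];
    _mm_storeu_si128((__m128i *)tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```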

“Extend” data type size in SSE register

Submitted by 喜你入骨 on 2019-12-07 13:01:04

Question: I'm using VS2005 (at work) and need an SSE intrinsic that does the following: I have a pre-existing __m128i n filled with 16-bit integers a_1, a_2, ..., a_8. Since some calculations that I now want to do require 32 instead of 16 bits, I want to extract the two four-element sets of 16-bit integers from n and put them into two separate __m128i values which contain a_1,...,a_4 and a_5,...,a_8 respectively. I could do this manually using the various _mm_set intrinsics, but those would result in eight movs in …
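With only SSE2 available (VS2005 predates the SSE4.1 _mm_cvtepi16_epi32 one-instruction answer), the standard trick is to unpack a lane against itself and then arithmetic-shift right to replicate the sign bit. A sketch, assuming signed 16-bit inputs (the function name is mine):

```c
#include <emmintrin.h>  /* SSE2 */

/* Sign-extend the eight 16-bit lanes of n into two vectors of four
 * 32-bit lanes each, using SSE2 only. Duplicating each 16-bit lane
 * into both halves of a 32-bit lane and shifting right arithmetically
 * by 16 leaves the sign-extended value. */
static inline void widen_epi16(__m128i n, __m128i *lo, __m128i *hi) {
    *lo = _mm_srai_epi32(_mm_unpacklo_epi16(n, n), 16);  /* a_1..a_4 */
    *hi = _mm_srai_epi32(_mm_unpackhi_epi16(n, n), 16);  /* a_5..a_8 */
}
```

For unsigned inputs, unpacking against a zero vector (_mm_unpacklo_epi16(n, _mm_setzero_si128())) zero-extends instead.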

Which is the most efficient way to extract an arbitrary range of bits from a contiguous sequence of words?

Submitted by 天涯浪子 on 2019-12-07 11:27:43

Question: Suppose we have an std::vector, or any other sequence container (sometimes it will be a deque), which stores uint64_t elements. Now, let's view this vector as a sequence of size() * 64 contiguous bits. I need to find the word formed by the bits in a given [begin, end) range, given that end - begin <= 64 so it fits in a word. The solution I have right now finds the two words whose parts will form the result, and separately masks and combines them. Since I need this to be as efficient as …
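The mask-and-combine approach the asker describes can be sketched as follows, assuming a little-endian bit numbering (bit i lives in word i / 64 at position i % 64) and guarding the shift-by-64 cases, which are undefined behaviour in C and C++ (function name is mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Extract bits [begin, end), end - begin <= 64, from a contiguous array
 * of uint64_t words viewed as one long bit string. Touches at most the
 * two words straddled by the range. */
static inline uint64_t extract_bits(const uint64_t *w, size_t begin, size_t end) {
    size_t idx = begin / 64;
    size_t off = begin % 64;
    size_t len = end - begin;
    uint64_t out = w[idx] >> off;
    /* Pull the remaining high bits from the next word; the off != 0
     * test avoids an undefined shift by 64. */
    if (off != 0 && len > 64 - off)
        out |= w[idx + 1] << (64 - off);
    if (len < 64)  /* likewise avoid (1 << 64) when the range is full-width */
        out &= (UINT64_C(1) << len) - 1;
    return out;
}
```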

determinant calculation with SIMD

Submitted by 这一生的挚爱 on 2019-12-07 09:23:33

Question: Does there exist an approach for calculating the determinant of matrices with low dimensions (about 4) that works well with SIMD (NEON, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE all the way to SSE3, and NEON, both under Linux. The matrix elements are all floats. Answer 1: Here's my 5 cents. Determinant of a 2x2 matrix: that's an exercise for the reader, should be simple to implement. Determinant of a 3x3 matrix: use the scalar triple product. This …
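The scalar-triple-product formulation the answer recommends, det = r0 · (r1 × r2), is shown below in scalar form for clarity; it maps well onto SIMD because both the cross product and the dot product reduce to shuffles, multiplies, and subtracts. Row-major storage is my assumption:

```c
/* Determinant of a row-major 3x3 matrix as the scalar triple product
 * r0 . (r1 x r2). Each component of the cross product is a 2x2 minor. */
static inline float det3x3(const float m[9]) {
    float cx = m[4] * m[8] - m[5] * m[7];  /* (r1 x r2).x */
    float cy = m[5] * m[6] - m[3] * m[8];  /* (r1 x r2).y */
    float cz = m[3] * m[7] - m[4] * m[6];  /* (r1 x r2).z */
    return m[0] * cx + m[1] * cy + m[2] * cz;
}
```

For 4x4, the usual SIMD-friendly route is expansion by 2x2 sub-determinants rather than cofactor expansion along a row.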

Horizontal trailing maximum on AVX or SSE

Submitted by 安稳与你 on 2019-12-07 07:07:03

Question: I have an __m256i register consisting of 16-bit values, and I want every zero element to take the value of the nearest preceding non-zero element. To give an example: input: 1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2; output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2. Is there any efficient way of doing this on AVX or SSE? Maybe with log(16) = 4 iterations? Addition: any solution on 128-bit vectors with eight uint16_t values in them is appreciated as well. Answer 1: You can do this in log_2(SIMD_width) steps indeed. The idea is to …
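The log-step scheme from the answer, sketched for the 128-bit, eight-uint16_t case from the addition: at each step, shift the vector by 1, then 2, then 4 lanes and let the lanes that are still zero pick up the shifted value. After step k, every zero within 2^k - 1 lanes of a non-zero predecessor is filled. SSE2 is assumed; the function name is mine:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Forward-fill zeros: each zero 16-bit lane takes the value of the
 * nearest preceding non-zero lane, in log2(8) = 3 steps. The OR works
 * as a blend because the masked lanes are zero by construction. */
static inline __m128i fill_zeros_sse2(__m128i x) {
    const __m128i zero = _mm_setzero_si128();
    __m128i m;
    m = _mm_cmpeq_epi16(x, zero);  /* which lanes are still zero? */
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 2)));  /* shift 1 lane */
    m = _mm_cmpeq_epi16(x, zero);
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 4)));  /* shift 2 lanes */
    m = _mm_cmpeq_epi16(x, zero);
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 8)));  /* shift 4 lanes */
    return x;
}
```

The same doubling extends to __m256i with one extra step, though AVX2's 128-bit lane boundary makes the cross-lane shift slightly more involved.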

Does .NET Framework 4.5 provide SSE4/AVX support?

Submitted by …衆ロ難τιáo~ on 2019-12-07 06:30:35

Question: I think I heard about that, but I don't know where. Update: I am asking about the JIT. Answer 1: It seems that it is coming (I just found out an hour ago). Here are a few links: The JIT finally proposed. JIT and SIMD are getting married. Update to SIMD Support. You need the latest version of RyuJIT and Microsoft SIMD-enabled Vector Types (NuGet). Answer 2: No, there's no scenario in .NET where you can write machine code yourself. Code generation is entirely up to the just-in-time compiler. It is certainly capable of …

Add all elements in a lane

Submitted by 半世苍凉 on 2019-12-07 06:09:30

Question: Is there an intrinsic which allows one to add all of the elements in a lane? I am using NEON to multiply 8 numbers together, and I need to sum the result. Here is some paraphrased code to show what I'm currently doing (this could probably be optimised): int16_t p[8], q[8], r[8]; int32_t sum; int16x8_t pneon, qneon, result; p[0] = some_number; p[1] = some_other_number; //etc etc pneon = vld1q_s16(p); q[0] = some_other_other_number; q[1] = some_other_other_other_number; //etc etc qneon = vld1q…
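A sketch of the multiply-then-reduce pattern the asker describes, ARM-only (it will not build on x86). vmull_s16/vmlal_s16 do the widening multiply so the 16-bit products cannot overflow, and the horizontal sum is a single vaddvq_s32 on AArch64 or a pairwise-add fold on 32-bit ARM; the function name is mine:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Multiply eight int16_t pairs and sum the products into one int32_t. */
int32_t dot8_neon(const int16_t *p, const int16_t *q) {
    int16x8_t vp = vld1q_s16(p);
    int16x8_t vq = vld1q_s16(q);
    /* Widening multiply-accumulate: four 32-bit products per half. */
    int32x4_t prod = vmull_s16(vget_low_s16(vp), vget_low_s16(vq));
    prod = vmlal_s16(prod, vget_high_s16(vp), vget_high_s16(vq));
#if defined(__aarch64__)
    return vaddvq_s32(prod);  /* add across all lanes in one intrinsic */
#else
    /* ARMv7: fold high half onto low, then pairwise-add to one lane. */
    int32x2_t s = vadd_s32(vget_low_s32(prod), vget_high_s32(prod));
    s = vpadd_s32(s, s);
    return vget_lane_s32(s, 0);
#endif
}
```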

Simd matmul program gives different numerical results

Submitted by 对着背影说爱祢 on 2019-12-07 05:22:16

Question: I am trying to program matrix multiplication in C using SIMD intrinsics. I was pretty sure of my implementation, but when I execute it, I get some numerical errors starting from the 5th digit of the resulting matrix's coefficients. REAL_T is just a float via typedef. /* This is my matmul version with SIMD, using single-precision floats */ void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){ int i,j,k; __m256 vA, vB, vC, vRes; for (i=0; i<n; i++){ for (j=0; j<n; j++){ for (k=0; k<n; k= k+8){ …
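Differences around the 5th significant digit are the classic symptom of float reassociation, not a bug: the SIMD loop keeps several partial sums per output element and combines them in a different order than the scalar reference, and float addition is not associative. A minimal demonstration with my own data, comparing a left-to-right sum against the four-accumulator order a vectorized loop effectively uses:

```c
/* Left-to-right scalar sum. */
static float sum_serial(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* Four interleaved accumulators, like one SSE lane each, combined at
 * the end: the association a vectorized reduction performs. */
static float sum_strided4(const float *x, int n) {
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < 4; j++) s[j] += x[i + j];
    return (s[0] + s[1]) + (s[2] + s[3]);
}
```

With x = {1e8, 3, 3, ...}, the serial sum absorbs each 3 into 1e8's rounding (the spacing between floats near 1e8 is 8), while the strided version first combines the small values, so the two results differ even though both are "correct" to within rounding.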

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

Submitted by 青春壹個敷衍的年華 on 2019-12-07 03:46:31

Question: The code I want to optimize is basically a simple but large arithmetic formula; it should be fairly simple to analyze the code automatically and compute the independent multiplications/additions in parallel, but I read that autovectorization only works for loops. I've read multiple times now that accessing single elements of a vector via a union or some other way should be avoided at all costs, and should instead be replaced by a _mm_shuffle_pd (I'm working on doubles only)... I don't seem to figure …
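For __m128d specifically, there are dedicated intrinsics that move one lane straight to memory or to a scalar, so neither a union nor a shuffle is needed. A sketch (function names are mine):

```c
#include <emmintrin.h>  /* SSE2 */

/* Write the two lanes of a __m128d to separate doubles:
 * _mm_storel_pd stores element 0, _mm_storeh_pd stores element 1. */
static inline void split_m128d(__m128d v, double *lo, double *hi) {
    _mm_storel_pd(lo, v);
    _mm_storeh_pd(hi, v);
}

/* Read the low lane directly into a scalar register, no store needed. */
static inline double low_m128d(__m128d v) {
    return _mm_cvtsd_f64(v);
}
```

For the high lane as a scalar, _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)) does it with one shuffle, which is the pattern the "use _mm_shuffle_pd" advice refers to.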

What is the fastest way to do a SIMD gather without AVX(2)?

Submitted by 我的未来我决定 on 2019-12-07 03:11:03

Question: Assuming I have SSE to SSE4.1, but not AVX(2), what is the fastest way to load a packed memory layout like this (all 32-bit integers): a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3 into four vectors a, b, c, d? a: {a0, a1, a2, a3} b: {b0, b1, b2, b3} c: {c0, c1, c2, c3} d: {d0, d1, d2, d3} I'm not sure whether this is relevant or not, but in my actual application I have 16 vectors, and as such a0 and a1 are 16*4 bytes apart in memory. Answer 1: What you need here is 4 loads followed by a 4x4 …
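The 4-loads-plus-4x4-transpose the answer starts to describe can be sketched with SSE2 unpacks (8 shuffles total); a stride parameter covers the asker's real layout, where consecutive groups are 16 elements apart (function name is mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* De-interleave a0 b0 c0 d0 | a1 b1 c1 d1 | ... into a, b, c, d.
 * `stride` is the distance in 32-bit elements between groups
 * (4 for the packed layout above, 16 in the asker's application). */
static inline void gather4x4(const int32_t *p, int stride,
                             __m128i *a, __m128i *b, __m128i *c, __m128i *d) {
    __m128i r0 = _mm_loadu_si128((const __m128i *)(p + 0 * stride));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(p + 1 * stride));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(p + 2 * stride));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(p + 3 * stride));
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);  /* a0 a1 b0 b1 */
    __m128i t1 = _mm_unpacklo_epi32(r2, r3);  /* a2 a3 b2 b3 */
    __m128i t2 = _mm_unpackhi_epi32(r0, r1);  /* c0 c1 d0 d1 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);  /* c2 c3 d2 d3 */
    *a = _mm_unpacklo_epi64(t0, t1);          /* a0 a1 a2 a3 */
    *b = _mm_unpackhi_epi64(t0, t1);          /* b0 b1 b2 b3 */
    *c = _mm_unpacklo_epi64(t2, t3);          /* c0 c1 c2 c3 */
    *d = _mm_unpackhi_epi64(t2, t3);          /* d0 d1 d2 d3 */
}
```

For floats the same structure is packaged as the _MM_TRANSPOSE4_PS macro in xmmintrin.h.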