SIMD

Porting MMX/SSE instructions to AltiVec

Submitted by 天涯浪子 on 2019-12-07 13:03:02

Question: Let me preface this with.. I have extremely limited experience with ASM, and even less with SIMD. But it happens that I have the following MMX/SSE-optimised code that I would like to port across to AltiVec instructions for use on PPC/Cell processors. This is probably a big ask.. Even though it's only a few lines of code, I've had no end of trouble trying to work out what's going on here. The original function: static inline int convolve(const short *a, const short *b, int n) { int out = 0; …
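The excerpt cuts off after the accumulator declaration, but a function with this signature is almost certainly a 16-bit dot product. A minimal sketch of that shape, with a scalar reference and an SSE2 body built on _mm_madd_epi16 (whose AltiVec counterpart is vec_msum), assuming n is a multiple of 8 — the function names are mine, not the asker's:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Scalar reference: dot product of n 16-bit samples. */
static inline int convolve_scalar(const short *a, const short *b, int n) {
    int out = 0;
    for (int i = 0; i < n; i++)
        out += a[i] * b[i];
    return out;
}

/* SSE2: _mm_madd_epi16 multiplies eight 16-bit pairs and adds adjacent
 * products into four 32-bit sums; AltiVec's vec_msum does the multiply
 * and fold into a 32-bit accumulator in one step. */
static inline int convolve_sse2(const short *a, const short *b, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* Horizontal sum of the four 32-bit lanes. */
    int32_t tmp[4];
    _mm_storeu_si128((__m128i *)tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```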

“Extend” data type size in SSE register

Submitted by 喜你入骨 on 2019-12-07 13:01:04

Question: I'm using VS2005 (at work) and need an SSE intrinsic that does the following: I have a pre-existing __m128i n filled with 16-bit integers a_1, a_2, ..., a_8. Since some calculations that I now want to do require 32 instead of 16 bits, I want to extract the two four-element sets of 16-bit integers from n and put them into two separate __m128i values which contain a_1,...,a_4 and a_5,...,a_8 respectively. I could do this manually using the various _mm_set intrinsics, but those would result in eight movs in …
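With only SSE2 available (VS2005 predates the SSE4.1 _mm_cvtepi16_epi32 one-instruction answer), the standard trick is to unpack a lane against itself and then arithmetic-shift right to replicate the sign bit. A sketch, assuming signed 16-bit inputs (the function name is mine):

```c
#include <emmintrin.h>  /* SSE2 */

/* Sign-extend the eight 16-bit lanes of n into two vectors of four
 * 32-bit lanes each, using SSE2 only. Duplicating each 16-bit lane
 * into both halves of a 32-bit lane and shifting right arithmetically
 * by 16 leaves the sign-extended value. */
static inline void widen_epi16(__m128i n, __m128i *lo, __m128i *hi) {
    *lo = _mm_srai_epi32(_mm_unpacklo_epi16(n, n), 16);  /* a_1..a_4 */
    *hi = _mm_srai_epi32(_mm_unpackhi_epi16(n, n), 16);  /* a_5..a_8 */
}
```

For unsigned inputs, unpacking against a zero vector (_mm_unpacklo_epi16(n, _mm_setzero_si128())) zero-extends instead.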

Which is the most efficient way to extract an arbitrary range of bits from a contiguous sequence of words?

Submitted by 天涯浪子 on 2019-12-07 11:27:43

Question: Suppose we have an std::vector, or any other sequence container (sometimes it will be a deque), which stores uint64_t elements. Now, let's view this vector as a sequence of size() * 64 contiguous bits. I need to find the word formed by the bits in a given [begin, end) range, given that end - begin <= 64 so it fits in a word. The solution I have right now finds the two words whose parts will form the result, and separately masks and combines them. Since I need this to be as efficient as …
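The mask-and-combine approach the asker describes can be sketched as follows, assuming a little-endian bit numbering (bit i lives in word i / 64 at position i % 64) and guarding the shift-by-64 cases, which are undefined behaviour in C and C++ (function name is mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Extract bits [begin, end), end - begin <= 64, from a contiguous array
 * of uint64_t words viewed as one long bit string. Touches at most the
 * two words straddled by the range. */
static inline uint64_t extract_bits(const uint64_t *w, size_t begin, size_t end) {
    size_t idx = begin / 64;
    size_t off = begin % 64;
    size_t len = end - begin;
    uint64_t out = w[idx] >> off;
    /* Pull the remaining high bits from the next word; the off != 0
     * test avoids an undefined shift by 64. */
    if (off != 0 && len > 64 - off)
        out |= w[idx + 1] << (64 - off);
    if (len < 64)  /* likewise avoid (1 << 64) when the range is full-width */
        out &= (UINT64_C(1) << len) - 1;
    return out;
}
```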

determinant calculation with SIMD

Submitted by 这一生的挚爱 on 2019-12-07 09:23:33

Question: Does there exist an approach for calculating the determinant of matrices with low dimensions (about 4) that works well with SIMD (NEON, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE all the way to SSE3, and NEON, both under Linux. The matrix elements are all floats. Answer 1: Here's my 5 cents. Determinant of a 2x2 matrix: that's an exercise for the reader, should be simple to implement. Determinant of a 3x3 matrix: use the scalar triple product. This …
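The scalar-triple-product formulation the answer recommends, det = r0 · (r1 × r2), is shown below in scalar form for clarity; it maps well onto SIMD because both the cross product and the dot product reduce to shuffles, multiplies, and subtracts. Row-major storage is my assumption:

```c
/* Determinant of a row-major 3x3 matrix as the scalar triple product
 * r0 . (r1 x r2). Each component of the cross product is a 2x2 minor. */
static inline float det3x3(const float m[9]) {
    float cx = m[4] * m[8] - m[5] * m[7];  /* (r1 x r2).x */
    float cy = m[5] * m[6] - m[3] * m[8];  /* (r1 x r2).y */
    float cz = m[3] * m[7] - m[4] * m[6];  /* (r1 x r2).z */
    return m[0] * cx + m[1] * cy + m[2] * cz;
}
```

For 4x4, the usual SIMD-friendly route is expansion by 2x2 sub-determinants rather than cofactor expansion along a row.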

Horizontal trailing maximum on AVX or SSE

Submitted by 安稳与你 on 2019-12-07 07:07:03

Question: I have an __m256i register consisting of 16-bit values, and I want every zero element to take the value of the nearest preceding non-zero element. To give an example: input: 1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2; output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2. Is there any efficient way of doing this on AVX or SSE? Maybe with log(16) = 4 iterations? Addition: any solution on 128-bit vectors with eight uint16_t values in them is appreciated as well. Answer 1: You can do this in log_2(SIMD_width) steps indeed. The idea is to …
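The log-step scheme from the answer, sketched for the 128-bit, eight-uint16_t case from the addition: at each step, shift the vector by 1, then 2, then 4 lanes and let the lanes that are still zero pick up the shifted value. After step k, every zero within 2^k - 1 lanes of a non-zero predecessor is filled. SSE2 is assumed; the function name is mine:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Forward-fill zeros: each zero 16-bit lane takes the value of the
 * nearest preceding non-zero lane, in log2(8) = 3 steps. The OR works
 * as a blend because the masked lanes are zero by construction. */
static inline __m128i fill_zeros_sse2(__m128i x) {
    const __m128i zero = _mm_setzero_si128();
    __m128i m;
    m = _mm_cmpeq_epi16(x, zero);  /* which lanes are still zero? */
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 2)));  /* shift 1 lane */
    m = _mm_cmpeq_epi16(x, zero);
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 4)));  /* shift 2 lanes */
    m = _mm_cmpeq_epi16(x, zero);
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 8)));  /* shift 4 lanes */
    return x;
}
```

The same doubling extends to __m256i with one extra step, though AVX2's 128-bit lane boundary makes the cross-lane shift slightly more involved.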

Does .NET Framework 4.5 provide SSE4/AVX support?

Submitted by …衆ロ難τιáo~ on 2019-12-07 06:30:35

Question: I think I heard about that, but I don't know where. Update: I am asking about the JIT. Answer 1: It seems that it is coming (I just found out an hour ago). Here are a few links: The JIT finally proposed. JIT and SIMD are getting married. Update to SIMD Support. You need the latest version of RyuJIT and Microsoft SIMD-enabled Vector Types (NuGet). Answer 2: No, there's no scenario in .NET where you can write machine code yourself. Code generation is entirely up to the just-in-time compiler. It is certainly capable of …

Add all elements in a lane

Submitted by 半世苍凉 on 2019-12-07 06:09:30

Question: Is there an intrinsic which allows one to add all of the elements in a lane? I am using NEON to multiply 8 numbers together, and I need to sum the result. Here is some paraphrased code to show what I'm currently doing (this could probably be optimised): int16_t p[8], q[8], r[8]; int32_t sum; int16x8_t pneon, qneon, result; p[0] = some_number; p[1] = some_other_number; //etc etc pneon = vld1q_s16(p); q[0] = some_other_other_number; q[1] = some_other_other_other_number; //etc etc qneon = vld1q…
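A sketch of the multiply-then-reduce pattern the asker describes, ARM-only (it will not build on x86). vmull_s16/vmlal_s16 do the widening multiply so the 16-bit products cannot overflow, and the horizontal sum is a single vaddvq_s32 on AArch64 or a pairwise-add fold on 32-bit ARM; the function name is mine:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Multiply eight int16_t pairs and sum the products into one int32_t. */
int32_t dot8_neon(const int16_t *p, const int16_t *q) {
    int16x8_t vp = vld1q_s16(p);
    int16x8_t vq = vld1q_s16(q);
    /* Widening multiply-accumulate: four 32-bit products per half. */
    int32x4_t prod = vmull_s16(vget_low_s16(vp), vget_low_s16(vq));
    prod = vmlal_s16(prod, vget_high_s16(vp), vget_high_s16(vq));
#if defined(__aarch64__)
    return vaddvq_s32(prod);  /* add across all lanes in one intrinsic */
#else
    /* ARMv7: fold high half onto low, then pairwise-add to one lane. */
    int32x2_t s = vadd_s32(vget_low_s32(prod), vget_high_s32(prod));
    s = vpadd_s32(s, s);
    return vget_lane_s32(s, 0);
#endif
}
```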

Simd matmul program gives different numerical results

Submitted by 对着背影说爱祢 on 2019-12-07 05:22:16

Question: I am trying to program matrix multiplication in C using SIMD intrinsics. I was pretty sure of my implementation, but when I execute it, I get some numerical errors starting from the 5th digit of the resulting matrix's coefficients. REAL_T is just a float via typedef. /* This is my matmul version with SIMD, using single-precision floats */ void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){ int i,j,k; __m256 vA, vB, vC, vRes; for (i=0; i<n; i++){ for (j=0; j<n; j++){ for (k=0; k<n; k= k+8){ …
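Differences around the 5th significant digit are the classic symptom of float reassociation, not a bug: the SIMD loop keeps several partial sums per output element and combines them in a different order than the scalar reference, and float addition is not associative. A minimal demonstration with my own data, comparing a left-to-right sum against the four-accumulator order a vectorized loop effectively uses:

```c
/* Left-to-right scalar sum. */
static float sum_serial(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* Four interleaved accumulators, like one SSE lane each, combined at
 * the end: the association a vectorized reduction performs. */
static float sum_strided4(const float *x, int n) {
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < 4; j++) s[j] += x[i + j];
    return (s[0] + s[1]) + (s[2] + s[3]);
}
```

With x = {1e8, 3, 3, ...}, the serial sum absorbs each 3 into 1e8's rounding (the spacing between floats near 1e8 is 8), while the strided version first combines the small values, so the two results differ even though both are "correct" to within rounding.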

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

Submitted by 青春壹個敷衍的年華 on 2019-12-07 03:46:31

Question: The code I want to optimize is basically a simple but large arithmetic formula; it should be fairly simple to analyze the code automatically and compute the independent multiplications/additions in parallel, but I read that autovectorization only works for loops. I've read multiple times now that accessing single elements of a vector via a union or some other way should be avoided at all costs, and should instead be replaced by a _mm_shuffle_pd (I'm working on doubles only)... I don't seem to figure …
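For __m128d specifically, there are dedicated intrinsics that move one lane straight to memory or to a scalar, so neither a union nor a shuffle is needed. A sketch (function names are mine):

```c
#include <emmintrin.h>  /* SSE2 */

/* Write the two lanes of a __m128d to separate doubles:
 * _mm_storel_pd stores element 0, _mm_storeh_pd stores element 1. */
static inline void split_m128d(__m128d v, double *lo, double *hi) {
    _mm_storel_pd(lo, v);
    _mm_storeh_pd(hi, v);
}

/* Read the low lane directly into a scalar register, no store needed. */
static inline double low_m128d(__m128d v) {
    return _mm_cvtsd_f64(v);
}
```

For the high lane as a scalar, _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)) does it with one shuffle, which is the pattern the "use _mm_shuffle_pd" advice refers to.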

What is the fastest way to do a SIMD gather without AVX(2)?

Submitted by 我的未来我决定 on 2019-12-07 03:11:03

Question: Assuming I have SSE to SSE4.1, but not AVX(2), what is the fastest way to load a packed memory layout like this (all 32-bit integers): a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3 into four vectors a, b, c, d? a: {a0, a1, a2, a3} b: {b0, b1, b2, b3} c: {c0, c1, c2, c3} d: {d0, d1, d2, d3} I'm not sure whether this is relevant or not, but in my actual application I have 16 vectors, and as such a0 and a1 are 16*4 bytes apart in memory. Answer 1: What you need here is 4 loads followed by a 4x4 …
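The 4-loads-plus-4x4-transpose the answer starts to describe can be sketched with SSE2 unpacks (8 shuffles total); a stride parameter covers the asker's real layout, where consecutive groups are 16 elements apart (function name is mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* De-interleave a0 b0 c0 d0 | a1 b1 c1 d1 | ... into a, b, c, d.
 * `stride` is the distance in 32-bit elements between groups
 * (4 for the packed layout above, 16 in the asker's application). */
static inline void gather4x4(const int32_t *p, int stride,
                             __m128i *a, __m128i *b, __m128i *c, __m128i *d) {
    __m128i r0 = _mm_loadu_si128((const __m128i *)(p + 0 * stride));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(p + 1 * stride));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(p + 2 * stride));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(p + 3 * stride));
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);  /* a0 a1 b0 b1 */
    __m128i t1 = _mm_unpacklo_epi32(r2, r3);  /* a2 a3 b2 b3 */
    __m128i t2 = _mm_unpackhi_epi32(r0, r1);  /* c0 c1 d0 d1 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);  /* c2 c3 d2 d3 */
    *a = _mm_unpacklo_epi64(t0, t1);          /* a0 a1 a2 a3 */
    *b = _mm_unpackhi_epi64(t0, t1);          /* b0 b1 b2 b3 */
    *c = _mm_unpacklo_epi64(t2, t3);          /* c0 c1 c2 c3 */
    *d = _mm_unpackhi_epi64(t2, t3);          /* d0 d1 d2 d3 */
}
```

For floats the same structure is packaged as the _MM_TRANSPOSE4_PS macro in xmmintrin.h.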