simd

Loop is not vectorized when variable extent is used

拥有回忆 submitted on 2019-12-23 17:13:38
Question: Version A code is not vectorized, while version B code is. How can I make version A vectorize while keeping the variable extents (without using literal extents)? The nested loop performs multiplication with broadcasting, as in Python's numpy library and MATLAB. A description of broadcasting in the numpy library is here. Version A code (no std::vector, no vectorization): this only uses imull (%rsi), %edx in .L169, which is not a SIMD instruction. gcc godbolt #include <iostream> #include <stdint.h>

SSE Loading & Adding

最后都变了- submitted on 2019-12-23 12:52:31
Question: Assume I have two vectors represented by two arrays of type double, each of size 2. I'd like to add corresponding positions. So, given vectors i0 and i1, I'd like to add i0[0] + i1[0] and i0[1] + i1[1]. Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register, and i0[1] and i1[1] in another, and just add each register with itself. My question is, if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and
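As a hedged illustration of the addition described above: `_mm_load_ps` is the single-precision load and takes a `const float*`, so for `double` data the packed-double SSE2 intrinsics are the usual fit. One 128-bit register holds both doubles of a vector, so a single `_mm_add_pd` adds both positions at once. The function name `add2` is hypothetical, and the scalar branch is only a fallback for targets without SSE2.

```cpp
#include <cassert>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: __m128d, _mm_loadu_pd, _mm_add_pd, _mm_storeu_pd
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: out[k] = i0[k] + i1[k] for k = 0, 1.
void add2(const double* i0, const double* i1, double* out) {
    assert(i0 != nullptr && i1 != nullptr && out != nullptr);
#ifdef SKETCH_HAVE_SSE2
    // Unaligned loads: each register holds { v[0], v[1] }.
    __m128d a = _mm_loadu_pd(i0);
    __m128d b = _mm_loadu_pd(i1);
    // One packed add produces { i0[0] + i1[0], i0[1] + i1[1] }.
    _mm_storeu_pd(out, _mm_add_pd(a, b));
#else
    // Scalar fallback for non-SSE2 targets.
    out[0] = i0[0] + i1[0];
    out[1] = i0[1] + i1[1];
#endif
}
```

If the arrays are known to be 16-byte aligned, `_mm_load_pd` and `_mm_store_pd` could replace the unaligned variants.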

How to optimize SIMD transpose function (8x4 => 4x8)?

只愿长相守 submitted on 2019-12-23 12:38:55
Question: I need to optimize the transpose of 8x4 and 4x8 float matrices with AVX. I use Agner Fog's vector class library. The real task is to build a BVH and sum min-max. Transposing is used in the final stage of every loop (the loops are also optimized with multithreading, but there can be a great many tasks). The code currently looks like: void transpose(register Vec4f (&fin)[8], register Vec8f (&mat)[4]) { for (int i = 0;i < 8;i++) { fin[i] = lookup<28>(Vec4i(0, 8, 16, 24) + i, (float *)mat); } } I need variants of optimization. How to
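One possible direction, sketched here with plain SSE intrinsics rather than the vector class library the question uses: treat the 4x8 input as two 4x4 blocks and transpose each in registers with the `_MM_TRANSPOSE4_PS` macro instead of gathering element by element. The function name `transpose_4x8_to_8x4` and the row-major raw-array layout are assumptions for illustration, and this has not been benchmarked against the `lookup` version.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _MM_TRANSPOSE4_PS
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical helper: `in` is 4 rows x 8 columns (row-major, 32 floats),
// `out` is 8 rows x 4 columns (row-major, 32 floats), out[c][r] = in[r][c].
void transpose_4x8_to_8x4(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
#ifdef SKETCH_HAVE_SSE
    // Columns 0..3 form the first 4x4 block, columns 4..7 the second.
    for (int half = 0; half < 2; ++half) {
        __m128 r0 = _mm_loadu_ps(in + 0 * 8 + half * 4);
        __m128 r1 = _mm_loadu_ps(in + 1 * 8 + half * 4);
        __m128 r2 = _mm_loadu_ps(in + 2 * 8 + half * 4);
        __m128 r3 = _mm_loadu_ps(in + 3 * 8 + half * 4);
        // In-register 4x4 transpose: r0 now holds the block's first column.
        _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
        _mm_storeu_ps(out + (half * 4 + 0) * 4, r0);
        _mm_storeu_ps(out + (half * 4 + 1) * 4, r1);
        _mm_storeu_ps(out + (half * 4 + 2) * 4, r2);
        _mm_storeu_ps(out + (half * 4 + 3) * 4, r3);
    }
#else
    // Scalar fallback for targets without SSE.
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 8; ++c)
            out[c * 4 + r] = in[r * 8 + c];
#endif
}
```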

Testing whether AVX register contains some equal integer numbers

放肆的年华 submitted on 2019-12-23 12:15:26
Question: Consider a 256-bit register containing four 64-bit integers. Is it possible in AVX/AVX2 to test efficiently whether some of these integers are equal? E.g.: a) {43, 17, 25, 8}: the result must be false because no two of the four numbers are equal. b) {47, 17, 23, 17}: the result must be true because the number 17 occurs twice in the AVX vector register. I'd like to do this in C++, if possible, but I can drop down to assembly if necessary. Answer 1: With AVX512 (AVX512VL + AVX512CD), you would use

Best way to shuffle 64-bit portions of two __m128i's

我们两清 submitted on 2019-12-23 07:49:54
Question: I have two __m128i values, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 bits of dst, i.e. dst[0:63] = a[64:127] dst[64:127] = b[0:63] Equivalent to: __m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b); or __m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1)); Is there a better way to do this than the first method? The second one is just one instruction, but the
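The two formulations in the question can be checked against each other with a small sketch like the one below. The wrapper `shuffle_hi_lo` and its array-based interface are hypothetical, chosen only so the result is easy to inspect. It says nothing about which variant is faster; it only confirms that both produce dst = { a[64:127], b[0:63] }.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: __m128i, _mm_unpacklo_epi64, _mm_shuffle_pd
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical wrapper: dst[0] = a[1] (upper half of a), dst[1] = b[0].
void shuffle_hi_lo(const std::uint64_t a[2], const std::uint64_t b[2],
                   std::uint64_t dst[2]) {
#ifdef SKETCH_HAVE_SSE2
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));

    // Variant 1: byte-shift a right by 8, then interleave the low halves.
    __m128i v1 = _mm_unpacklo_epi64(_mm_srli_si128(va, 8), vb);

    // Variant 2: one shuffle in the double domain (the casts are free).
    __m128i v2 = _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(va), _mm_castsi128_pd(vb), 1));

    // Both variants should hold identical bits in all 16 bytes.
    assert(_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF);
    (void)v2;  // silence unused warnings when asserts are compiled out

    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v1);
#else
    // Scalar fallback for non-SSE2 targets.
    dst[0] = a[1];
    dst[1] = b[0];
#endif
}
```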

Is there still any development on SIMD in Mono?

丶灬走出姿态 submitted on 2019-12-23 07:46:03
Question: I want to know whether there has been any development on Mono.SIMD (or SIMD support in general inside Mono) since it came out 5(!) years ago. I personally think this was a great step in improving speed for C#. However, I've used it for some time now and I feel that Mono.SIMD is falling behind, as many functions are missing. Some of the problems I'm facing include: the lack of a dot product, which can be implemented in one operation since SSE4.1 (which came out in 2006 and is now

What is the fastest way for adding the vector elements horizontally in odd order?

纵然是瞬间 submitted on 2019-12-23 04:15:28
Question: According to this question, I implemented the horizontal addition, this time 5 by 5 and 7 by 7. It does the job correctly, but it is not fast enough. Can it be made faster? I tried to use hadd and other instructions, but the improvement is limited. For example, when I use _mm256_bsrli_epi128 it is slightly better, but it needs some extra permutation that ruins the benefit because of the lanes. So the question is how it should be implemented to gain more performance. The same story is

How can I count the occurrence of a byte in array using SIMD?

隐身守侯 submitted on 2019-12-23 03:48:12
Question: Given the following input bytes: var vBytes = new Vector<byte>(new byte[] {72, 101, 55, 08, 108, 111, 55, 87, 111, 114, 108, 55, 100, 55, 55, 20}); And the given mask: var mask = new Vector<byte>(55); How can I find the count of byte 55 in the input array? I have tried XORing vBytes with the mask: var xored = Vector.Xor(mask, vBytes); which gives: <127, 82, 0, 91, 91, 88, 0, 96, 88, 69, 91, 0, 83, 0, 0, 35> But I don't know how to get the count from that. For the sake of simplicity let
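For comparison, here is a rough C++ analogue of the counting problem using SSE2 intrinsics. It is not the C# `Vector<T>` API from the question. Instead of XOR, it compares each byte for equality, collapses the 16 comparison results into a 16-bit mask, and counts the set bits. The function name `count_byte` is hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_set1_epi8, _mm_cmpeq_epi8, _mm_movemask_epi8
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: number of bytes in data[0..n) equal to `needle`.
std::size_t count_byte(const std::uint8_t* data, std::size_t n,
                       std::uint8_t needle) {
    assert(data != nullptr || n == 0);
    std::size_t count = 0;
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(needle));
    for (; i + 16 <= n; i += 16) {
        __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // Each matching byte becomes 0xFF, every other byte 0x00.
        __m128i eq = _mm_cmpeq_epi8(chunk, pattern);
        // Collect the top bit of each byte into the low 16 bits of an int.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
        count += std::bitset<16>(mask).count();
    }
#endif
    // Scalar tail (and full fallback on non-SSE2 targets).
    for (; i < n; ++i) count += (data[i] == needle) ? 1u : 0u;
    return count;
}
```

With the 16 input bytes from the question and a needle of 55, this returns 5.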

The impact of goto instruction at intra-warp divergence in CUDA code

ぐ巨炮叔叔 submitted on 2019-12-22 13:58:16
Question: For simple intra-warp thread divergence in CUDA, what I know is that the SM selects a re-convergence point (PC address) and executes instructions in both/multiple paths while disabling the effects of execution for the threads that haven't taken the path. For example, in the piece of code below: if( threadIdx.x < 16 ) { A: // do something. } else { B: // do something else. } C: // rest of code. C is the re-convergence point; the warp scheduler schedules instructions at both A and B, while disabling

vectorize a loop which accesses non-consecutive memory locations

佐手、 submitted on 2019-12-22 11:13:06
Question: I have a loop with this structure (reference: Maxwell Code Example): do z=1,zend do y=1,yend do x=1,xend k=arr(x,y,z) do while(k.ne.0) ix=fooX(k) iy=fooY(k) iz=fooZ(k) x1=x(ix ,iy ,iz) x2=x(ix+1,iy ,iz) x3=x(ix ,iy+1,iz) x4=x(ix+1,iy+1,iz) x5=x(ix ,iy ,iz+1) x6=x(ix+1,iy ,iz+1) x7=x(ix ,iy+1,iz+1) x8=x(ix+1,iy+1,iz+1) y1=y(ix ,iy ,iz) y2=y(ix+1,iy ,iz) y3=y(ix ,iy+1,iz) y4=y(ix+1,iy+1,iz) y5=y(ix ,iy ,iz+1) y6=y(ix+1,iy ,iz+1) y7=y(ix ,iy+1,iz+1) y8=y(ix+1,iy+1,iz+1) z1=z(ix ,iy ,iz) z2=z(ix+1,iy