simd | 易学教程

Loop is not vectorized when variable extent is used

Question: Version A of the code is not vectorized, while version B is. How can I make version A vectorize while keeping the variable extents (without using literal extents)? The nested loop performs multiplication with broadcasting, as in Python's NumPy library and MATLAB. A description of broadcasting in NumPy is here. Version A code (no std::vector, no vectorization): this only uses imull (%rsi), %edx in .L169, which is not a SIMD instruction. gcc godbolt #include <iostream> #include <stdint.h>

SSE Loading & Adding

Question: Assume I have two vectors represented by two arrays of type double, each of size 2. I'd like to add corresponding positions. So given vectors i0 and i1, I'd like to add i0[0] + i1[0] and i0[1] + i1[1]. Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register, and i0[1] and i1[1] in another, and just add each register with itself. My question is, if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and
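
For reference, a minimal C++ sketch of the element-wise addition described in the excerpt, assuming an x86 target with SSE2. Note that _mm_load_ps takes a const float* and loads four floats, so the double-precision (_pd) intrinsics are used here instead; a 128-bit register holds two doubles, so each array fits in one register. The helper name add_pairs is hypothetical and does not come from the question.

```cpp
#include <emmintrin.h>  // SSE2: __m128d, _mm_loadu_pd, _mm_add_pd, _mm_storeu_pd

// Hypothetical helper: out[k] = i0[k] + i1[k] for k = 0, 1.
// Unaligned loads and stores are used so the arrays need no 16-byte alignment.
inline void add_pairs(const double* i0, const double* i1, double* out) {
    __m128d a = _mm_loadu_pd(i0);          // a = {i0[0], i0[1]}
    __m128d b = _mm_loadu_pd(i1);          // b = {i1[0], i1[1]}
    _mm_storeu_pd(out, _mm_add_pd(a, b));  // one packed add covers both lanes
}
```

With this layout no horizontal step is needed: i0[0] and i1[0] sit in the same lane of two different registers, so a single vertical add produces both sums.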

How to optimize SIMD transpose function (8x4 => 4x8)?

Question: I need to optimize the transpose of 8x4 and 4x8 float matrices with AVX. I use Agner Fog's vector class library. The real task is to build a BVH and sum min-max. Transposing is used in the final stage of every loop (the loops are also multi-threaded, but there can be a great many tasks). The code now looks like: void transpose(register Vec4f (&fin)[8], register Vec8f (&mat)[4]) { for (int i = 0; i < 8; i++) { fin[i] = lookup<28>(Vec4i(0, 8, 16, 24) + i, (float *)mat); } } I need ideas for optimizing it. How to

Testing whether AVX register contains some equal integer numbers

Question: Consider a 256-bit register containing four 64-bit integers. Is it possible in AVX/AVX2 to test efficiently whether some of these integers are equal? E.g.: a) {43, 17, 25, 8}: the result must be false because no 2 of the 4 numbers are equal. b) {47, 17, 23, 17}: the result must be true because the number 17 occurs 2 times in the AVX vector register. I'd like to do this in C++, if possible, but I can drop down to assembly if necessary. Answer 1: With AVX512 (AVX512VL + AVX512CD), you would use

Best way to shuffle 64-bit portions of two __m128i's

Question: I have two __m128i values, a and b, that I want to shuffle so that the upper 64 bits of a land in the lower 64 bits of dst and the lower 64 bits of b land in the upper 64 bits of dst, i.e. dst[0:63] = a[64:127]; dst[64:127] = b[0:63]. Equivalent to: __m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b); or __m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1)); Is there a better way to do this than the first method? The second one is just one instruction, but the
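
The two formulations quoted in the excerpt can be written out and compared in a short C++ sketch, assuming an x86 target with SSE2. The function names hi_a_lo_b_unpack and hi_a_lo_b_shuffle are hypothetical labels for the question's first and second method; the sketch only illustrates that both produce the same dst and makes no claim about which is faster.

```cpp
#include <emmintrin.h>  // SSE2 integer and double-precision intrinsics
#include <cstdint>

// First method: shift a right by 8 bytes so its upper half lands in the low
// 64 bits, then interleave the low 64-bit halves of the shifted value and b.
inline __m128i hi_a_lo_b_unpack(__m128i a, __m128i b) {
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
}

// Second method: reinterpret as packed doubles and use one shuffle.
// Immediate 1 selects a's upper element for the low lane (bit 0 set)
// and b's lower element for the high lane (bit 1 clear).
inline __m128i hi_a_lo_b_shuffle(__m128i a, __m128i b) {
    return _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));
}
```

The casts compile to no instructions; they only change the type the compiler sees, which is why the second form is a single shuffle.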

Is there still any development on SIMD in Mono?

Question: I want to know whether there has been any development on Mono.SIMD (or SIMD support in general inside Mono) since it came out 5(!) years ago. I personally think this was a great step in improving speed for C#. However, I've used it for some time now and I feel that Mono.SIMD is falling behind, as lots of functions are missing. Some of the problems I'm facing include: The lack of a dot product, which can be implemented in 1 operation ever since SSE4.1 (which came out in 2006 and is now

What is the fastest way for adding the vector elements horizontally in odd order?

Question: Following this question, I implemented the horizontal addition, this time 5 by 5 and 7 by 7. It does the job correctly but it is not fast enough. Can it be made faster? I tried to use hadd and other instructions, but the improvement is limited. For example, when I use _mm256_bsrli_epi128 it is slightly better, but it needs some extra permutation that ruins the benefit because of the lanes. So the question is how it should be implemented to gain more performance. The same story is

How can I count the occurrence of a byte in array using SIMD?

Question: Given the following input bytes: var vBytes = new Vector<byte>(new byte[] {72, 101, 55, 08, 108, 111, 55, 87, 111, 114, 108, 55, 100, 55, 55, 20}); And the given mask: var mask = new Vector<byte>(55); How can I find the count of byte 55 in the input array? I have tried XORing vBytes with the mask: var xored = Vector.Xor(mask, vBytes); which gives: <127, 82, 0, 91, 91, 88, 0, 96, 88, 69, 91, 0, 83, 0, 0, 35> But I don't know how I can get the count from that. For the sake of simplicity let
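
The question is posed in C# with System.Numerics; as a language-neutral illustration of the same idea, here is a minimal C++ sketch using SSE2 intrinsics, assuming an x86 target. Instead of XOR it compares for equality, turns the result into a 16-bit mask, and counts the set bits. The function name count_byte is hypothetical and this is not presented as the answer given on the original page.

```cpp
#include <emmintrin.h>  // SSE2: _mm_set1_epi8, _mm_cmpeq_epi8, _mm_movemask_epi8
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of elements in data[0..n) equal to value.
inline std::size_t count_byte(const std::uint8_t* data, std::size_t n,
                              std::uint8_t value) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(value));
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // Matching bytes become 0xFF, all others 0x00.
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);
        // One mask bit per byte, taken from each byte's top bit.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
        // Portable bit count: clear the lowest set bit until none remain.
        while (mask != 0) {
            mask &= mask - 1;
            ++count;
        }
    }
    // Scalar tail for the last n % 16 bytes.
    for (; i < n; ++i) {
        count += (data[i] == value) ? 1 : 0;
    }
    return count;
}
```

For the 16 bytes in the excerpt, the XOR result shown there has zeros at five positions, which matches a count of 5 for the value 55.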

The impact of goto instruction at intra-warp divergence in CUDA code

Question: For simple intra-warp thread divergence in CUDA, what I know is that the SM selects a re-convergence point (PC address) and executes instructions in both/multiple paths while disabling the effects of execution for the threads that haven't taken the path. For example, in the piece of code below: if( threadIdx.x < 16 ) { A: // do something. } else { B: // do something else. } C: // rest of code. C is the re-convergence point; the warp scheduler schedules instructions at both A and B, while disabling

vectorize a loop which accesses non-consecutive memory locations

Question: I have a loop with this structure (reference: Maxwell Code Example): do z=1,zend do y=1,yend do x=1,xend k=arr(x,y,z) do while(k.ne.0) ix=fooX(k) iy=fooY(k) iz=fooZ(k) x1=x(ix ,iy ,iz) x2=x(ix+1,iy ,iz) x3=x(ix ,iy+1,iz) x4=x(ix+1,iy+1,iz) x5=x(ix ,iy ,iz+1) x6=x(ix+1,iy ,iz+1) x7=x(ix ,iy+1,iz+1) x8=x(ix+1,iy+1,iz+1) y1=y(ix ,iy ,iz) y2=y(ix+1,iy ,iz) y3=y(ix ,iy+1,iz) y4=y(ix+1,iy+1,iz) y5=y(ix ,iy ,iz+1) y6=y(ix+1,iy ,iz+1) y7=y(ix ,iy+1,iz+1) y8=y(ix+1,iy+1,iz+1) z1=z(ix ,iy ,iz) z2=z(ix+1,iy