simd | 易学教程

MSVC /arch:[instruction set] - SSE3, AVX, AVX2

Question: Here is an example of a class that shows the supported instruction sets: https://msdn.microsoft.com/en-us/library/hskdteyh.aspx I want to write three different implementations of a single function, each using a different instruction set. But because of the /ARCH:AVX2 flag, for example, the app will never run on anything but 4th-generation or newer Intel processors, so the runtime check is pointless. So the question is: what exactly does this flag do? Enables support or enables compiler
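The usual way to ship several implementations in one binary is to detect CPU features at run time and dispatch through a function pointer. The sketch below assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available; on MSVC the `__cpuid`-based class from the linked page would play that role. The names `sum`, `sum_scalar`, `sum_sse3`, `sum_avx2` and `choose_sum` are hypothetical, and the three kernels are plain loops standing in for real intrinsic code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernels. In a real build each body would use intrinsics for
// its instruction set; plain loops stand in for them in this sketch.
long long sum_scalar(const int* v, std::size_t n) {
    long long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}
long long sum_sse3(const int* v, std::size_t n) { return sum_scalar(v, n); }
long long sum_avx2(const int* v, std::size_t n) { return sum_scalar(v, n); }

using SumFn = long long (*)(const int*, std::size_t);

// Pick the widest implementation that the running CPU reports as supported.
SumFn choose_sum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
    if (__builtin_cpu_supports("sse3")) return sum_sse3;
#endif
    // Other compilers or architectures: fall back to the baseline version.
    return sum_scalar;
}

long long sum(const int* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    static const SumFn impl = choose_sum();  // resolved once, on first call
    return impl(v, n);
}
```

With this layout only the individual kernels need a wider instruction set; the rest of the program stays at the baseline so it can still start on older processors.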

Fast byte-wise replace if

Question: I have a function that copies binary data from one area to another, but only if the bytes differ from a specific value. Here is a code sample:

    void copy_if(char* src, char* dest, size_t size, char ignore)
    {
        for (size_t i = 0; i < size; ++i)
        {
            if (src[i] != ignore) dest[i] = src[i];
        }
    }

The problem is that this is too slow for my current needs. Is there a faster way to obtain the same result? Update: Based on the answers, I tried two new implementations: void copy_if_vectorized(const
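One possible vectorized variant, not necessarily the one the truncated update goes on to show, replaces the per-byte branch with a compare-and-blend over 16 bytes at a time. The sketch below uses SSE2 only and falls back to the scalar loop when `__SSE2__` is not defined; the name `copy_if_sse2` is hypothetical. Unlike the original loop, it rewrites every destination byte (unchanged bytes are stored back with their old value), which matters if another thread touches `dest` concurrently.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

void copy_if_sse2(const char* src, char* dest, std::size_t size, char ignore) {
    assert((src != nullptr && dest != nullptr) || size == 0);
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i skip = _mm_set1_epi8(ignore);
    for (; i + 16 <= size; i += 16) {
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dest + i));
        // 0xFF in every byte where src equals the ignored value.
        __m128i keep_dest = _mm_cmpeq_epi8(s, skip);
        // Blend: take dest where the mask is set, src everywhere else.
        __m128i out = _mm_or_si128(_mm_and_si128(keep_dest, d),
                                   _mm_andnot_si128(keep_dest, s));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i), out);
    }
#endif
    // Scalar tail (or the whole range when SSE2 is unavailable).
    for (; i < size; ++i) {
        if (src[i] != ignore) dest[i] = src[i];
    }
}
```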

Why C# is twice as slow as C++ even though the generated machine code is nearly identical?

Question: This code was generated by the .NET Core 3.0 JIT for my manually vectorized C# code:

    00007FFE6C7D2103 vmovdqu xmm5,xmmword ptr [rcx]
    00007FFE6C7D2107 vmovdqu xmm6,xmmword ptr [rcx+10h]
    00007FFE6C7D210C vmovdqu xmm7,xmmword ptr [rcx+20h]
    00007FFE6C7D2111 vmovdqu xmm8,xmmword ptr [rcx+30h]
    00007FFE6C7D2116 vpand xmm9,xmm5,xmm0
    00007FFE6C7D211A vpand xmm10,xmm6,xmm0
    00007FFE6C7D211E vpackusdw xmm9,xmm9,xmm10
    00007FFE6C7D2123 vpslldq xmm9,xmm9,1
    00007FFE6C7D2129 vpand xmm10,xmm5,xmm1

How to create a 8 bit mask from lsb of __m64 value?

Question: I have a use case where I have an array of bits, with each bit represented as an 8-bit integer, for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only the LSB of each value. I know that I can create a mask with the int _mm_movemask_pi8 (__m64 a) function, but this intrinsic takes only the MSB of each byte, not the LSB. Is there a similar intrinsic or an efficient method to extract the LSBs into a single 8-bit integer? Answer 1: There is no direct way to do it, but obviously you can
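One way to do this without any vector intrinsic, and not necessarily what the truncated answer goes on to propose, is a 64-bit multiply that gathers the eight low bits into the top byte. The sketch below assumes a little-endian target, so that `bytes[i]` ends up in bit `i` of the result; the name `pack_lsbs` is hypothetical. An SSE-style alternative would be to shift each byte's low bit into the top position before calling a movemask intrinsic.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Packs the least significant bit of each of 8 bytes into one 8-bit value.
// Assumes little-endian byte order: bytes[i] becomes bit i of the result.
std::uint8_t pack_lsbs(const std::uint8_t bytes[8]) {
    assert(bytes != nullptr);
    std::uint64_t x;
    std::memcpy(&x, bytes, sizeof x);   // safe unaligned 8-byte load
    x &= 0x0101010101010101ULL;         // keep only the LSB of every byte
    // Byte i holds its bit at position 8*i. The multiplier has bits at 7*k
    // (k = 1..8), so each input bit lands exactly once in bits 56..63 and
    // no two partial products collide (no carries).
    return static_cast<std::uint8_t>((x * 0x0102040810204080ULL) >> 56);
}
```

For the array from the question, `{0,1,0,1,0,1,0,1}`, the bits at positions 1, 3, 5 and 7 are set, so the function returns `0xAA` on a little-endian machine.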

Dereference pointers in XMM register (gather)

Question: If I have some pointer or pointer-like values packed into an SSE or AVX register, is there any particularly efficient way to dereference them into another such register? ("Particularly efficient" meaning "more efficient than just using memory for the values".) Is there any way to dereference them all without writing an intermediate copy of the register out to memory? Edit for clarification: that means, assuming 32-bit pointers and SSE, to index into four arbitrary memory areas at once with

Does anybody know how to use Neon intrinsics uint8x8_t vclt_s8 (int8x8_t, int8x8_t)

Question: I want to compare two int8x8_t values. From http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html we can get the description of vclt_s8, but it does not give much detail: `uint8x8_t vclt_s8 (int8x8_t, int8x8_t)` Form of expected instruction(s): vcgt.s8 d0, d0, d0. The return value is a uint8x8_t, which confuses me, because I cannot write if(vclt_s8(a, b)) to decide whether the first is smaller. Suppose we have two int8x8_t values, int8x8_t a and int8x8_t b: how do we know whether a is smaller? Answer 1: You may find
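The comparison is lane-wise: each of the eight result bytes is 0xFF where `a < b` held for that lane and 0x00 where it did not, so a single yes/no answer needs a reduction over the lanes. The sketch below is one way to do that, under the assumption that "a is smaller" means either all lanes or any lane; the helper names `less_than_mask`, `all_less` and `any_less` are hypothetical. It uses NEON when `__ARM_NEON` is defined and an equivalent scalar loop elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Returns the 8 lane results of (a < b) as one 64-bit value:
// every byte is 0xFF where the comparison held and 0x00 where it did not.
std::uint64_t less_than_mask(const std::int8_t a[8], const std::int8_t b[8]) {
    assert(a != nullptr && b != nullptr);
#if defined(__ARM_NEON)
    uint8x8_t m = vclt_s8(vld1_s8(a), vld1_s8(b));
    return vget_lane_u64(vreinterpret_u64_u8(m), 0);
#else
    // Scalar model of the same lane-wise comparison.
    std::uint64_t m = 0;
    for (int i = 0; i < 8; ++i) {
        if (a[i] < b[i]) m |= 0xFFULL << (8 * i);
    }
    return m;
#endif
}

// True only if every lane of a is smaller than the matching lane of b.
bool all_less(const std::int8_t a[8], const std::int8_t b[8]) {
    return less_than_mask(a, b) == ~0ULL;
}

// True if at least one lane of a is smaller than the matching lane of b.
bool any_less(const std::int8_t a[8], const std::int8_t b[8]) {
    return less_than_mask(a, b) != 0;
}
```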

Shift right every DW in a __m128i by a different amount

Question: I want to shift right every element of a __m128i register by a different amount. I know this is possible by multiplication if we want to shift left, like below: __m128i mul_constant = _mm_set_epi32(8, 4, 2, 1); __m128i left_vshift = _mm_mullo_epi32(R, mul_constant); But what is the solution if we want to shift right? Answer 1: I finally did it like below: Shifting every byte by a different amount to the left and then a 32-bit right shift by 3 gave me what I wanted. R = _mm_mullo_epi32(R, _mm_set
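The idea in the answer can be modeled in scalar code: multiplying lane `i` by 2^(3 - s_i) and then shifting every lane right by 3 is a right shift by s_i, as long as the multiplication does not overflow 32 bits. The sketch below is a scalar model with hypothetical constants (the answer's own constants are cut off), where array index `i` is shifted right by `i`; `Lanes` and `shift_right_by_lane_index` are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;

// Scalar model of "multiply per lane, then one uniform right shift by 3".
// Element i ends up shifted right by i bits.
Lanes shift_right_by_lane_index(Lanes r) {
    const Lanes mul = {8, 4, 2, 1};  // 2^(3 - i): left shift by 3, 2, 1, 0
    for (std::size_t i = 0; i < 4; ++i) {
        // The low 32 bits of the product are kept (as with a "mullo"), so
        // the top 3 bits of an input would be lost; require 29-bit inputs.
        assert(r[i] < (1u << 29));
        r[i] = (r[i] * mul[i]) >> 3;
    }
    return r;
}
```

The precondition is the cost of the trick: inputs that use the upper bits need a different approach, such as a per-lane variable shift instruction where the target provides one.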

Is there any instructions sets support MIMD arch?

Question: I already know that the SIMD instruction sets include SSE1 to SSE5, but I have not found much discussion of instruction sets that support the MIMD architecture. In C++ code, we can use intrinsics to write "SIMD running" code. Is there any way to write "MIMD running" code? If MIMD is more powerful than SIMD, it would be better to write C++ code that supports MIMD. Is my thinking correct? Answer 1: The Wikipedia page on Flynn's taxonomy describes MIMD as: Multiple autonomous processors simultaneously executing different instructions

Is possible to address the output SIMD register by using an input register

Question: Is it possible to use the scalar values of an input vector to index the output vector? I am trying to implement the following function in SIMD, but I cannot find a solution.

    void shuffle(unsigned char * a,   // input a
                 unsigned char * r) { // output r
        for (int i = 0; i < 16; i++) r[i] = 0;
        for (int i = 0; i < 16; i++) r[a[i] % 16] = 1;
    }

An example input / output vector would look like this:

    unsigned char a[] = {0, 0, 0, 10, 0, 0, 0, 2, 0, 0, 0, 0, 3, 1, 0, 0 };
    ... do SIMD magic
    // 0 1 2 3 4 5 6 7 8 9 10 11 12
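SSE has no scatter into a register, but the function can be turned around: output lane `j` is 1 exactly when some `a[i] % 16` equals `j`. That form needs only a broadcast, a compare against the lane numbers, and an OR per input byte. The sketch below is one such formulation using SSE2, with a scalar fallback when `__SSE2__` is not defined; the name `mark_indices` is hypothetical and no claim is made that this beats the scalar loop.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// r[j] = 1 if any a[i] % 16 == j, otherwise 0 (same result as the loop above).
void mark_indices(const unsigned char a[16], unsigned char r[16]) {
    assert(a != nullptr && r != nullptr);
#if defined(__SSE2__)
    const __m128i lane_ids = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                           8, 9, 10, 11, 12, 13, 14, 15);
    __m128i hit = _mm_setzero_si128();
    for (int i = 0; i < 16; ++i) {
        // Broadcast one index and mark the single lane whose number matches.
        __m128i idx = _mm_set1_epi8(static_cast<char>(a[i] % 16));
        hit = _mm_or_si128(hit, _mm_cmpeq_epi8(lane_ids, idx));
    }
    // The compare yields 0xFF per matching lane; reduce that to 1.
    hit = _mm_and_si128(hit, _mm_set1_epi8(1));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(r), hit);
#else
    for (int j = 0; j < 16; ++j) r[j] = 0;
    for (int i = 0; i < 16; ++i) r[a[i] % 16] = 1;
#endif
}
```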

how to break from a loop when using sse intrinsics?

Question:

    __m128* pSrc1 = (__m128*) string;
    __m128 m0 = _mm_set_ps1(0); //null character
    while(1) {
        __m128 result = _mm_cmpeq_ss(*pSrc1, m0);
        //if character is \0 then break
        //do some stuff here
        pSrc1++;
    }

I have a string whose length can be a multiple of 16. How do I break out of the loop if _mm_cmpeq_ss returns equal? Answer 1: If you're trying to break out of the loop when you first encounter a \0 then you'll need to do something like this: __m128i* pSrc1 = (__m128i *)string; // init pointer to
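A common pattern for this, offered here as a sketch rather than as the continuation of the truncated answer, is to compare 16 bytes against zero with an integer byte compare, collapse the result into a 16-bit mask with `_mm_movemask_epi8`, and leave the loop when the mask is non-zero. The function name `find_nul` is hypothetical, the buffer length is assumed to be a multiple of 16 as in the question, and a scalar fallback is used when `__SSE2__` is not defined.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Returns the index of the first '\0' in buf, or len if there is none.
// Assumes len is a multiple of 16.
std::size_t find_nul(const char* buf, std::size_t len) {
    assert(buf != nullptr && len % 16 == 0);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < len; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
        // One mask bit per byte: bit j is set when byte j of the chunk is 0.
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0) {
            std::size_t j = 0;
            while (((mask >> j) & 1) == 0) ++j;  // lowest set bit = first NUL
            return i + j;                        // this is the "break"
        }
        // No NUL in this chunk: process its 16 bytes here.
    }
    return len;
#else
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] == '\0') return i;
    }
    return len;
#endif
}
```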