SIMD

_mm_cvtsd_f64 analogue for the higher-order floating point element

Submitted by 匆匆过客 on 2019-12-20 03:38:17
Question: I'm playing around with SIMD and wonder why there is no analogue to _mm_cvtsd_f64 for extracting the higher-order double from a __m128d. GCC 4.6+ has an extension which achieves this in a nice way: __m128d a = ...; double d1 = a[0]; double d2 = a[1]; But on older GCC (e.g. 4.4) the only way I managed to get this was to define my own analogue function using __builtin_ia32_vec_ext_v2df, i.e.: extern __inline double __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm

SSE instructions to check if a byte array is all zeroes in C#

Submitted by 自闭症网瘾萝莉.ら on 2019-12-20 02:34:48
Question: Suppose I have a byte[] and want to check whether all bytes are zero. A for loop is the obvious way to do it, and LINQ's All() is a fancy way to do it, but highest performance is critical. How can I use Mono.Simd to speed up checking whether a byte array is full of zeroes? I am looking for a cutting-edge approach, not merely a correct solution. Answer 1: The best code is presented below. Other methods and time measurements are available in the full source. static unsafe bool BySimdUnrolled (byte[] data) { fixed (byte* bytes =

Branching on constexpr evaluation / overloading on constexpr

Submitted by 可紊 on 2019-12-19 08:49:30
Question: The setup: I have a function that uses SIMD intrinsics and would like to use it inside some constexpr functions. For that, I need to make it constexpr. However, the SIMD intrinsics are not marked constexpr, and the compiler's constant evaluator cannot handle them. I tried replacing the SIMD intrinsics with a C++ constexpr implementation that does the same thing. The function became 3.5x slower at run time, but I was able to use it at compile time (yay?). The problem: How can I use this

Relationship between SSE vectorization and Memory alignment

Submitted by ↘锁芯ラ on 2019-12-19 08:14:08
Question: Why do we need aligned memory for SSE/AVX? One answer I often get is that an aligned memory load is much faster than an unaligned one. Then why is an aligned memory load so much faster than an unaligned one? Answer 1: This is not specific to SSE (or even to x86). On most architectures, loads and stores need to be naturally aligned; otherwise they either (a) generate an exception or (b) need two or more cycles plus some fix-up in order to handle the misaligned load/store transparently.

What is meant by “fixing up” floats?

Submitted by 情到浓时终转凉″ on 2019-12-19 07:38:26
Question: I was looking through the AVX-512 instruction set and noticed a set of fixup instructions. Some examples: _mm512_fixupimm_pd, _mm512_mask_fixupimm_pd, _mm512_maskz_fixupimm_pd, _mm512_fixupimm_round_pd, _mm512_mask_fixupimm_round_pd, _mm512_maskz_fixupimm_round_pd. What is meant here by "fixing up"? Answer 1: That's a great question. Intel's answer (my bold) is here: This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that

Implicit SIMD (SSE/AVX) broadcasts with GCC

Submitted by 北战南征 on 2019-12-19 07:29:13
Question: I have managed to convert most of my SIMD code to use the vector extensions of GCC. However, I have not found a good solution for doing a broadcast such as __m256 areg0 = _mm256_broadcast_ss(&a[i]); What I want to write is __m256 areg0 = a[i]; If you see my answer at Multiplying vector by constant using SSE, I managed to get broadcasts working with another SIMD register. The following works: __m256 x,y; y = x + 3.14159f; // broadcast x + 3.14159 y = 3.14159f*x; // broadcast 3.14159*x but this won't work

Using SSE in C#

Submitted by 喜你入骨 on 2019-12-19 03:44:07
Question: I'm currently coding an application in C# which could benefit a great deal from using SSE, as a relatively small piece of code causes 90-95% of the execution time. The code itself is also perfect for SSE (as it's matrix- and vector-based), so I went ahead and started using Mono.Simd, and even though this made a significant difference in execution time, it still isn't enough. The problem with Mono.Simd is that it only has very old SSE instructions (mainly from SSE1 and SSE2, I believe), which

Can counting byte matches between two strings be optimized using SIMD?

Submitted by 随声附和 on 2019-12-19 00:38:52
Question: Profiling suggests that this function is a real bottleneck for my application: static inline int countEqualChars(const char* string1, const char* string2, int size) { int r = 0; for (int j = 0; j < size; ++j) { if (string1[j] == string2[j]) { ++r; } } return r; } Even with -O3 and -march=native, G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert on SSE and friends, but I think that comparing more than one character at once should be

OpenCL distribution

Submitted by 耗尽温柔 on 2019-12-18 16:55:17
Question: I'm currently developing an OpenCL application for a very heterogeneous set of computers (using JavaCL, to be specific). In order to maximize performance I want to use a GPU if one is available, and otherwise fall back to the CPU and use SIMD instructions. My plan is to implement the OpenCL code using vector types, because my understanding is that this allows CPUs to vectorize the instructions and use SIMD instructions. My question, however, is which OpenCL implementation to use. E.g