simd

Does GPGPU programming only allow the execution of SIMD instructions?

浪子不回头ぞ submitted on 2020-01-01 17:03:48

Question: Does GPGPU programming only allow the execution of SIMD instructions? If so, it must be a tedious task to rewrite an algorithm that was designed to run on a general-purpose CPU so that it runs on a GPU. Also, is there a pattern in algorithms that can be converted to a SIMD architecture?

Answer 1: Well, it's not quite exact that GPGPU only supports SIMD execution. Many GPUs have some non-SIMD components. But, overall, to take full advantage of a GPU you need to be running SIMD code. However, you are NOT …

Why do processors with only AVX out-perform AVX2 processors for many SIMD algorithms?

大憨熊 submitted on 2020-01-01 11:33:51

Question: I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why. By "improvement" I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.

Answer 1: On an AVX processor, the upper half of the 256-bit registers and floating-point units is powered down by the CPU when not …

Shuffle elements of __m256i vector

余生长醉 submitted on 2020-01-01 10:16:42

Question: I want to shuffle the elements of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, which does something like this, but it doesn't perform a cross-lane shuffle. How can I do it using AVX2 instructions?

Answer 1: There is a way to emulate this operation, but it is not very beautiful:

    const __m256i K0 = _mm256_setr_epi8(
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, …
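For reference, the operation being emulated is straightforward to state in scalar form. This sketch (our own illustration; the function name is not from the question) shows the semantics a cross-lane byte shuffle must have, including the zeroing convention that _mm256_shuffle_epi8 applies when an index has its high bit set:

```cpp
#include <array>
#include <cstdint>

// Scalar reference for a full cross-lane byte shuffle of a 32-byte
// vector: each output byte may come from anywhere in the source,
// out[i] = src[idx[i] & 31], and an index with its high bit set
// zeroes the output byte (matching _mm256_shuffle_epi8's convention).
std::array<uint8_t, 32> shuffle32(const std::array<uint8_t, 32>& src,
                                  const std::array<uint8_t, 32>& idx) {
    std::array<uint8_t, 32> out{};
    for (int i = 0; i < 32; ++i)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 31];
    return out;
}
```

Note that this freely crosses the 128-bit lane boundary, which is exactly what _mm256_shuffle_epi8 alone cannot do.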

Optimizing Array Compaction

倖福魔咒の submitted on 2020-01-01 01:16:12

Question: Let's say I have an array

    k = [1 2 0 0 5 4 0]

I can compute a mask as follows:

    m = k > 0 = [1 1 0 0 1 1 0]

Using only the mask m and the following operations

    Shift left / right
    And / Or
    Add / Subtract / Multiply

I can compact k into the following:

    [1 2 5 4]

Here's how I currently do it (MATLAB pseudocode):

    function out = compact( in )
        d = in
        for i = 1:size(in, 2)   % do (# of items in in) passes
            m = d > 0
            % shift left, pad with 0 on the right
            ml = [m(2:end) 0]   % shift
            dl = [d(2:end) 0]   % shift
            % if the data …
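Whatever sequence of shifts and mask arithmetic is used, the result the passes must reproduce is just a stable filter of the elements selected by the mask. A plain scalar reference (our own sketch, not the masked formulation from the question) is:

```cpp
#include <vector>

// Scalar reference for what the mask-and-shift passes compute:
// keep the elements selected by the mask m = k > 0, in order.
std::vector<int> compact_ref(const std::vector<int>& k) {
    std::vector<int> out;
    for (int v : k)
        if (v > 0)          // the mask m = k > 0
            out.push_back(v);
    return out;
}
```

A correct SIMD compaction must agree with this element-for-element; it is also a handy oracle for testing the vectorized version.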

CPU SIMD vs GPU SIMD?

若如初见. submitted on 2019-12-31 11:42:43

Question: GPUs use the SIMD paradigm: the same portion of code is executed in parallel and applied to various elements of a data set. However, CPUs also use SIMD and provide instruction-level parallelism. For example, as far as I know, SSE-like instructions process data elements in parallel. While the SIMD paradigm seems to be used differently on GPUs and CPUs, do GPUs have more SIMD power than CPUs? In which way are the parallel computational capabilities of a CPU 'weaker' …

How to convert 32-bit float to 8-bit signed char?

馋奶兔 submitted on 2019-12-30 22:53:08

Question: What I want to do is:

    Multiply the input floating-point numbers by a fixed factor.
    Convert them to 8-bit signed char.

Note that most of the inputs have a small absolute range of values, like [-6, 6], so the fixed factor can map them to [-127, 127]. I work with the AVX2 instruction set only, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. :( I also wrote some code that converts 32-bit float to 16-bit int, …
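Element-wise, the target conversion is small enough to pin down exactly in scalar form. This sketch (our own; the name and round-to-nearest choice are assumptions, not from the question) is what any packed AVX2 version built from _mm256_packs_epi16-style saturation should match per element:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scalar reference for the conversion described above: scale by a
// fixed factor, round to nearest, and saturate to [-127, 127].
int8_t scale_to_i8(float x, float factor) {
    float s = x * factor;
    s = std::min(std::max(s, -127.0f), 127.0f);  // saturate
    return static_cast<int8_t>(std::lrintf(s));  // round to nearest
}
```

Having this oracle makes it easy to verify whatever combination of pack/permute intrinsics is ultimately used.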

Fast SSE low precision exponential using double precision operations

与世无争的帅哥 submitted on 2019-12-30 09:07:35

Question: I am looking for a fast, low-precision (~1e-3) SSE exponential function. I came across this great answer:

    /* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
    __m128 FastExpSse (__m128 x)
    {
        __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
        __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
        __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
        return _mm_castsi128_ps (t);
    }

Based on the work of Nicol N. Schraudolph: N. N. Schraudolph, "A fast, …

Logarithm with SSE, or switch to FPU?

∥☆過路亽.° submitted on 2019-12-30 08:23:21

Question: I'm doing some statistics calculations. I need them to be fast, so I rewrote most of the code to use SSE. I'm pretty new to it, so I was wondering what the right approach here is. To my knowledge, there is no log2 or ln function in SSE, at least not up to 4.1, which is the latest version supported by the hardware I use. Is it better to: extract 4 floats and do FPU calculations on them to determine the entropy; I won't need to load any of those values back into SSE registers, just sum them up to …
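For concreteness, the scalar/FPU side of that trade-off is a very small loop. This is our own sketch of the entropy sum being discussed (names and the zero-bin convention are assumptions, not from the question):

```cpp
#include <cmath>

// The "extract the floats and use the FPU" option in scalar form:
// Shannon entropy H = -sum p_i * log2(p_i) over a probability
// distribution, skipping zero bins (0 * log2(0) is taken as 0).
double entropy_bits(const float* p, int n) {
    double h = 0.0;
    for (int i = 0; i < n; ++i)
        if (p[i] > 0.0f)
            h -= static_cast<double>(p[i]) *
                 std::log2(static_cast<double>(p[i]));
    return h;
}
```

Benchmarking this against an SSE version with a polynomial log2 approximation is the only reliable way to settle which side wins on given hardware.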

Are there SIMD (SSE / AVX) instructions in the x86-compatible Intel Xeon Phi accelerators?

好久不见. submitted on 2019-12-30 06:24:56

Question: Are there SIMD (SSE / AVX) instructions in the x86-compatible MIC accelerators, Intel Xeon Phi? http://en.wikipedia.org/wiki/Xeon_Phi

Answer 1: Yes, the current generation of Intel Xeon Phi co-processors (codename "Knights Corner", abbreviated KNC) supports a 512-bit SIMD instruction set called "Intel® Initial Many Core Instructions" (abbreviated Intel® IMCI). Intel IMCI is not compatible with, and is not equivalent to, the SSE, AVX, AVX2, or AVX-512 ISAs. However, it has been officially announced that the next planned …

Is NOT missing from SSE, AVX?

本秂侑毒 submitted on 2019-12-30 06:18:32

Question: Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector. If so, is there a better way of emulating it than PXOR with a vector of all 1s? It's quite annoying, since I need to set up a vector of all 1s to use that approach.

Answer 1: For cases such as this it can be instructive to see what a compiler generates. E.g. for the following function:

    #include <immintrin.h>

    __m256i test(const __m256i v)
    {
        return ~v;
    }

both gcc and …
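The identity behind the PXOR approach is simply that XOR against all ones flips every bit. Shown on a single scalar 64-bit lane (our own illustration, not the compiler's output):

```cpp
#include <cstdint>

// What PXOR with an all-ones vector computes, per 64-bit lane:
// XOR against all ones flips every bit, i.e. bitwise NOT.
uint64_t not_via_xor(uint64_t v) {
    const uint64_t ones = ~uint64_t{0};  // the "vector of all 1s"
    return v ^ ones;                     // equals ~v
}
```

The all-ones constant itself is cheap to materialize in SIMD registers, e.g. by comparing a register with itself for equality, so the setup cost the question worries about is usually a single instruction off any critical path.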