simd

Does GPGPU programming only allow the execution of SIMD instructions?

浪子不回头ぞ submitted on 2020-01-01 17:03:48

Question: Does GPGPU programming only allow the execution of SIMD instructions? If so, it must be a tedious task to rewrite an algorithm that was designed to run on a general-purpose CPU so that it runs on a GPU. Also, is there a pattern in algorithms that can be converted to a SIMD architecture?

Answer 1: Well, it's not quite exact that GPGPU only supports SIMD execution. Many GPUs have some non-SIMD components. But, overall, to take full advantage of a GPU you need to be running SIMD code. However, you are NOT …

Why do processors with only AVX out-perform AVX2 processors for many SIMD algorithms?

大憨熊 submitted on 2020-01-01 11:33:51

Question: I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why. By "improvement" I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.

Answer 1: On an AVX processor, the upper half of the 256-bit registers and floating-point units is powered down by the CPU when not …

Shuffle elements of __m256i vector

余生长醉 submitted on 2020-01-01 10:16:42

Question: I want to shuffle the elements of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, which does something like this, but it doesn't perform a cross-lane shuffle. How can I do it using AVX2 instructions?

Answer 1: There is a way to emulate this operation, but it is not very beautiful:

    const __m256i K0 = _mm256_setr_epi8(
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, …
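For reference, the operation being emulated is straightforward to state in scalar form. This sketch (our own illustration; the function name is not from the question) shows the semantics a cross-lane byte shuffle must have, including the zeroing convention that _mm256_shuffle_epi8 applies when an index has its high bit set:

```cpp
#include <array>
#include <cstdint>

// Scalar reference for a full cross-lane byte shuffle of a 32-byte
// vector: each output byte may come from anywhere in the source,
// out[i] = src[idx[i] & 31], and an index with its high bit set
// zeroes the output byte (matching _mm256_shuffle_epi8's convention).
std::array<uint8_t, 32> shuffle32(const std::array<uint8_t, 32>& src,
                                  const std::array<uint8_t, 32>& idx) {
    std::array<uint8_t, 32> out{};
    for (int i = 0; i < 32; ++i)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 31];
    return out;
}
```

Note that this freely crosses the 128-bit lane boundary, which is exactly what _mm256_shuffle_epi8 alone cannot do.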

Optimizing Array Compaction

倖福魔咒の submitted on 2020-01-01 01:16:12

Question: Let's say I have an array

    k = [1 2 0 0 5 4 0]

I can compute a mask as follows:

    m = k > 0 = [1 1 0 0 1 1 0]

Using only the mask m and the following operations

    Shift left / right
    And / Or
    Add / Subtract / Multiply

I can compact k into the following:

    [1 2 5 4]

Here's how I currently do it (MATLAB pseudocode):

    function out = compact( in )
        d = in
        for i = 1:size(in, 2)   % do (# of items in in) passes
            m = d > 0
            % shift left, pad with 0 on the right
            ml = [m(2:end) 0]   % shift
            dl = [d(2:end) 0]   % shift
            % if the data …
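Whatever sequence of shifts and mask arithmetic is used, the result the passes must reproduce is just a stable filter of the elements selected by the mask. A plain scalar reference (our own sketch, not the masked formulation from the question) is:

```cpp
#include <vector>

// Scalar reference for what the mask-and-shift passes compute:
// keep the elements selected by the mask m = k > 0, in order.
std::vector<int> compact_ref(const std::vector<int>& k) {
    std::vector<int> out;
    for (int v : k)
        if (v > 0)          // the mask m = k > 0
            out.push_back(v);
    return out;
}
```

A correct SIMD compaction must agree with this element-for-element; it is also a handy oracle for testing the vectorized version.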

CPU SIMD vs GPU SIMD?

若如初见. submitted on 2019-12-31 11:42:43

Question: GPUs use the SIMD paradigm: the same portion of code is executed in parallel and applied to various elements of a data set. However, CPUs also use SIMD and provide instruction-level parallelism. For example, as far as I know, SSE-like instructions process data elements in parallel. While the SIMD paradigm seems to be used differently on GPUs and CPUs, do GPUs have more SIMD power than CPUs? In which way are the parallel computational capabilities of a CPU 'weaker' …

How to convert 32-bit float to 8-bit signed char?

馋奶兔 submitted on 2019-12-30 22:53:08

Question: What I want to do is:

    Multiply the input floating-point numbers by a fixed factor.
    Convert them to 8-bit signed char.

Note that most of the inputs have a small absolute range of values, like [-6, 6], so the fixed factor can map them to [-127, 127]. I work with the AVX2 instruction set only, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. :( I also wrote some code that converts 32-bit float to 16-bit int, …
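Element-wise, the target conversion is small enough to pin down exactly in scalar form. This sketch (our own; the name and round-to-nearest choice are assumptions, not from the question) is what any packed AVX2 version built from _mm256_packs_epi16-style saturation should match per element:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scalar reference for the conversion described above: scale by a
// fixed factor, round to nearest, and saturate to [-127, 127].
int8_t scale_to_i8(float x, float factor) {
    float s = x * factor;
    s = std::min(std::max(s, -127.0f), 127.0f);  // saturate
    return static_cast<int8_t>(std::lrintf(s));  // round to nearest
}
```

Having this oracle makes it easy to verify whatever combination of pack/permute intrinsics is ultimately used.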

Fast SSE low precision exponential using double precision operations

与世无争的帅哥 submitted on 2019-12-30 09:07:35

Question: I am looking for a fast, low-precision (~1e-3) SSE exponential function. I came across this great answer:

    /* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
    __m128 FastExpSse (__m128 x)
    {
        __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
        __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
        __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
        return _mm_castsi128_ps (t);
    }

Based on the work of Nicol N. Schraudolph: N. N. Schraudolph, "A fast, …

Logarithm with SSE, or switch to FPU?

∥☆過路亽.° submitted on 2019-12-30 08:23:21

Question: I'm doing some statistics calculations. I need them to be fast, so I rewrote most of the code to use SSE. I'm pretty new to it, so I was wondering what the right approach here is. To my knowledge, there is no log2 or ln function in SSE, at least not up to 4.1, which is the latest version supported by the hardware I use. Is it better to: extract 4 floats and do FPU calculations on them to determine the entropy; I won't need to load any of those values back into SSE registers, just sum them up to …
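For concreteness, the scalar/FPU side of that trade-off is a very small loop. This is our own sketch of the entropy sum being discussed (names and the zero-bin convention are assumptions, not from the question):

```cpp
#include <cmath>

// The "extract the floats and use the FPU" option in scalar form:
// Shannon entropy H = -sum p_i * log2(p_i) over a probability
// distribution, skipping zero bins (0 * log2(0) is taken as 0).
double entropy_bits(const float* p, int n) {
    double h = 0.0;
    for (int i = 0; i < n; ++i)
        if (p[i] > 0.0f)
            h -= static_cast<double>(p[i]) *
                 std::log2(static_cast<double>(p[i]));
    return h;
}
```

Benchmarking this against an SSE version with a polynomial log2 approximation is the only reliable way to settle which side wins on given hardware.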

Are there SIMD (SSE / AVX) instructions in the x86-compatible Intel Xeon Phi accelerators?

好久不见. submitted on 2019-12-30 06:24:56

Question: Are there SIMD (SSE / AVX) instructions in the x86-compatible MIC accelerators, Intel Xeon Phi? http://en.wikipedia.org/wiki/Xeon_Phi

Answer 1: Yes, the current generation of Intel Xeon Phi co-processors (codename "Knights Corner", abbreviated KNC) supports a 512-bit SIMD instruction set called "Intel® Initial Many Core Instructions" (abbreviated Intel® IMCI). Intel IMCI is not compatible with, and is not equivalent to, the SSE, AVX, AVX2, or AVX-512 ISAs. However, it has been officially announced that the next planned …

Is NOT missing from SSE, AVX?

本秂侑毒 submitted on 2019-12-30 06:18:32

Question: Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector. If so, is there a better way of emulating it than PXOR with a vector of all 1s? It's quite annoying, since I need to set up a vector of all 1s to use that approach.

Answer 1: For cases such as this it can be instructive to see what a compiler generates. E.g. for the following function:

    #include <immintrin.h>

    __m256i test(const __m256i v)
    {
        return ~v;
    }

both gcc and …
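The identity behind the PXOR approach is simply that XOR against all ones flips every bit. Shown on a single scalar 64-bit lane (our own illustration, not the compiler's output):

```cpp
#include <cstdint>

// What PXOR with an all-ones vector computes, per 64-bit lane:
// XOR against all ones flips every bit, i.e. bitwise NOT.
uint64_t not_via_xor(uint64_t v) {
    const uint64_t ones = ~uint64_t{0};  // the "vector of all 1s"
    return v ^ ones;                     // equals ~v
}
```

The all-ones constant itself is cheap to materialize in SIMD registers, e.g. by comparing a register with itself for equality, so the setup cost the question worries about is usually a single instruction off any critical path.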