simd

MSVC /arch:[instruction set] - SSE3, AVX, AVX2

橙三吉。 submitted on 2019-12-04 18:22:06
Here is an example of a class which shows supported instruction sets: https://msdn.microsoft.com/en-us/library/hskdteyh.aspx I want to write three different implementations of a single function, each of them using a different instruction set. But because of a flag such as /ARCH:AVX2, the app won't ever run anywhere but on 4th+ generation Intel processors, so the check itself becomes pointless. So, the question is: what exactly does this flag do? Does it enable support, or does it enable compiler optimizations that use the given instruction sets? In other words, can I completely remove this flag and keep using …
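For context, a minimal sketch of the usual pattern: each implementation lives in its own translation unit (only the AVX2 one built with /arch:AVX2), and a plain dispatcher picks one at runtime via CPUID. The function names addAVX2/addSSE2 are hypothetical, and the XGETBV/OS-support check is omitted for brevity.

    #include <intrin.h>   // MSVC: __cpuidex
    #include <cstddef>

    // Hypothetical kernels: addAVX2 lives in a .cpp compiled with /arch:AVX2,
    // addSSE2 in one compiled without it.
    void addAVX2(float* dst, const float* src, std::size_t n);
    void addSSE2(float* dst, const float* src, std::size_t n);

    static bool cpuHasAVX2()
    {
        int regs[4];
        __cpuidex(regs, 7, 0);            // structured extended feature flags
        return (regs[1] & (1 << 5)) != 0; // EBX bit 5 = AVX2 (OS support check omitted)
    }

    void add(float* dst, const float* src, std::size_t n)
    {
        if (cpuHasAVX2())
            addAVX2(dst, src, n);   // only reached on AVX2-capable CPUs
        else
            addSSE2(dst, src, n);
    }

The dispatcher itself is compiled without /arch:AVX2, so the binary still starts on older CPUs and only calls the AVX2 path when CPUID says it is safe.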

Does gcc use Intel's SSE 4.2 instructions for text processing if available?

非 Y 不嫁゛ submitted on 2019-12-04 18:15:09
Question: I read here that Intel introduced SSE 4.2 instructions for accelerating string processing. Quote from the article: "The SSE 4.2 instruction set, first implemented in Intel's Core i7, provides string and text processing instructions (STTNI) that utilize SIMD operations for processing character data. Though originally conceived for accelerating string, text, and XML processing, the powerful new capabilities of these instructions are useful outside of these domains, and it is worth revisiting the …"
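For reference, this is the kind of code the STTNI instructions are aimed at; a hand-written sketch (gcc's own string routines do not necessarily look like this) using _mm_cmpistri from SSE4.2, assuming at least 16 readable bytes at p. Build with -msse4.2 on gcc.

    #include <nmmintrin.h>   // SSE4.2: _mm_cmpistri

    // Returns the index (0..15) of the first ',' or ';' in the 16-byte block
    // at p, or 16 if none is found. p is assumed to be a NUL-terminated
    // string; _mm_cmpistri stops comparing at the terminator.
    int findDelimiter16(const char* p)
    {
        const __m128i needles = _mm_setr_epi8(',', ';', 0, 0, 0, 0, 0, 0,
                                              0, 0, 0, 0, 0, 0, 0, 0);
        __m128i chunk = _mm_loadu_si128((const __m128i*)p);
        return _mm_cmpistri(needles, chunk,
                            _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                            _SIDD_LEAST_SIGNIFICANT);
    }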

Realistic deadlock example in CUDA/OpenCL

时光总嘲笑我的痴心妄想 submitted on 2019-12-04 16:50:54
For a tutorial I'm writing, I'm looking for a "realistic" and simple example of a deadlock caused by ignorance of SIMT / SIMD. I came up with this snippet, which seems to be a good example. Any input would be appreciated.

    …
    int x = threadID / 2;
    if (threadID > x) {
        value[threadID] = 42;
        barrier();
    } else {
        value2[threadID/2] = 13;
        barrier();
    }
    result = value[threadID/2] + value2[threadID/2];

I know it is neither proper CUDA C nor OpenCL C. A simple deadlock that is actually easy for a novice CUDA programmer to run into is when one tries to implement a critical section for a single thread, that …

SSE: Difference between _mm_load/store vs. using direct pointer access

三世轮回 submitted on 2019-12-04 16:16:13
Question: Suppose I want to add two buffers and store the result. Both buffers are already allocated with 16-byte alignment. I found two examples of how to do that. The first one uses _mm_load to read the data from the buffer into an SSE register, does the add operation, and stores the result back. Until now I would have done it like that.

    void _add( uint16_t * dst, uint16_t const * src, size_t n )
    {
        for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
        {
            __m128i _s = _mm_load_si128( ( …
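For comparison, a minimal self-contained sketch of the two variants being contrasted (not the poster's exact code), assuming both pointers are 16-byte aligned and n is a multiple of 8:

    #include <emmintrin.h>   // SSE2
    #include <stdint.h>
    #include <stddef.h>

    // Variant 1: explicit aligned load/store intrinsics.
    void add_load_store(uint16_t* dst, const uint16_t* src, size_t n)
    {
        for (size_t i = 0; i < n; i += 8) {
            __m128i s = _mm_load_si128((const __m128i*)(src + i));
            __m128i d = _mm_load_si128((const __m128i*)(dst + i));
            _mm_store_si128((__m128i*)(dst + i), _mm_add_epi16(d, s));
        }
    }

    // Variant 2: dereference __m128i pointers directly; the compiler still
    // emits aligned 128-bit loads and stores for these accesses.
    void add_deref(uint16_t* dst, const uint16_t* src, size_t n)
    {
        __m128i*       d = (__m128i*)dst;
        const __m128i* s = (const __m128i*)src;
        for (size_t i = 0; i < n / 8; ++i)
            d[i] = _mm_add_epi16(d[i], s[i]);
    }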

SIMD and difference between packed and scalar double precision

 ̄綄美尐妖づ submitted on 2019-12-04 16:09:48
Question: I am reading Intel's intrinsics guide while implementing SIMD support. I have a few points of confusion, and my questions are below. The documentation for __m128 _mm_cmpeq_ps (__m128 a, __m128 b) says it is used to compare packed single-precision floating-point values. What does "packed" mean? Do I need to pack my float values somehow before I can use them? For double precision there are intrinsics like _mm_cmpeq_sd, which means compare the "lower" double-precision floating-point elements. What does lower and …
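A small illustrative sketch of the distinction (the variable names are made up): "packed" intrinsics operate on every lane of the register, while the "sd" (scalar double) forms touch only the lowest lane.

    #include <emmintrin.h>   // SSE2 (also pulls in the SSE float intrinsics)

    void packed_vs_scalar(void)
    {
        // "Packed" just means the register holds several elements side by side:
        // an __m128 carries four single-precision floats.
        __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
        __m128 b = _mm_setr_ps(1.0f, 9.0f, 3.0f, 9.0f);
        __m128 eq_ps = _mm_cmpeq_ps(a, b);   // all four lanes compared;
                                             // each result lane is all-ones if equal, all-zeros otherwise

        // An __m128d holds two doubles; the "sd" (scalar) form compares only
        // the lower lane and copies the upper lane of the first operand.
        __m128d x = _mm_setr_pd(1.0, 2.0);
        __m128d y = _mm_setr_pd(1.0, 3.0);
        __m128d eq_sd = _mm_cmpeq_sd(x, y);  // lane 0 compared, lane 1 = x's upper lane

        (void)eq_ps; (void)eq_sd;
    }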

Slower SSE performance on large array sizes

馋奶兔 submitted on 2019-12-04 15:23:01
I am new to SSE programming, so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.

    int ssum(const int *d, unsigned int len)
    {
        static const unsigned int BLOCKSIZE=4;
        unsigned int i,remainder;
        int output;
        __m128i xmm0, accumulator;
        __m128i* src;

        remainder = len%BLOCKSIZE;
        src = (__m128i*)d;
        accumulator = _mm_loadu_si128(src);
        output = 0;
        for(i=BLOCKSIZE;i<len-remainder;i+=BLOCKSIZE){
            xmm0 = _mm_loadu_si128(++src);
            accumulator = _mm_add_epi32 …
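For reference, a complete minimal version of the same idea (not the poster's code), including the horizontal reduction at the end and a scalar tail for the leftover elements:

    #include <emmintrin.h>   // SSE2

    int ssum_sketch(const int* d, unsigned int len)
    {
        __m128i acc = _mm_setzero_si128();
        unsigned int i = 0;

        // Accumulate four 32-bit integers per iteration into four partial sums.
        for (; i + 4 <= len; i += 4)
            acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i*)(d + i)));

        // Horizontal reduction of the four partial sums.
        acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8)); // add upper half onto lower half
        acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4)); // add lane 1 onto lane 0
        int sum = _mm_cvtsi128_si32(acc);

        // Scalar tail for the remaining 0..3 elements.
        for (; i < len; ++i)
            sum += d[i];
        return sum;
    }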

Generate vector code from Haskell?

落爺英雄遲暮 submitted on 2019-12-04 14:48:52
Is it possible to get GHC to produce SIMD code for the various SSE generations? E.g., given a program like this:

    import Data.Array.Vector
    main = print . sumU $ (enumFromToFracU 1 10000000 :: UArr Double)

I can see that the generated code (compiled for 64-bit x86) uses SSE instructions in scalar mode (both the C and asm backends), so addsd rather than addpd. For the types of programs I work on, the use of vector instructions is important for performance. Is there an easy way for a newbie such as myself to get GHC to SIMDize the code using SSE? Yes, it is possible, via the C backend, but it is trial and error. …

Horizontal minimum and position in SSE for unsigned 32-bit integers

时光总嘲笑我的痴心妄想 submitted on 2019-12-04 13:54:06
Question: I am looking for a way to find the minimum and its position in SSE for unsigned 32-bit integers (similar to _mm_minpos_epu16). I know I can find the minimum through a series of _mm_min_epu32 and shuffles/shifts, but that doesn't get me the position. Does anyone have any cool ways of doing this?

Answer 1: There is probably a cleverer method, but for now here's a brute-force approach:

    #include <stdio.h>
    #include <smmintrin.h> // SSE4.1

    int main(void)
    {
        __m128i v = _mm_setr_epi32(42, 1, 43, 2);
        printf …
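Separately from that answer, one possible way to get both the minimum and its lane index; a sketch only, using _mm_min_epu32 plus a compare-and-movemask to recover the position:

    #include <smmintrin.h>   // SSE4.1: _mm_min_epu32

    // Returns the minimum of the four unsigned 32-bit lanes of v and writes
    // its lane index (0..3) to *pos. A sketch, not tuned code.
    unsigned int hmin_pos_epu32(__m128i v, int* pos)
    {
        // Horizontal minimum: fold 4 lanes -> 2 -> 1.
        __m128i m = _mm_min_epu32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
        m = _mm_min_epu32(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(1, 0, 3, 2)));
        unsigned int minval = (unsigned int)_mm_cvtsi128_si32(m);

        // Position: compare every lane against the broadcast minimum and take
        // the index of the first matching lane (mask has one bit per lane).
        __m128i eq = _mm_cmpeq_epi32(v, _mm_shuffle_epi32(m, 0));
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
        *pos = (mask & 1) ? 0 : (mask & 2) ? 1 : (mask & 4) ? 2 : 3;
        return minval;
    }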

How do I convert _m128i to an unsigned int with SSE?

送分小仙女□ submitted on 2019-12-04 13:36:13
Question: I have made a function for posterizing images.

    // =(
    #define ARGB_COLOR(a, r, g, b) (((a) << 24) | ((r) << 16) | ((g) << 8) | (b))

    inline UINT PosterizeColor(const UINT &color, const float &nColors)
    {
        __m128 clr = _mm_cvtepi32_ps( _mm_cvtepu8_epi32((__m128i&)color) );
        clr = _mm_mul_ps(clr, _mm_set_ps1(nColors / 255.0f) );
        clr = _mm_round_ps(clr, _MM_FROUND_TO_NEAREST_INT);
        clr = _mm_mul_ps(clr, _mm_set_ps1(255.0f / nColors) );
        __m128i iClr = _mm_cvttps_epi32(clr);
        return ARGB_COLOR(iClr.m128i …
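One possible way to address the title question, shown as a sketch (the function name is made up): pack the four 32-bit channel lanes back down to bytes and read out the low 32 bits, rather than going through the MSVC-specific m128i_u32 union fields.

    #include <smmintrin.h>   // SSE4.1: _mm_packus_epi32, _mm_cvtepu8_epi32

    // Repacks four 32-bit lanes (one per colour channel, in the byte order
    // produced by _mm_cvtepu8_epi32 on the original pixel) into a single
    // 32-bit pixel value.
    static inline unsigned int Pack4x32ToPixel(__m128i channels)
    {
        __m128i w = _mm_packus_epi32(channels, channels); // 32 -> 16 bit, unsigned saturation
        __m128i b = _mm_packus_epi16(w, w);               // 16 -> 8 bit, unsigned saturation
        return (unsigned int)_mm_cvtsi128_si32(b);        // low 32 bits = packed pixel
    }

For a single lane, _mm_cvtsi128_si32 (lane 0) or _mm_extract_epi32 (any lane, SSE4.1) are portable alternatives as well.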

Optimising an 1D heat equation using SIMD

眉间皱痕 submitted on 2019-12-04 13:19:45
I am using a CFD code (for computational fluid dynamics). I recently had the chance to see the Intel compiler use SSE in one of my loops, adding a nearly 2x factor to the performance of that loop. However, the use of SSE and SIMD instructions seems to depend more on luck: most of the time the compiler does nothing. I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future. I made a simple 1D heat-transfer code. It consists of two phases, each using the results of the other (U0 -> U1, then U1 -> U0, then U0 -> U1, etc.). When it iterates, it …
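As an illustration of what one such sweep can look like with explicit SSE2 intrinsics (a sketch under assumed names u0, u1 and alpha; the caller alternates the two buffers between phases):

    #include <emmintrin.h>   // SSE2
    #include <stddef.h>

    // One sweep of u1[i] = u0[i] + alpha * (u0[i-1] - 2*u0[i] + u0[i+1]),
    // two doubles per iteration; the boundary points u1[0] and u1[n-1]
    // are left untouched.
    void heat_step(double* u1, const double* u0, size_t n, double alpha)
    {
        if (n < 3) return;                       // nothing interior to update
        const __m128d va  = _mm_set1_pd(alpha);
        const __m128d two = _mm_set1_pd(2.0);
        size_t i = 1;
        for (; i + 2 <= n - 1; i += 2) {
            __m128d left   = _mm_loadu_pd(u0 + i - 1);
            __m128d center = _mm_loadu_pd(u0 + i);
            __m128d right  = _mm_loadu_pd(u0 + i + 1);
            __m128d lap = _mm_add_pd(left,
                          _mm_sub_pd(right, _mm_mul_pd(two, center)));
            _mm_storeu_pd(u1 + i, _mm_add_pd(center, _mm_mul_pd(va, lap)));
        }
        for (; i < n - 1; ++i)                   // scalar tail
            u1[i] = u0[i] + alpha * (u0[i-1] - 2.0*u0[i] + u0[i+1]);
    }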