sse | 易学教程

Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled


Question: I am trying to understand the benefit of SIMD vectorization and wrote a simple demonstrator to measure the speed gain of an algorithm that leverages vectorization (SIMD) over one that does not. Here are the 2 algorithms: Alg_A - No Vector support: #include <stdio.h> #define SIZE 1000000000 int main() { printf("Algorithm with NO vector support\n"); int a[] = {1, 2, 3, 4}; int b[] = {5, 6, 7, 8}; int i = 0; printf("Running loop %d times\n", SIZE); for (; i < SIZE; i++) { a[0] = a[0] + b
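As a rough illustration of what the two variants compare, the sketch below adds the same four-element arrays once with scalar code and once with a single SSE2 packed add. The function names add4_scalar and add4_sse2 are hypothetical and are not taken from the truncated question.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Scalar version: four independent 32-bit additions. */
void add4_scalar(int a[4], const int b[4]) {
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
    a[3] += b[3];
}

/* SSE2 version: one packed 32-bit add covers all four lanes. */
void add4_sse2(int a[4], const int b[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)a, _mm_add_epi32(va, vb));
}
```

With optimization disabled, compilers typically spill every intrinsic result to the stack, so the load and store overhead around the single add can hide most of the expected gain.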

memset in parallel with threads bound to each physical core


I have been testing the code at "In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?" and I'm observing something unexpected. My system is a single-socket Xeon E5-1620, which is an Ivy Bridge processor with 4 physical cores and eight hyper-threads. I'm using Ubuntu 14.04 LTS, Linux Kernel 3.13, GCC 4.9.0, and EGLIBC 2.19. I compile with gcc -fopenmp -O3 mem.c. When I run the code in the link it defaults to eight threads and gives Touch: 11830.448 MB/s Rewrite: 18133.428 MB/s. However, when I bind the threads and set the number of threads to the number of

SSE: reciprocal if not zero


How can I take the reciprocal (inverse) of floats with SSE instructions, but only for non-zero values? Background below: I want to normalize an array of vectors so that each dimension has the same average. In C this can be coded as: float vectors[num * dims]; // input data // step 1. compute the sum on each dimension float norm[dims]; memset(norm, 0, dims * sizeof(float)); for(int i = 0; i < num; i++) for(int j = 0; j < dims; j++) norm[j] += vectors[i * dims + j]; // step 2. convert sums to reciprocal of average for(int j = 0; j < dims; j++) if(norm[j]) norm[j] = float(num) / norm[j]; // step 3.
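One common way to express step 2 with SSE is to divide unconditionally and then mask out the lanes whose input was zero. The sketch below assumes this compare-and-mask approach; the function name reciprocal_nonzero and its parameters are hypothetical, and it uses a full-precision divide rather than the approximate reciprocal instruction.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Replace each non-zero norm[j] with num / norm[j]; zero entries stay zero. */
void reciprocal_nonzero(float *norm, size_t dims, float num) {
    const __m128 vnum = _mm_set1_ps(num);
    const __m128 zero = _mm_setzero_ps();
    size_t j = 0;
    for (; j + 4 <= dims; j += 4) {
        __m128 v = _mm_loadu_ps(norm + j);
        __m128 nonzero = _mm_cmpneq_ps(v, zero);  /* all-ones where v != 0 */
        __m128 q = _mm_div_ps(vnum, v);           /* inf or NaN in zero lanes */
        /* AND with the mask forces the zero lanes back to +0.0f. */
        _mm_storeu_ps(norm + j, _mm_and_ps(q, nonzero));
    }
    for (; j < dims; j++)                         /* scalar tail */
        if (norm[j] != 0.0f) norm[j] = num / norm[j];
}
```

One small difference from the scalar loop: a -0.0f input comes out as +0.0f in the vector path, because the mask clears the sign bit as well.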

Bypass delays when switching execution unit domains


I'm trying to understand possible bypass delays when switching domains of execution units. For example, the following two lines of code give exactly the same result. _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8))); _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40)); Which line of code is better to use? The assembly output for the first line gives: vpslldq xmm1, xmm0, 8 vaddps xmm0, xmm1, xmm0 The assembly output for the second line gives: vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64 ; 00000040H vaddps xmm2, xmm1, XMMWORD PTR [rcx] Now if I look at Agner Fog's
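To make the equivalence concrete, the sketch below wraps the two expressions from the question in helper functions (the names add_shifted_int and add_shifted_fp are hypothetical). Both return [x0, x1, x2 + x0, x3 + x1]. The sketch checks results only and says nothing about which variant is faster on a given microarchitecture.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in SSE) */

/* Variant 1: byte shift in the integer domain, then a float add. */
__m128 add_shifted_int(__m128 x) {
    __m128i shifted = _mm_slli_si128(_mm_castps_si128(x), 8);  /* [0, 0, x0, x1] */
    return _mm_add_ps(x, _mm_castsi128_ps(shifted));
}

/* Variant 2: float shuffle against zero, so every step stays in the FP domain. */
__m128 add_shifted_fp(__m128 x) {
    /* 0x40 selects lanes (zero[0], zero[0], x[0], x[1]) = [0, 0, x0, x1] */
    __m128 shifted = _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
    return _mm_add_ps(x, shifted);
}
```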

How can I set __m128i without using of any SSE instruction?


I have many functions which use the same constant __m128i values. For example: const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16); const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8); const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4); So I want to store all these constants in one place. But there is a problem: I check for the available CPU extensions at run time. If the CPU doesn't support, for example, SSE (or AVX), then the program will crash during initialization of the constants. So is it possible to initialize these constants without using SSE?
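One possible approach, sketched here as an assumption rather than as the accepted answer, is to keep the constants as plain aligned arrays and load them only inside code that already runs behind the run-time CPU check. The array initializers are ordinary static data, so no SSE instruction runs at startup. The names K8_BYTES, K32_WORDS, load_k8 and load_k32 are hypothetical, and _Alignas assumes a C11 compiler.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Plain static data: initialized by the loader, not by SSE code. */
_Alignas(16) static const int8_t K8_BYTES[16] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
};
_Alignas(16) static const int32_t K32_WORDS[4] = {1, 2, 3, 4};

/* Call these only from code paths that passed the run-time SSE2 check. */
__m128i load_k8(void) {
    return _mm_load_si128((const __m128i *)K8_BYTES);   /* aligned load */
}

__m128i load_k32(void) {
    return _mm_load_si128((const __m128i *)K32_WORDS);
}
```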

Fast byte-wise replace if


I have a function that copies binary data from one area to another, but only if the bytes are different from a specific value. Here is a code sample: void copy_if(char* src, char* dest, size_t size, char ignore) { for (size_t i = 0; i < size; ++i) { if (src[i] != ignore) dest[i] = src[i]; } } The problem is that this is too slow for my current need. Is there a way to obtain the same result in a faster way? Update: Based on answers I tried two new implementations: void copy_if_vectorized(const uint8_t* src, uint8_t* dest, size_t size, char ignore) { for (size_t i = 0; i < size; ++i) { char
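A branch-free SSE2 variant is one way to speed up this kind of loop: compare 16 source bytes against the ignored value, then blend source and destination with the resulting mask. This is a sketch under that assumption, not one of the implementations from the truncated update, and the name copy_if_sse2 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Copy src to dest, except where a src byte equals `ignore`. */
void copy_if_sse2(const uint8_t *src, uint8_t *dest, size_t size, uint8_t ignore) {
    const __m128i vignore = _mm_set1_epi8((char)ignore);
    size_t i = 0;
    for (; i + 16 <= size; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dest + i));
        __m128i skip = _mm_cmpeq_epi8(s, vignore);       /* 0xFF where src == ignore */
        /* Keep dest where skip is set, take src elsewhere. */
        __m128i out = _mm_or_si128(_mm_and_si128(skip, d),
                                   _mm_andnot_si128(skip, s));
        _mm_storeu_si128((__m128i *)(dest + i), out);
    }
    for (; i < size; ++i)                                /* scalar tail */
        if (src[i] != ignore) dest[i] = src[i];
}
```

Unlike the scalar loop, this version reads and rewrites every destination byte, including the skipped ones, so it is only equivalent when nothing else modifies dest concurrently.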

Load constant floats into SSE registers


I'm trying to figure out an efficient way to load compile time constant floats into SSE(2/3) registers. I've tried doing simple code like this, const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f }; but that generates 4 movss instructions from memory! movss xmm0,dword ptr [__real@3f800000 (14048E534h)] movss xmm1,dword ptr [__real@40000000 (14048E530h)] movaps xmm6,xmm12 shufps xmm6,xmm12,0C6h movss dword ptr [rsp],xmm0 movss xmm0,dword ptr [__real@40400000 (14048E52Ch)] movss dword ptr [rsp+4],xmm1 movss xmm1,dword ptr [__real@40a00000 (14048E528h)] which load the scalars in and out of memory... (?!?!)
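For comparison with the brace initializer above, the same constant can be written with the set intrinsics. The sketch below shows the two argument orders; the function names make_x_setr and make_x_set are hypothetical. With constant arguments a compiler can usually fold either call into a single load from a constant pool, though the generated code depends on the compiler and its flags.

```c
#include <xmmintrin.h>  /* SSE */

/* _mm_setr_ps takes lanes in memory order: lane 0 first. */
__m128 make_x_setr(void) {
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
}

/* _mm_set_ps takes lanes in reverse order: lane 3 first. */
__m128 make_x_set(void) {
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
}
```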

adding the components of an SSE register


I want to add the four components of an SSE register to get a single float. This is how I do it now: float a[4]; _mm_storeu_ps(a, foo128); float x = a[0] + a[1] + a[2] + a[3]; Is there an SSE instruction that directly achieves this? You could probably use the HADDPS SSE3 instruction, or its compiler intrinsic _mm_hadd_ps. For example, see http://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.80).aspx If you have two registers v1 and v2: v = _mm_hadd_ps(v1, v2); v = _mm_hadd_ps(v, v); Now, v[0] contains the sum of v1's components, and v[1] contains the sum of v2's components. If you want your
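For targets without SSE3, a shuffle-based horizontal sum is a common alternative to HADDPS. The sketch below swaps in that technique and uses only SSE1 intrinsics; the function name hsum_ps is hypothetical.

```c
#include <xmmintrin.h>  /* SSE */

/* Sum the four lanes of v and return the result as a scalar. */
float hsum_ps(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);                  /* [v2, v3, v2, v3] */
    __m128 sum2 = _mm_add_ps(v, hi);                  /* lane 0: v0+v2, lane 1: v1+v3 */
    __m128 odd = _mm_shuffle_ps(sum2, sum2, 0x55);    /* broadcast lane 1 */
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));      /* (v0+v2) + (v1+v3) */
}
```

The additions happen in a different order than a[0] + a[1] + a[2] + a[3], so the result can differ in the last bits for inputs where rounding matters.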

How to control whether C math uses SSE2?


I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern; here's what happens for sin. First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, calls __sin_pentium4, otherwise calls __sin_default. __sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 FPU to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the FPU. __sin_default (in

Detect the availability of SSE/SSE2 instruction set in Visual Studio


Question: How can I check in code whether SSE/SSE2 is enabled or not by the Visual Studio compiler? I have tried #ifdef __SSE__ but it didn't work. Answer 1: From the documentation: _M_IX86_FP Expands to a value indicating which /arch compiler option was used: 0 if /arch:IA32 was used. 1 if /arch:SSE was used. 2 if /arch:SSE2 was used. This value is the default if /arch was not specified. I don't see any mention of _SSE_. Answer 2: Some additional information on _M_IX86_FP. _M_IX86_FP is only defined for 32
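The macro checks above can be combined into one compile-time helper. This is a minimal sketch; the function name compiled_sse_level is hypothetical. It falls back to the __SSE__ and __SSE2__ macros that GCC and Clang define, and assumes SSE2 is always available on x64 MSVC targets.

```c
/* Returns 0 (no SSE), 1 (SSE) or 2 (SSE2 or better), decided at compile time. */
int compiled_sse_level(void) {
#if defined(_M_IX86_FP)                      /* MSVC, 32-bit x86: reflects /arch */
    return _M_IX86_FP >= 2 ? 2 : _M_IX86_FP;
#elif defined(_M_X64) || defined(_M_AMD64)   /* MSVC x64: SSE2 is baseline */
    return 2;
#elif defined(__SSE2__)                      /* GCC / Clang */
    return 2;
#elif defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}
```

This reports only what the compiler was allowed to generate; whether the CPU running the program supports those instructions is a separate run-time check.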