sse

Automatic vectorization of matrix multiplication

Submitted by 锁芯ラ on 2019-12-12 10:19:12

Question: I'm fairly new to SIMD and wanted to see whether I could get GCC to vectorise a simple operation for me. So I looked at this post and wanted to do more or less the same thing (but with GCC 5.4.0 on 64-bit Linux, for a Kaby Lake processor). I essentially have this function:

    /* m1 = N x M matrix, m2 = M x P matrix, m3 = N x P matrix & output */
    void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
    {
        for (i = 0; i < N; i++)
            for (j = 0; j < P; j++) {
                double tmp = 0.0;
                for (k = 0; k …
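One way this loop is commonly made friendlier to GCC's auto-vectorizer (a sketch, assuming the matrices can be stored as contiguous row-major arrays rather than pointer-to-pointer, and with illustrative names not from the original post) is to interchange the k and j loops so the innermost loop runs stride-1 over memory:

```cpp
// m1 = N x M, m2 = M x P, m3 = N x P, all row-major and contiguous.
// The i-k-j loop order makes the inner loop stride-1 over both m2
// and m3, which GCC at -O3 can auto-vectorize without gathers.
void mmul_flat(const double *m1, const double *m2, double *m3,
               int N, int M, int P)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < P; j++)
            m3[i * P + j] = 0.0;
        for (int k = 0; k < M; k++) {
            const double a = m1[i * M + k];
            for (int j = 0; j < P; j++)
                m3[i * P + j] += a * m2[k * P + j];
        }
    }
}
```

With `double **` rows the compiler cannot easily prove the rows don't alias, which is one reason flat arrays (or `__restrict` pointers) tend to vectorize better.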

Fastest way to multiply two vectors of 32bit integers in C++, with SSE

Submitted by 牧云@^-^@ on 2019-12-12 09:53:19

Question: I have two unsigned vectors, both of size 4:

    vector<unsigned> v1 = {2, 4, 6, 8};
    vector<unsigned> v2 = {1, 10, 11, 13};

Now I want to multiply these two vectors element-wise and get a new one:

    vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};

Which SSE operation should I use? Is it cross-platform, or only available on certain platforms? Addendum: if my goal were addition rather than multiplication, I could do this super fast:

    __m128i a = _mm_set_epi32(1, 2, 3, 4);
    __m128i b = _mm_set_epi32(1, 2, 3, 4);
    __m128i c;
    c = _mm_add_epi32 …
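For the multiplication, SSE4.1 added `_mm_mullo_epi32` (`pmulld`), which multiplies four 32-bit lanes and keeps the low 32 bits of each product; it works on any x86 CPU with SSE4.1. A minimal sketch (the `target` attribute is a GCC/Clang extension that enables SSE4.1 code generation for just this function):

```cpp
#include <smmintrin.h>  // SSE4.1 intrinsics

__attribute__((target("sse4.1")))
void mul4_u32(const unsigned *a, const unsigned *b, unsigned *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    // pmulld: lane-wise 32x32 multiply, low 32 bits of each product
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(va, vb));
}
```

On pre-SSE4.1 hardware the usual fallback is two `_mm_mul_epu32` calls plus shuffles, since SSE2 only provides a 32x32→64-bit multiply on even lanes.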

memset in parallel with threads bound to each physical core

Submitted by 吃可爱长大的小学妹 on 2019-12-12 08:44:36

Question: I have been testing the code at "In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?" and I'm observing something unexpected. My system is a single-socket Xeon E5-1620, an Ivy Bridge processor with four physical cores and eight hyper-threads. I'm using Ubuntu 14.04 LTS, Linux kernel 3.13, GCC 4.9.0, and EGLIBC 2.19, and I compile with gcc -fopenmp -O3 mem.c. When I run the code from the link it defaults to eight threads and gives:

    Touch:   11830.448 MB/s
    Rewrite: …
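The linked code uses OpenMP; the same idea can be sketched with `std::thread` (thread-to-core pinning, which the question title asks about, is omitted here for brevity): split the buffer into one contiguous chunk per thread so each core writes its own region.

```cpp
#include <cstring>
#include <cstddef>
#include <thread>
#include <vector>

// Clear a buffer with one contiguous chunk per thread.
void parallel_memset(char *p, int value, std::size_t n, unsigned nthreads)
{
    std::vector<std::thread> pool;
    std::size_t chunk = n / nthreads;
    for (unsigned t = 0; t < nthreads; t++) {
        std::size_t begin = t * chunk;
        // Last thread also takes the remainder.
        std::size_t len = (t == nthreads - 1) ? n - begin : chunk;
        pool.emplace_back([=] { std::memset(p + begin, value, len); });
    }
    for (auto &th : pool) th.join();
}
```

Whether this beats a single-threaded memset depends on whether one core can already saturate memory bandwidth, which is exactly what the question goes on to measure.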

Load constant floats into SSE registers

Submitted by 柔情痞子 on 2019-12-12 08:28:02

Question: I'm trying to figure out an efficient way to load compile-time constant floats into SSE(2/3) registers. I've tried simple code like this:

    const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f };

but that generates four movss instructions from memory!

    movss xmm0, dword ptr [__real@3f800000 (14048E534h)]
    movss xmm1, dword ptr [__real@40000000 (14048E530h)]
    movaps xmm6, xmm12
    shufps xmm6, xmm12, 0C6h
    movss dword ptr [rsp], xmm0
    movss xmm0, dword ptr [__real@40400000 (14048E52Ch)]
    movss dword ptr [rsp+4], xmm1 …
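One common workaround, assuming the constants really are known at compile time, is to keep them in a static 16-byte-aligned array and issue a single aligned load; compilers then typically emit one `movaps` from read-only data instead of four `movss` loads:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

__m128 load_constants()
{
    // Static and 16-byte aligned: the four floats live contiguously
    // in .rodata, so a single aligned load fetches all of them.
    alignas(16) static const float k[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return _mm_load_ps(k);
}
```

`_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)` usually compiles to the same single load at optimization; the per-element `movss` sequence in the question tends to appear in unoptimized or register-pressured code.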

Fastest 50% scaling of (A)RGB32 images using sse intrinsics

Submitted by 你。 on 2019-12-12 08:12:35

Question: I want to scale down images as fast as I can in C++. This article describes how to efficiently average 32-bit RGB images down to 50%. It is fast and looks good. I have tried modifying that approach using SSE intrinsics. The code below works, with or without SSE enabled. Surprisingly, though, the speedup is negligible. Can anybody see a way to improve the SSE code? The two lines creating the vars shuffle1 and shuffle2 seem to be candidates (using some clever shifting or similar).

    /*
     * Calculates …
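For reference, the core of a 50% box filter maps well onto SSE2's `_mm_avg_epu8` (`pavgb`), which averages unsigned bytes lane-wise with rounding: average the two source rows byte-wise, then average each horizontal pair of pixels. A sketch of the vertical half, with hypothetical names:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cstddef>

// Average two rows of ARGB32 pixels byte-wise with rounding
// (pavgb). This is the vertical step of a 50% box filter; the
// horizontal pair-average follows the same pattern after a shuffle.
void average_rows(const std::uint8_t *row0, const std::uint8_t *row1,
                  std::uint8_t *out, std::size_t nbytes)
{
    for (std::size_t i = 0; i + 16 <= nbytes; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(row0 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(row1 + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_avg_epu8(a, b));
    }
}
```

When an SSE version of such a filter shows negligible speedup, the workload is often memory-bound: the loads and stores, not the arithmetic, dominate.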

What is my compiler doing? (optimizing memcpy)

Submitted by 断了今生、忘了曾经 on 2019-12-12 07:50:36

Question: I'm compiling a bit of code in VC++ 2010 with the following settings: /O2 /Ob2 /Oi /Ot. However, I'm having trouble understanding some parts of the generated assembly; I have put my questions in the code as comments. Also, what prefetching distance is generally recommended on modern CPUs? I can of course test on my own CPU, but I was hoping for a value that works well across a wider range of CPUs. Maybe one could use dynamic prefetching distances? Edit: another thing I'm surprised about …
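As a point of reference for the prefetch-distance question, a hand-written copy loop can be sketched like this; the 512-byte lookahead is purely an illustrative starting point, since the right distance depends on the CPU's memory latency and must be tuned by measurement:

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstring>
#include <cstddef>

void copy_with_prefetch(char *dst, const char *src, std::size_t n)
{
    const std::size_t kDist = 512;  // illustrative prefetch distance
    std::size_t i = 0;
    for (; i + 64 <= n; i += 64) {
        if (i + kDist < n)
            _mm_prefetch(src + i + kDist, _MM_HINT_T0);
        std::memcpy(dst + i, src + i, 64);  // one cache line
    }
    std::memcpy(dst + i, src + i, n - i);   // tail
}
```

Modern hardware prefetchers already detect sequential streams well, which is one reason explicit prefetch often helps less than expected on recent CPUs.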

how to deinterleave image channel in SSE

Submitted by 六眼飞鱼酱① on 2019-12-12 05:11:53

Question: Is there any way to de-interleave the channels of a 32bpp image in SSE, similar to the NEON code below?

    // Read all r, g, b, a pixels into 4 registers
    uint8x8x4_t SrcPixels8x8x4 = vld4_u8(inPixel32);
    ChannelR1_32x4 = vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0])));
    channelR2_32x4 = vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0]))); …

Basically I want all colour channels in separate vectors, with each vector holding 4 elements of 32 bits, to do some calculations, but I am not very …
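SSE has no direct counterpart to NEON's `vld4`, but SSSE3's `_mm_shuffle_epi8` (`pshufb`) can gather one channel's bytes from four 32bpp pixels, after which two zero-extending unpacks widen them to 32 bits. A sketch extracting channel 0 of four pixels (which colour sits in byte 0 depends on the pixel layout, so that is an assumption here):

```cpp
#include <tmmintrin.h>  // SSSE3
#include <cstdint>

__attribute__((target("ssse3")))
void extract_channel0(const std::uint8_t *pixels,  // 4 x 32bpp pixels
                      std::uint32_t *out)          // 4 x 32-bit values
{
    __m128i src = _mm_loadu_si128((const __m128i *)pixels);
    // Gather bytes 0, 4, 8, 12 (channel 0 of each pixel) into the
    // low 4 bytes; 0x80 in a control byte zeroes that output lane.
    const __m128i ctrl = _mm_set_epi8(
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,
        12, 8, 4, 0);
    __m128i ch = _mm_shuffle_epi8(src, ctrl);
    // Zero-extend bytes to 32-bit lanes (SSE2 unpacks).
    __m128i zero = _mm_setzero_si128();
    ch = _mm_unpacklo_epi8(ch, zero);   // bytes  -> 16-bit
    ch = _mm_unpacklo_epi16(ch, zero);  // 16-bit -> 32-bit
    _mm_storeu_si128((__m128i *)out, ch);
}
```

The other three channels use the same pattern with control bytes 1/5/9/13, 2/6/10/14, and 3/7/11/15.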

SSE byte and half word swapping

Submitted by 拥有回忆 on 2019-12-12 01:22:50

Question: I would like to translate this code using SSE intrinsics:

    for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4)
    {
        uint32_t value = *(uint32_t*)src;
        *(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16);
    }

Is anyone aware of an intrinsic to perform the 16-bit word swap?

Answer 1: pshufb (SSSE3) should be faster than two shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion instead of just a word swap. Stealing Paul R's function …
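A sketch of the pshufb idea: control bytes 2, 3, 0, 1 in each 32-bit lane swap that lane's two 16-bit halves (changing them to 1, 0, 3, 2 would instead byte-swap each half, giving a full endian conversion). Shown here on a single value for clarity; a real loop would load 16 bytes and swap four values per shuffle.

```cpp
#include <tmmintrin.h>  // SSSE3
#include <cstdint>

__attribute__((target("ssse3")))
std::uint32_t swap_halfwords(std::uint32_t value)
{
    // Per 32-bit lane, output byte i takes input byte ctrl[i]:
    // 2,3,0,1 exchanges the low and high 16-bit halves.
    const __m128i ctrl = _mm_set_epi8(13, 12, 15, 14, 9, 8, 11, 10,
                                      5, 4, 7, 6, 1, 0, 3, 2);
    __m128i v = _mm_cvtsi32_si128((int)value);
    v = _mm_shuffle_epi8(v, ctrl);
    return (std::uint32_t)_mm_cvtsi128_si32(v);
}
```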

SIMD intrinsic and memory bus size - How CPU fetches all 128/256 bits in a single memory read?

Submitted by 核能气质少年 on 2019-12-11 17:01:04

Question: Hello forum – I have a few similar/related questions about SIMD intrinsics for which I searched online, including Stack Overflow, but did not find good answers, so I'm requesting your help. Basically I am trying to understand how a 64-bit CPU fetches all 128 bits in a single read, and what the requirements are for such an operation. Would the CPU fetch all 128 bits from memory in a single memory operation, or would it do two 64-bit reads? Do CPU manufacturers demand a certain memory bus width, for example …

Substitute a byte with another one

谁说胖子不能爱 提交于 2019-12-11 15:39:48
问题 I am finding difficulties in creating a code for this seemingly easy problem. Given a packed 8 bits integer, substitute one byte with another if present. For instance, I want to substitute 0x06 with 0x01 , so I can do the following with res as the input to find 0x06 : // Bytes to be manipulated res = _mm_set_epi8(0x00, 0x03, 0x02, 0x06, 0x0F, 0x02, 0x02, 0x06, 0x0A, 0x03, 0x02, 0x06, 0x00, 0x00, 0x02, 0x06); // Target value and substitution val = _mm_set1_epi8(0x06); sub = _mm_set1_epi8(0x01)