sse

Automatic vectorization of matrix multiplication

Submitted by 锁芯ラ on 2019-12-12 10:19:12

Question: I'm fairly new to SIMD and wanted to see whether I could get GCC to vectorise a simple operation for me. So I looked at this post and wanted to do more or less the same thing (but with GCC 5.4.0 on 64-bit Linux, for a Kaby Lake processor). I essentially have this function:

    /* m1 = N x M matrix, m2 = M x P matrix, m3 = N x P matrix & output */
    void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
    {
        for (i = 0; i < N; i++)
            for (j = 0; j < P; j++) {
                double tmp = 0.0;
                for (k = 0; k …
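One way this loop is commonly made friendlier to GCC's auto-vectorizer (a sketch, assuming the matrices can be stored as contiguous row-major arrays rather than pointer-to-pointer, and with illustrative names not from the original post) is to interchange the k and j loops so the innermost loop runs stride-1 over memory:

```cpp
// m1 = N x M, m2 = M x P, m3 = N x P, all row-major and contiguous.
// The i-k-j loop order makes the inner loop stride-1 over both m2
// and m3, which GCC at -O3 can auto-vectorize without gathers.
void mmul_flat(const double *m1, const double *m2, double *m3,
               int N, int M, int P)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < P; j++)
            m3[i * P + j] = 0.0;
        for (int k = 0; k < M; k++) {
            const double a = m1[i * M + k];
            for (int j = 0; j < P; j++)
                m3[i * P + j] += a * m2[k * P + j];
        }
    }
}
```

With `double **` rows the compiler cannot easily prove the rows don't alias, which is one reason flat arrays (or `__restrict` pointers) tend to vectorize better.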

Fastest way to multiply two vectors of 32bit integers in C++, with SSE

Submitted by 牧云@^-^@ on 2019-12-12 09:53:19

Question: I have two unsigned vectors, both of size 4:

    vector<unsigned> v1 = {2, 4, 6, 8};
    vector<unsigned> v2 = {1, 10, 11, 13};

Now I want to multiply these two vectors element-wise and get a new one:

    vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};

Which SSE operation should I use? Is it cross-platform, or only available on certain platforms? Addendum: if my goal were addition rather than multiplication, I could do this super fast:

    __m128i a = _mm_set_epi32(1, 2, 3, 4);
    __m128i b = _mm_set_epi32(1, 2, 3, 4);
    __m128i c;
    c = _mm_add_epi32 …
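For the multiplication, SSE4.1 added `_mm_mullo_epi32` (`pmulld`), which multiplies four 32-bit lanes and keeps the low 32 bits of each product; it works on any x86 CPU with SSE4.1. A minimal sketch (the `target` attribute is a GCC/Clang extension that enables SSE4.1 code generation for just this function):

```cpp
#include <smmintrin.h>  // SSE4.1 intrinsics

__attribute__((target("sse4.1")))
void mul4_u32(const unsigned *a, const unsigned *b, unsigned *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    // pmulld: lane-wise 32x32 multiply, low 32 bits of each product
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(va, vb));
}
```

On pre-SSE4.1 hardware the usual fallback is two `_mm_mul_epu32` calls plus shuffles, since SSE2 only provides a 32x32→64-bit multiply on even lanes.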

memset in parallel with threads bound to each physical core

Submitted by 吃可爱长大的小学妹 on 2019-12-12 08:44:36

Question: I have been testing the code at "In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?" and I'm observing something unexpected. My system is a single-socket Xeon E5-1620, an Ivy Bridge processor with four physical cores and eight hyper-threads. I'm using Ubuntu 14.04 LTS, Linux kernel 3.13, GCC 4.9.0, and EGLIBC 2.19, and I compile with gcc -fopenmp -O3 mem.c. When I run the code from the link it defaults to eight threads and gives:

    Touch:   11830.448 MB/s
    Rewrite: …
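The linked code uses OpenMP; the same idea can be sketched with `std::thread` (thread-to-core pinning, which the question title asks about, is omitted here for brevity): split the buffer into one contiguous chunk per thread so each core writes its own region.

```cpp
#include <cstring>
#include <cstddef>
#include <thread>
#include <vector>

// Clear a buffer with one contiguous chunk per thread.
void parallel_memset(char *p, int value, std::size_t n, unsigned nthreads)
{
    std::vector<std::thread> pool;
    std::size_t chunk = n / nthreads;
    for (unsigned t = 0; t < nthreads; t++) {
        std::size_t begin = t * chunk;
        // Last thread also takes the remainder.
        std::size_t len = (t == nthreads - 1) ? n - begin : chunk;
        pool.emplace_back([=] { std::memset(p + begin, value, len); });
    }
    for (auto &th : pool) th.join();
}
```

Whether this beats a single-threaded memset depends on whether one core can already saturate memory bandwidth, which is exactly what the question goes on to measure.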

Load constant floats into SSE registers

Submitted by 柔情痞子 on 2019-12-12 08:28:02

Question: I'm trying to figure out an efficient way to load compile-time constant floats into SSE(2/3) registers. I've tried simple code like this:

    const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f };

but that generates four movss instructions from memory!

    movss xmm0, dword ptr [__real@3f800000 (14048E534h)]
    movss xmm1, dword ptr [__real@40000000 (14048E530h)]
    movaps xmm6, xmm12
    shufps xmm6, xmm12, 0C6h
    movss dword ptr [rsp], xmm0
    movss xmm0, dword ptr [__real@40400000 (14048E52Ch)]
    movss dword ptr [rsp+4], xmm1 …
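One common workaround, assuming the constants really are known at compile time, is to keep them in a static 16-byte-aligned array and issue a single aligned load; compilers then typically emit one `movaps` from read-only data instead of four `movss` loads:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

__m128 load_constants()
{
    // Static and 16-byte aligned: the four floats live contiguously
    // in .rodata, so a single aligned load fetches all of them.
    alignas(16) static const float k[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return _mm_load_ps(k);
}
```

`_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)` usually compiles to the same single load at optimization; the per-element `movss` sequence in the question tends to appear in unoptimized or register-pressured code.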

Fastest 50% scaling of (A)RGB32 images using sse intrinsics

Submitted by 你。 on 2019-12-12 08:12:35

Question: I want to scale down images as fast as I can in C++. This article describes how to efficiently average 32-bit RGB images down to 50%. It is fast and looks good. I have tried modifying that approach using SSE intrinsics. The code below works, with or without SSE enabled. Surprisingly, though, the speedup is negligible. Can anybody see a way to improve the SSE code? The two lines creating the vars shuffle1 and shuffle2 seem to be candidates (using some clever shifting or similar).

    /*
     * Calculates …
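For reference, the core of a 50% box filter maps well onto SSE2's `_mm_avg_epu8` (`pavgb`), which averages unsigned bytes lane-wise with rounding: average the two source rows byte-wise, then average each horizontal pair of pixels. A sketch of the vertical half, with hypothetical names:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cstddef>

// Average two rows of ARGB32 pixels byte-wise with rounding
// (pavgb). This is the vertical step of a 50% box filter; the
// horizontal pair-average follows the same pattern after a shuffle.
void average_rows(const std::uint8_t *row0, const std::uint8_t *row1,
                  std::uint8_t *out, std::size_t nbytes)
{
    for (std::size_t i = 0; i + 16 <= nbytes; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(row0 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(row1 + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_avg_epu8(a, b));
    }
}
```

When an SSE version of such a filter shows negligible speedup, the workload is often memory-bound: the loads and stores, not the arithmetic, dominate.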

What is my compiler doing? (optimizing memcpy)

Submitted by 断了今生、忘了曾经 on 2019-12-12 07:50:36

Question: I'm compiling a bit of code in VC++ 2010 with the following settings: /O2 /Ob2 /Oi /Ot. However, I'm having trouble understanding some parts of the generated assembly; I have put my questions in the code as comments. Also, what prefetching distance is generally recommended on modern CPUs? I can of course test on my own CPU, but I was hoping for a value that works well across a wider range of CPUs. Maybe one could use dynamic prefetching distances? Edit: another thing I'm surprised about …
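As a point of reference for the prefetch-distance question, a hand-written copy loop can be sketched like this; the 512-byte lookahead is purely an illustrative starting point, since the right distance depends on the CPU's memory latency and must be tuned by measurement:

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstring>
#include <cstddef>

void copy_with_prefetch(char *dst, const char *src, std::size_t n)
{
    const std::size_t kDist = 512;  // illustrative prefetch distance
    std::size_t i = 0;
    for (; i + 64 <= n; i += 64) {
        if (i + kDist < n)
            _mm_prefetch(src + i + kDist, _MM_HINT_T0);
        std::memcpy(dst + i, src + i, 64);  // one cache line
    }
    std::memcpy(dst + i, src + i, n - i);   // tail
}
```

Modern hardware prefetchers already detect sequential streams well, which is one reason explicit prefetch often helps less than expected on recent CPUs.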

how to deinterleave image channel in SSE

Submitted by 六眼飞鱼酱① on 2019-12-12 05:11:53

Question: Is there any way to de-interleave the channels of a 32bpp image in SSE, similar to the NEON code below?

    // Read all r, g, b, a pixels into 4 registers
    uint8x8x4_t SrcPixels8x8x4 = vld4_u8(inPixel32);
    ChannelR1_32x4 = vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0])));
    channelR2_32x4 = vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0]))); …

Basically I want all colour channels in separate vectors, with each vector holding 4 elements of 32 bits, to do some calculations, but I am not very …
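SSE has no direct counterpart to NEON's `vld4`, but SSSE3's `_mm_shuffle_epi8` (`pshufb`) can gather one channel's bytes from four 32bpp pixels, after which two zero-extending unpacks widen them to 32 bits. A sketch extracting channel 0 of four pixels (which colour sits in byte 0 depends on the pixel layout, so that is an assumption here):

```cpp
#include <tmmintrin.h>  // SSSE3
#include <cstdint>

__attribute__((target("ssse3")))
void extract_channel0(const std::uint8_t *pixels,  // 4 x 32bpp pixels
                      std::uint32_t *out)          // 4 x 32-bit values
{
    __m128i src = _mm_loadu_si128((const __m128i *)pixels);
    // Gather bytes 0, 4, 8, 12 (channel 0 of each pixel) into the
    // low 4 bytes; 0x80 in a control byte zeroes that output lane.
    const __m128i ctrl = _mm_set_epi8(
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,
        12, 8, 4, 0);
    __m128i ch = _mm_shuffle_epi8(src, ctrl);
    // Zero-extend bytes to 32-bit lanes (SSE2 unpacks).
    __m128i zero = _mm_setzero_si128();
    ch = _mm_unpacklo_epi8(ch, zero);   // bytes  -> 16-bit
    ch = _mm_unpacklo_epi16(ch, zero);  // 16-bit -> 32-bit
    _mm_storeu_si128((__m128i *)out, ch);
}
```

The other three channels use the same pattern with control bytes 1/5/9/13, 2/6/10/14, and 3/7/11/15.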

SSE byte and half word swapping

Submitted by 拥有回忆 on 2019-12-12 01:22:50

Question: I would like to translate this code using SSE intrinsics:

    for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4)
    {
        uint32_t value = *(uint32_t*)src;
        *(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16);
    }

Is anyone aware of an intrinsic to perform the 16-bit word swap?

Answer 1: pshufb (SSSE3) should be faster than two shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion instead of just a word swap. Stealing Paul R's function …
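A sketch of the pshufb idea: control bytes 2, 3, 0, 1 in each 32-bit lane swap that lane's two 16-bit halves (changing them to 1, 0, 3, 2 would instead byte-swap each half, giving a full endian conversion). Shown here on a single value for clarity; a real loop would load 16 bytes and swap four values per shuffle.

```cpp
#include <tmmintrin.h>  // SSSE3
#include <cstdint>

__attribute__((target("ssse3")))
std::uint32_t swap_halfwords(std::uint32_t value)
{
    // Per 32-bit lane, output byte i takes input byte ctrl[i]:
    // 2,3,0,1 exchanges the low and high 16-bit halves.
    const __m128i ctrl = _mm_set_epi8(13, 12, 15, 14, 9, 8, 11, 10,
                                      5, 4, 7, 6, 1, 0, 3, 2);
    __m128i v = _mm_cvtsi32_si128((int)value);
    v = _mm_shuffle_epi8(v, ctrl);
    return (std::uint32_t)_mm_cvtsi128_si32(v);
}
```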

SIMD intrinsic and memory bus size - How CPU fetches all 128/256 bits in a single memory read?

Submitted by 核能气质少年 on 2019-12-11 17:01:04

Question: Hello forum – I have a few similar/related questions about SIMD intrinsics for which I searched online, including Stack Overflow, but did not find good answers, so I'm requesting your help. Basically I am trying to understand how a 64-bit CPU fetches all 128 bits in a single read, and what the requirements are for such an operation. Would the CPU fetch all 128 bits from memory in a single memory operation, or would it do two 64-bit reads? Do CPU manufacturers demand a certain memory bus width, for example …

Substitute a byte with another one

谁说胖子不能爱 提交于 2019-12-11 15:39:48
问题 I am finding difficulties in creating a code for this seemingly easy problem. Given a packed 8 bits integer, substitute one byte with another if present. For instance, I want to substitute 0x06 with 0x01 , so I can do the following with res as the input to find 0x06 : // Bytes to be manipulated res = _mm_set_epi8(0x00, 0x03, 0x02, 0x06, 0x0F, 0x02, 0x02, 0x06, 0x0A, 0x03, 0x02, 0x06, 0x00, 0x00, 0x02, 0x06); // Target value and substitution val = _mm_set1_epi8(0x06); sub = _mm_set1_epi8(0x01)