sse

Matrix-vector-multiplication in AVX not proportionately faster than in SSE

ぐ巨炮叔叔 Submitted on 2019-12-07 02:47:07
Question: I was writing a matrix-vector multiplication in both SSE and AVX, using the following:

```c
for (size_t i = 0; i < M; i++) {
    size_t index = i * N;
    __m128 a, x, r1;
    __m128 sum = _mm_setzero_ps();
    for (size_t j = 0; j < N; j += 4, index += 4) {
        a = _mm_load_ps(&A[index]);
        x = _mm_load_ps(&X[j]);
        r1 = _mm_mul_ps(a, x);
        sum = _mm_add_ps(r1, sum);
    }
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    _mm_store_ss(&C[i], sum);
}
```

I used a similar method for AVX; however, at the end, since AVX doesn't have an equivalent

Getting GCC to generate a PTEST instruction when using vector extensions

馋奶兔 Submitted on 2019-12-07 01:54:02
Question: When using the GCC vector extensions for C, how can I check that all the values in a vector are zero? For instance:

```c
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui* foo(v8ui *mem) {
    v8ui v;
    for (v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
         v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
         mem++)
        v &= *(mem);
    return mem;
}
```

SSE4.1 has the PTEST instruction, which allows running a test like the one used as the `for` condition, but the code generated by GCC

Constant floats with SIMD

喜欢而已 Submitted on 2019-12-07 00:02:34
Question: I've been trying my hand at optimising some code I have using Microsoft's SSE intrinsics. One of the biggest problems when optimising my code is the LHS (load-hit-store) that happens whenever I want to use a constant. There seems to be some info on generating certain constants (here and here - section 13.4), but it's all assembly (which I would rather avoid). The problem is that when I try to implement the same thing with intrinsics, MSVC complains about incompatible types, etc. Does anyone know of any equivalent
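A hedged intrinsics translation of the register-only constant trick from that kind of material: start from all-ones (the `pcmpeqd` idiom) and shift to carve out the bit pattern of `1.0f` (`0x3F800000`), so no constant is ever loaded from memory. The helper name is illustrative.

```cpp
#include <emmintrin.h>

// Produce {1.0f, 1.0f, 1.0f, 1.0f} without a memory load.
static inline __m128 ones_ps(void) {
    __m128i t = _mm_set1_epi32(-1);                // all-ones; compilers emit pcmpeqd
    t = _mm_slli_epi32(_mm_srli_epi32(t, 25), 23); // 0xFFFFFFFF -> 0x7F -> 0x3F800000
    return _mm_castsi128_ps(t);                    // reinterpret as 1.0f per lane
}
```

The cast intrinsics (`_mm_castsi128_ps` / `_mm_castps_si128`) are what resolve the "incompatible types" complaints: they reinterpret bits between integer and float vectors without generating any instructions.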

Fastest way to expand bits in a field to all (overlapping + adjacent) set bits in a mask?

天涯浪子 Submitted on 2019-12-06 19:03:03
Question: Say I have two binary inputs named IN and MASK. The actual field size could be 32 to 256 bits, depending on which instruction set is used to accomplish the task. Both inputs change every call.

Inputs:
IN   = ...1100010010010100...
MASK = ...0001111010111011...
Output:
OUT  = ...0001111010111000...

Edit: another example result from some comment discussion
IN   = ...11111110011010110...
MASK = ...01011011001111110...
Output:
OUT  = ...01011011001111110...

I want to get the contiguous adjacent 1 bits of MASK
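A simple scalar reference for the operation (a sketch of the semantics, not the fastest way): smear the bits of `IN & MASK` left and right inside each contiguous group of MASK bits until a fixpoint is reached, so any group touched by IN fills entirely.

```cpp
#include <stdint.h>

// Expand each set bit of (in & mask) to cover its whole contiguous
// run of mask bits; untouched runs stay zero.
static uint32_t expand_into_mask(uint32_t in, uint32_t mask) {
    uint32_t x = in & mask, prev;
    do {
        prev = x;
        x |= ((x << 1) | (x >> 1)) & mask;  // spread to adjacent mask bits
    } while (x != prev);
    return x;
}
```

This loop reproduces both examples above and is useful as a correctness oracle when testing a faster SIMD or carry-propagation version.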

How to rewrite this code to SSE intrinsics

折月煮酒 Submitted on 2019-12-06 16:06:32
I'm new to SSE intrinsics and would appreciate some hints and assistance in using them (as this is still foggy to me). I have this code:

```c
for (int k = 0; k <= n-4; k += 4) {
    int xc0 = 512 + ((idx + k*iddx) >> 6);
    int yc0 = 512 + ((idy + k*iddy) >> 6);
    int xc1 = 512 + ((idx + (k+1)*iddx) >> 6);
    int yc1 = 512 + ((idy + (k+1)*iddy) >> 6);
    int xc2 = 512 + ((idx + (k+2)*iddx) >> 6);
    int yc2 = 512 + ((idy + (k+2)*iddy) >> 6);
    int xc3 = 512 + ((idx + (k+3)*iddx) >> 6);
    int yc3 = 512 + ((idy + (k+3)*iddy) >> 6);
    unsigned color0 = working_buffer[yc0*working_buffer_size_x + xc0];
    unsigned color1 = working_buffer[yc1*working_buffer_size
```
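The coordinate arithmetic vectorizes cleanly. A hedged SSE2 sketch (the function name and layout are illustrative): compute `xc` for k, k+1, k+2, k+3 in one vector. SSE2 has no 32-bit `_mm_mullo_epi32` (that arrived with SSE4.1), so the `k*iddx` terms are built as a base plus a `{0,1,2,3}*iddx` offset vector instead.

```cpp
#include <emmintrin.h>

// Four xc values at once; the yc computation is identical with idy/iddy.
static void xc4(int idx, int iddx, int k, int out[4]) {
    __m128i v = _mm_add_epi32(_mm_set1_epi32(idx + k * iddx),
                              _mm_setr_epi32(0, iddx, 2 * iddx, 3 * iddx));
    v = _mm_srai_epi32(v, 6);                  // arithmetic shift, matches >>6
    v = _mm_add_epi32(v, _mm_set1_epi32(512));
    _mm_storeu_si128((__m128i *)out, v);
}
```

The `working_buffer` reads still have to be done as four scalar loads, since SSE2 has no gather instruction.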

ZeroMemory in SSE

六眼飞鱼酱① Submitted on 2019-12-06 15:10:00
I need a simple ZeroMemory implementation with SSE (SSE2 preferred). Can someone help with that? I was searching through SO and the net but found no direct answer.

Is ZeroMemory() or memset() not good enough?

Disclaimer: some of the following may be SSE3.

- Fill any unaligned leading bytes by looping until the address is a multiple of 16.
- push to save an xmm reg; pxor to zero the xmm reg.
- While the remaining length >= 16, movdqa or movntdq to do the write.
- pop to restore the xmm reg.
- Fill any unaligned trailing bytes.

movntdq may appear to be faster because it tells the processor to not bring the data
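Those steps can be sketched in SSE2 intrinsics rather than assembly (the function name is illustrative; in C there are no registers to push/pop): a scalar head until 16-byte alignment, aligned 16-byte stores in the middle, and a scalar tail.

```cpp
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

// Zero n bytes at dst using aligned 16-byte SSE2 stores where possible.
static void zero_sse2(void *dst, size_t n) {
    uint8_t *p = (uint8_t *)dst;
    while (n && ((uintptr_t)p & 15)) { *p++ = 0; n--; }   // leading bytes
    __m128i z = _mm_setzero_si128();                      // the pxor idiom
    for (; n >= 16; p += 16, n -= 16)
        _mm_store_si128((__m128i *)p, z);  // or _mm_stream_si128 for movntdq
    while (n--) *p++ = 0;                                 // trailing bytes
}
```

For what it's worth, a good `memset` implementation already does essentially this (plus wider paths), so this is mainly useful as a learning exercise.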

SSE optimized code performs similar to plain version

帅比萌擦擦* Submitted on 2019-12-06 13:30:26
I wanted to take my first steps with Intel's SSE, so I followed the guide published here, with the difference that instead of developing for Windows and C++ I did it for Linux and C (therefore I don't use any _aligned_malloc but posix_memalign). I also implemented one computation-intensive method without making use of the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time to run, with the SSE version usually slightly slower than the other. Is that normal? Could it be possible that GCC
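One plausible explanation (a hedged guess, not a diagnosis): at `-O2`/`-O3` GCC often auto-vectorizes the plain loop, so both versions end up running similar SIMD code; and at `-O0`, intrinsics code can easily be slower than scalar. Whatever the cause, averaging over many repetitions helps make the comparison meaningful; a minimal timing harness sketch (the function name is illustrative):

```cpp
#include <chrono>

// Average wall-clock milliseconds per call of f over reps repetitions.
template <class F>
static double avg_ms(F f, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
}
```

Comparing the generated assembly of both versions (`gcc -O3 -S`) is the quickest way to confirm whether the compiler vectorized the plain loop.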

The correct way to sum two arrays with SSE2 SIMD in C++

孤人 Submitted on 2019-12-06 12:31:27
Question: Let's start by including the following:

```cpp
#include <vector>
#include <random>
using namespace std;
```

Now, suppose that one has the following three std::vector<float>:

```cpp
const int N = 1048576;
vector<float> a(N);
vector<float> b(N);
vector<float> c(N);

default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(0.0f, 1.0f);
for (int i = 0; i < N; i++) {
    a[i] = diceroll(randomGenerator);
    b[i] = diceroll(randomGenerator);
}
```

Now, assume that one needs to sum a and b element-wise and
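A hedged SSE2 sketch of the element-wise sum (the function name is illustrative): process four floats per iteration with unaligned loads and stores, since a `std::vector` buffer is not guaranteed to be 16-byte aligned, then finish any remainder scalar-wise.

```cpp
#include <emmintrin.h>
#include <vector>

// c[i] = a[i] + b[i] for all i, four lanes at a time.
static void add_sse2(const std::vector<float>& a,
                     const std::vector<float>& b,
                     std::vector<float>& c) {
    size_t i = 0, n = c.size();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    for (; i < n; i++) c[i] = a[i] + b[i];   // scalar tail
}
```

With N = 1048576 (a multiple of 4) the tail loop never runs, but keeping it makes the helper correct for any size.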

How to avoid floating point round off error in unit tests?

爱⌒轻易说出口 Submitted on 2019-12-06 12:18:45
Question: I'm trying to write unit tests for some simple vector math functions that operate on arrays of single-precision floating point numbers. The functions use SSE intrinsics, and I'm getting false positives (at least I think so) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round-off error. Here is a snippet of unit test code and output (my actual question(s) follow):

Test setup: static const int N = 1024;
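A likely cause of the 32-bit/64-bit difference (a hedged diagnosis): 32-bit x86 builds default to x87 math, which keeps intermediates at 80-bit extended precision, while the SSE path rounds every operation to 32 bits, so a scalar reference and a SIMD implementation drift apart. Regardless, unit tests on floats should compare with a tolerance instead of `==`; a sketch (tolerance values are illustrative and should be tuned to the operation's accumulated error):

```cpp
#include <algorithm>
#include <cmath>

// True when a and b agree within an absolute or relative tolerance.
static bool almost_equal(float a, float b,
                         float abs_tol = 1e-6f, float rel_tol = 1e-5f) {
    float diff  = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

Alternatively, compiling the 32-bit build with `-mfpmath=sse -msse2` makes the scalar reference round the same way as the intrinsics and usually removes the discrepancy.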

Some Mandelbrot drawing routine from C to SSE2

一世执手 Submitted on 2019-12-06 11:15:44
I want to rewrite this simple routine in SSE2 code (preferably in NASM), and I am not totally sure how to do it. Two things are unclear: how to express the calculations (the inner loop, and those from the outer loop too), and how to call the C function SetPixelInDibInt(i, j, palette[n]); from statically linked asm code.

```cpp
void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER) {
    double ly = lx * double(CLIENT_Y) / double(CLIENT_X);
    double dx = lx / CLIENT_X;
    double dy = ly / CLIENT_Y;
    double ax = ox - lx * 0.5 + dx * 0.5;
    double ay = oy - ly * 0.5 + dy * 0.5;
    static double re, im, re_n, im_n, c
```
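The inner loop maps naturally onto two-wide double SIMD. A hedged C-intrinsics sketch of the same dataflow a NASM version would use (the function name and output layout are illustrative): iterate two pixels at once, test escape with `cmpltpd` + `movmskpd`, and accumulate per-lane iteration counts by subtracting the all-ones compare mask.

```cpp
#include <emmintrin.h>

// Iterate z = z^2 + c for two points (cr0, ci) and (cr1, ci) at once;
// out[k] receives the number of iterations lane k stayed inside |z|^2 < 4.
static void mandel2(double cr0, double cr1, double ci,
                    int n_iter, long long out[2]) {
    __m128d cre = _mm_setr_pd(cr0, cr1), cim = _mm_set1_pd(ci);
    __m128d re = _mm_setzero_pd(), im = _mm_setzero_pd();
    __m128d four = _mm_set1_pd(4.0), two = _mm_set1_pd(2.0);
    __m128i counts = _mm_setzero_si128();
    for (int n = 0; n < n_iter; n++) {
        __m128d re2 = _mm_mul_pd(re, re), im2 = _mm_mul_pd(im, im);
        __m128d inside = _mm_cmplt_pd(_mm_add_pd(re2, im2), four);
        if (!_mm_movemask_pd(inside)) break;            // both lanes escaped
        counts = _mm_sub_epi64(counts, _mm_castpd_si128(inside)); // -(-1) = +1
        __m128d t = _mm_add_pd(_mm_sub_pd(re2, im2), cre);
        im = _mm_add_pd(_mm_mul_pd(two, _mm_mul_pd(re, im)), cim);
        re = t;
    }
    _mm_storeu_si128((__m128i *)out, counts);
}
```

As for calling SetPixelInDibInt from NASM: it follows the platform's C calling convention, so the asm side pushes/loads the arguments per that ABI and calls the symbol (declared `extern`) directly; doing the per-pixel call from C after the vector loop, as above, sidesteps the issue entirely.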