sse

What happens with a non-temporal store if the data is already in cache?

筅森魡賤 submitted on 2019-12-10 15:57:48
Question: When you use non-temporal stores, e.g. movntq, and the data is already in cache, will the store update the cache instead of writing out to memory? Or will it update the cache line and write it out, evicting it? Or what? Here's a fun dilemma. Suppose thread A is loading the cache line containing x and y. Thread B writes to x using an NT store. Thread A writes to y. There's a data race here if B's store to x can be in transit to memory while A's load is happening. If A sees the old value of x, …
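To make the dilemma concrete, here is a minimal sketch (my own illustration, not from the question; _mm_stream_si32 is the related SSE2 intrinsic, emitting movnti rather than movntq). The NT store goes through a write-combining buffer, so the writer needs an sfence before publishing:

    #include <emmintrin.h>  /* SSE2 */

    struct line { int x; int y; };   /* assume x and y share one cache line */

    void writer_b(struct line *p) {
        _mm_stream_si32(&p->x, 42);  /* NT store: may bypass/evict the cached line */
        _mm_sfence();                /* drain WC buffers before signalling thread A */
    }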

Is there a way to subtract packed unsigned doublewords, saturated, on x86, using MMX/SSE?

99封情书 submitted on 2019-12-10 14:48:22
Question: I've been looking at MMX/SSE and I am wondering: there are instructions for packed, saturated subtraction of unsigned bytes and words, but not doublewords. Is there a way of doing what I want, or if not, why is there none? Answer 1: If you have SSE4.1 available, I don't think you can get better than using the pmaxud + psubd approach suggested by @harold. With AVX2, you can of course also use the corresponding 256-bit variants. __m128i subs_epu32_sse4(__m128i a, __m128i b){ __m128i mx = _mm_max…
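A completed version of that truncated snippet (my reconstruction, not the answerer's verbatim code): unsigned saturating subtraction falls out of the identity max(a,b) - b, which equals a - b when a >= b and 0 otherwise.

    #include <smmintrin.h>  /* SSE4.1 */

    /* Per-lane saturating unsigned subtract: max(a,b) - b */
    static __m128i subs_epu32_sse4(__m128i a, __m128i b) {
        __m128i mx = _mm_max_epu32(a, b);  /* pmaxud */
        return _mm_sub_epi32(mx, b);       /* psubd  */
    }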

sse/avx equivalent for neon vuzp

旧城冷巷雨未停 submitted on 2019-12-10 14:44:19
Question: Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size; e.g. the SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

    inputs:      (A0 A1 A2 A3) (B0 B1 B2 B3)
    unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3)

The equivalent of unpack is vzip in ARM's NEON instruction set. However, NEON also provides the operation vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this: inputs: (A0 A1 A2 A3…
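For single-precision floats a two-shuffle equivalent exists (a sketch of my own, using the same (A0 A1 A2 A3)/(B0 B1 B2 B3) inputs as above):

    #include <xmmintrin.h>  /* SSE */

    /* vuzp-style deinterleave: even lanes into one vector, odd lanes into the other */
    static void uzp_ps(__m128 a, __m128 b, __m128 *even, __m128 *odd) {
        *even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* (A0 A2 B0 B2) */
        *odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* (A1 A3 B1 B3) */
    }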

Extracting SSE shuffled 32 bit value with only SSE2

心已入冬 submitted on 2019-12-10 13:49:09
Question: I am trying to extract 4 bytes out of a 128-bit register in an efficient way. The problem is that each value sits in a separate 32-bit lane: {120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0}. I want to transform the 128-bit value into the 32-bit form {120,55,42,120}. The "raw" code looks like the following:

    __m128i byte_result_vec = {120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0};
    unsigned char *byte_result_array = (unsigned char*)&byte_result_vec;
    result_array[x]   = byte_result_array[0];
    result_array[x+1] = byte_result_array[4];
    …
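A common SSE2-only route (a sketch I'm adding, not from the question): narrow the 32-bit lanes to bytes with two pack steps, then move the low dword out. Saturation never triggers here because every value fits in 0..127.

    #include <emmintrin.h>  /* SSE2 */

    static unsigned int pack_low_bytes(__m128i v) {
        __m128i w = _mm_packs_epi32(v, v);          /* 32-bit -> 16-bit, signed saturate  */
        __m128i b = _mm_packus_epi16(w, w);         /* 16-bit -> 8-bit, unsigned saturate */
        return (unsigned int)_mm_cvtsi128_si32(b);  /* {120,55,42,120} as one dword       */
    }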

For an SSE vector that has all the same components, generate on the fly or precompute?

为君一笑 submitted on 2019-12-10 13:34:24
Question: When I need to do a vector operation that has an operand that is just a float broadcast to every component, should I precompute the __m256 or __m128 and load it when I need it, or broadcast the float to the register using _mm_set1_ps every time I need the vector? I have been precomputing the vectors that are very important and highly used, and generating on the fly the ones that are less important. But am I really gaining any speed with precomputing? Is it worth the trouble? Is the _mm…
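For what the trade-off looks like in practice, a minimal sketch (my own illustration): for a loop-invariant float, _mm_set1_ps usually compiles to a single broadcast that the compiler hoists out of the loop, so generating on the fly rarely costs anything per iteration.

    #include <xmmintrin.h>  /* SSE */

    /* n assumed to be a multiple of 4 for brevity */
    void scale(float *dst, const float *src, float s, int n) {
        __m128 vs = _mm_set1_ps(s);  /* one broadcast, hoisted outside the loop */
        for (int i = 0; i < n; i += 4)
            _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vs));
    }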

Need some constructive criticism on my SSE/Assembly attempt

雨燕双飞 submitted on 2019-12-10 13:23:13
Question: I'm working on converting a bit of code to SSE, and while I have the correct output, it turns out to be slower than the standard C++ code. The bit of code that I need to do this for is:

    float ox = p2x - (px * c - py * s)*m;
    float oy = p2y - (px * s - py * c)*m;

What I've got for SSE code is:

    void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
    {
        vector4 r;
        __m128 scale = _mm_set1_ps(m);
        __asm {
            mov eax, p          //Load into CPU reg
            mov ebx, sc
            movups xmm0, [eax]  //move vectors to SSE regs
    …
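An intrinsics-only rewrite of the scalar expression (a sketch under my own naming, not the asker's code) avoids the inline-asm register moves entirely; both outputs fit in one register:

    #include <xmmintrin.h>  /* SSE */

    /* returns {ox, oy, 0, 0}; the products feeding the lanes stay scalar */
    static __m128 calc(float p2x, float p2y, float px, float py,
                       float c, float s, float m) {
        __m128 a = _mm_set_ps(0, 0, px * s, px * c);  /* {px*c, px*s, 0, 0} */
        __m128 b = _mm_set_ps(0, 0, py * c, py * s);  /* {py*s, py*c, 0, 0} */
        __m128 t = _mm_mul_ps(_mm_sub_ps(a, b), _mm_set1_ps(m));
        return _mm_sub_ps(_mm_set_ps(0, 0, p2y, p2x), t);
    }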

Optimizing SSE code

房东的猫 submitted on 2019-12-10 13:19:19
Question: I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough. Unfortunately, my experience with optimizing C code is somewhat limited, so I would love to get some ideas on how to improve the current implementation. The inner…

SSE matrix-matrix multiplication

让人想犯罪 __ submitted on 2019-12-10 11:30:34
Question: I'm having trouble doing matrix-matrix multiplication with SSE in C. Here is what I have so far:

    #define N 1000
    void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {
        int i, j, k;
        __m128i vA, vB, vR;
        for(i = 0; i < N; ++i) {
            for(j = 0; j < N; ++j) {
                vR = _mm_setzero_si128();
                for(k = 0; k < N; k += 4) {
                    //result[i][j] += mat1[i][k] * mat2[k][j];
                    vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                    vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); //how well does the k += 4 work here? Should it
    …
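The load from &mat2[k][j] grabs four consecutive elements of row k, not a column, which is the usual pitfall in this loop shape. One standard fix (a sketch, not any answer's verbatim code; _mm_mullo_epi32 requires SSE4.1) broadcasts one element of mat1 and vectorizes along j instead:

    #include <smmintrin.h>  /* SSE4.1 for _mm_mullo_epi32 */

    void matmulSSE_rows(int mat1[N][N], int mat2[N][N], int result[N][N]) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; j += 4) {
                __m128i vR = _mm_setzero_si128();
                for (int k = 0; k < N; ++k) {
                    __m128i vA = _mm_set1_epi32(mat1[i][k]);             /* broadcast one scalar     */
                    __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); /* 4 contiguous ints, row k */
                    vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                }
                _mm_storeu_si128((__m128i*)&result[i][j], vR);
            }
    }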

Floating-point number vs fixed-point number: speed on Intel I5 CPU

£可爱£侵袭症+ submitted on 2019-12-10 09:22:25
Question: I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc. Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much speed gain can I get? Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations? I…
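For intuition on where a gain could come from (a sketch of my own, not from the question): 16-bit fixed point packs 8 lanes per SSE register versus 4 for float, and SSSE3 offers a one-instruction rounded Q15 multiply; whether that beats OpenBLAS's tuned float kernels would still need benchmarking.

    #include <tmmintrin.h>  /* SSSE3 */

    /* Multiply 8 Q15 fixed-point values per call: roughly (a*b + 0x4000) >> 15 per lane */
    static __m128i mul_q15(__m128i a, __m128i b) {
        return _mm_mulhrs_epi16(a, b);
    }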

How do I perform an 8 x 8 matrix operation using SSE?

泄露秘密 submitted on 2019-12-10 04:13:39
Question: My initial attempt looked like this (suppose we want to multiply):

    __m128 mat[n];             /* rows */
    __m128 vec[n] = {1,1,1,1};
    float outvector[n];
    for (int row = 0; row < n; row++) {
        for (int k = 3; k < 8; k = k + 4) {
            __m128 mrow = mat[k];
            __m128 v = vec[row];
            __m128 sum = _mm_mul_ps(mrow, v);
            sum = _mm_hadd_ps(sum, sum);  /* adds adjacent-two floats */
        }
        _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
    }

But this clearly doesn't work. How do I approach this? I should load 4 at a time… The other question…
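For an 8 x 8 matrix times an 8-vector, one workable shape (my sketch, assuming each row is stored as two 4-float __m128 halves) is a full multiply-accumulate per row with a horizontal reduction at the end:

    #include <pmmintrin.h>  /* SSE3 for _mm_hadd_ps */

    /* mat: 8 rows, each split into two 4-float halves; vec likewise */
    void matvec8(const __m128 mat[8][2], const __m128 vec[2], float out[8]) {
        for (int row = 0; row < 8; ++row) {
            __m128 sum = _mm_add_ps(_mm_mul_ps(mat[row][0], vec[0]),
                                    _mm_mul_ps(mat[row][1], vec[1]));
            sum = _mm_hadd_ps(sum, sum);   /* pairwise add       */
            sum = _mm_hadd_ps(sum, sum);   /* total in all lanes */
            _mm_store_ss(&out[row], sum);
        }
    }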