sse

What happens with a non-temporal store if the data is already in cache?

筅森魡賤 submitted on 2019-12-10 15:57:48
Question: When you use non-temporal stores, e.g. movntq, and the data is already in cache, will the store update the cache instead of writing out to memory? Or will it update the cache line and write it out, evicting it? Or what? Here's a fun dilemma. Suppose thread A is loading the cache line containing x and y. Thread B writes to x using an NT store. Thread A writes to y. There's a data race here if B's store to x can be in transit to memory while A's load is happening. If A sees the old value of x, …
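To make the dilemma concrete, here is a minimal sketch (my own illustration, not from the question; _mm_stream_si32 is the related SSE2 intrinsic, emitting movnti rather than movntq). The NT store goes through a write-combining buffer, so the writer needs an sfence before publishing:

    #include <emmintrin.h>  /* SSE2 */

    struct line { int x; int y; };   /* assume x and y share one cache line */

    void writer_b(struct line *p) {
        _mm_stream_si32(&p->x, 42);  /* NT store: may bypass/evict the cached line */
        _mm_sfence();                /* drain WC buffers before signalling thread A */
    }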

Is there a way to subtract packed unsigned doublewords, saturated, on x86, using MMX/SSE?

99封情书 submitted on 2019-12-10 14:48:22
Question: I've been looking at MMX/SSE and I am wondering: there are instructions for packed, saturated subtraction of unsigned bytes and words, but not doublewords. Is there a way of doing what I want, or if not, why is there none? Answer 1: If you have SSE4.1 available, I don't think you can get better than using the pmaxud + psubd approach suggested by @harold. With AVX2, you can of course also use the corresponding 256-bit variants. __m128i subs_epu32_sse4(__m128i a, __m128i b){ __m128i mx = _mm_max…
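A completed version of that truncated snippet (my reconstruction, not the answerer's verbatim code): unsigned saturating subtraction falls out of the identity max(a,b) - b, which equals a - b when a >= b and 0 otherwise.

    #include <smmintrin.h>  /* SSE4.1 */

    /* Per-lane saturating unsigned subtract: max(a,b) - b */
    static __m128i subs_epu32_sse4(__m128i a, __m128i b) {
        __m128i mx = _mm_max_epu32(a, b);  /* pmaxud */
        return _mm_sub_epi32(mx, b);       /* psubd  */
    }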

sse/avx equivalent for neon vuzp

旧城冷巷雨未停 submitted on 2019-12-10 14:44:19
Question: Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size; e.g. the SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

    inputs:      (A0 A1 A2 A3) (B0 B1 B2 B3)
    unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3)

The equivalent of unpack is vzip in ARM's NEON instruction set. However, NEON also provides the operation vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this: inputs: (A0 A1 A2 A3…
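For single-precision floats a two-shuffle equivalent exists (a sketch of my own, using the same (A0 A1 A2 A3)/(B0 B1 B2 B3) inputs as above):

    #include <xmmintrin.h>  /* SSE */

    /* vuzp-style deinterleave: even lanes into one vector, odd lanes into the other */
    static void uzp_ps(__m128 a, __m128 b, __m128 *even, __m128 *odd) {
        *even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* (A0 A2 B0 B2) */
        *odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* (A1 A3 B1 B3) */
    }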

Extracting SSE shuffled 32 bit value with only SSE2

心已入冬 submitted on 2019-12-10 13:49:09
Question: I am trying to extract 4 bytes out of a 128-bit register in an efficient way. The problem is that each value sits in a separate 32-bit lane: {120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0}. I want to transform the 128-bit value into the 32-bit form {120,55,42,120}. The "raw" code looks like the following:

    __m128i byte_result_vec = {120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0};
    unsigned char *byte_result_array = (unsigned char*)&byte_result_vec;
    result_array[x]   = byte_result_array[0];
    result_array[x+1] = byte_result_array[4];
    …
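A common SSE2-only route (a sketch I'm adding, not from the question): narrow the 32-bit lanes to bytes with two pack steps, then move the low dword out. Saturation never triggers here because every value fits in 0..127.

    #include <emmintrin.h>  /* SSE2 */

    static unsigned int pack_low_bytes(__m128i v) {
        __m128i w = _mm_packs_epi32(v, v);          /* 32-bit -> 16-bit, signed saturate  */
        __m128i b = _mm_packus_epi16(w, w);         /* 16-bit -> 8-bit, unsigned saturate */
        return (unsigned int)_mm_cvtsi128_si32(b);  /* {120,55,42,120} as one dword       */
    }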

For an SSE vector that has all the same components, generate on the fly or precompute?

为君一笑 submitted on 2019-12-10 13:34:24
Question: When I need to do a vector operation that has an operand that is just a float broadcast to every component, should I precompute the __m256 or __m128 and load it when I need it, or broadcast the float to the register using _mm_set1_ps every time I need the vector? I have been precomputing the vectors that are very important and highly used, and generating on the fly the ones that are less important. But am I really gaining any speed with precomputing? Is it worth the trouble? Is the _mm…
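For what the trade-off looks like in practice, a minimal sketch (my own illustration): for a loop-invariant float, _mm_set1_ps usually compiles to a single broadcast that the compiler hoists out of the loop, so generating on the fly rarely costs anything per iteration.

    #include <xmmintrin.h>  /* SSE */

    /* n assumed to be a multiple of 4 for brevity */
    void scale(float *dst, const float *src, float s, int n) {
        __m128 vs = _mm_set1_ps(s);  /* one broadcast, hoisted outside the loop */
        for (int i = 0; i < n; i += 4)
            _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vs));
    }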

Need some constructive criticism on my SSE/Assembly attempt

雨燕双飞 submitted on 2019-12-10 13:23:13
Question: I'm working on converting a bit of code to SSE, and while I have the correct output, it turns out to be slower than the standard C++ code. The bit of code that I need to do this for is:

    float ox = p2x - (px * c - py * s)*m;
    float oy = p2y - (px * s - py * c)*m;

What I've got for SSE code is:

    void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
    {
        vector4 r;
        __m128 scale = _mm_set1_ps(m);
        __asm {
            mov eax, p          //Load into CPU reg
            mov ebx, sc
            movups xmm0, [eax]  //move vectors to SSE regs
    …
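An intrinsics-only rewrite of the scalar expression (a sketch under my own naming, not the asker's code) avoids the inline-asm register moves entirely; both outputs fit in one register:

    #include <xmmintrin.h>  /* SSE */

    /* returns {ox, oy, 0, 0}; the products feeding the lanes stay scalar */
    static __m128 calc(float p2x, float p2y, float px, float py,
                       float c, float s, float m) {
        __m128 a = _mm_set_ps(0, 0, px * s, px * c);  /* {px*c, px*s, 0, 0} */
        __m128 b = _mm_set_ps(0, 0, py * c, py * s);  /* {py*s, py*c, 0, 0} */
        __m128 t = _mm_mul_ps(_mm_sub_ps(a, b), _mm_set1_ps(m));
        return _mm_sub_ps(_mm_set_ps(0, 0, p2y, p2x), t);
    }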

Optimizing SSE code

房东的猫 submitted on 2019-12-10 13:19:19
Question: I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough. Unfortunately, my experience with optimizing C code is somewhat limited, so I would love to get some ideas on how to improve the current implementation. The inner…

SSE matrix-matrix multiplication

让人想犯罪 __ submitted on 2019-12-10 11:30:34
Question: I'm having trouble doing matrix-matrix multiplication with SSE in C. Here is what I have so far:

    #define N 1000
    void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {
        int i, j, k;
        __m128i vA, vB, vR;
        for(i = 0; i < N; ++i) {
            for(j = 0; j < N; ++j) {
                vR = _mm_setzero_si128();
                for(k = 0; k < N; k += 4) {
                    //result[i][j] += mat1[i][k] * mat2[k][j];
                    vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                    vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); //how well does the k += 4 work here? Should it
    …
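The load from &mat2[k][j] grabs four consecutive elements of row k, not a column, which is the usual pitfall in this loop shape. One standard fix (a sketch, not any answer's verbatim code; _mm_mullo_epi32 requires SSE4.1) broadcasts one element of mat1 and vectorizes along j instead:

    #include <smmintrin.h>  /* SSE4.1 for _mm_mullo_epi32 */

    void matmulSSE_rows(int mat1[N][N], int mat2[N][N], int result[N][N]) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; j += 4) {
                __m128i vR = _mm_setzero_si128();
                for (int k = 0; k < N; ++k) {
                    __m128i vA = _mm_set1_epi32(mat1[i][k]);             /* broadcast one scalar     */
                    __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); /* 4 contiguous ints, row k */
                    vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                }
                _mm_storeu_si128((__m128i*)&result[i][j], vR);
            }
    }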

Floating-point number vs fixed-point number: speed on Intel I5 CPU

£可爱£侵袭症+ submitted on 2019-12-10 09:22:25
Question: I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc. Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much speed gain can I get? Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations? I…
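For intuition on where a gain could come from (a sketch of my own, not from the question): 16-bit fixed point packs 8 lanes per SSE register versus 4 for float, and SSSE3 offers a one-instruction rounded Q15 multiply; whether that beats OpenBLAS's tuned float kernels would still need benchmarking.

    #include <tmmintrin.h>  /* SSSE3 */

    /* Multiply 8 Q15 fixed-point values per call: roughly (a*b + 0x4000) >> 15 per lane */
    static __m128i mul_q15(__m128i a, __m128i b) {
        return _mm_mulhrs_epi16(a, b);
    }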

How do I perform an 8 x 8 matrix operation using SSE?

泄露秘密 submitted on 2019-12-10 04:13:39
Question: My initial attempt looked like this (suppose we want to multiply):

    __m128 mat[n];             /* rows */
    __m128 vec[n] = {1,1,1,1};
    float outvector[n];
    for (int row = 0; row < n; row++) {
        for (int k = 3; k < 8; k = k + 4) {
            __m128 mrow = mat[k];
            __m128 v = vec[row];
            __m128 sum = _mm_mul_ps(mrow, v);
            sum = _mm_hadd_ps(sum, sum);  /* adds adjacent-two floats */
        }
        _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
    }

But this clearly doesn't work. How do I approach this? I should load 4 at a time… The other question…
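For an 8 x 8 matrix times an 8-vector, one workable shape (my sketch, assuming each row is stored as two 4-float __m128 halves) is a full multiply-accumulate per row with a horizontal reduction at the end:

    #include <pmmintrin.h>  /* SSE3 for _mm_hadd_ps */

    /* mat: 8 rows, each split into two 4-float halves; vec likewise */
    void matvec8(const __m128 mat[8][2], const __m128 vec[2], float out[8]) {
        for (int row = 0; row < 8; ++row) {
            __m128 sum = _mm_add_ps(_mm_mul_ps(mat[row][0], vec[0]),
                                    _mm_mul_ps(mat[row][1], vec[1]));
            sum = _mm_hadd_ps(sum, sum);   /* pairwise add       */
            sum = _mm_hadd_ps(sum, sum);   /* total in all lanes */
            _mm_store_ss(&out[row], sum);
        }
    }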