avx

WebSocket data unmasking / multi-byte XOR

泪湿孤枕 submitted on 2019-12-07 13:38:26
Question: The WebSocket spec defines unmasking data as j = i MOD 4; transformed-octet-i = original-octet-i XOR masking-key-octet-j, where the mask is 4 bytes long and unmasking has to be applied per byte. Is there a way to do this more efficiently than just looping over the bytes? The server running the code can be assumed to be a Haswell CPU, and the OS is Linux with a kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can do asm as well if necessary. I'd tried to look up the solution myself, but was unable to figure …
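
A minimal sketch of the SIMD approach (my own illustration, not code from the thread; the name ws_unmask is made up): broadcast the 4-byte mask into a 128-bit register and XOR 16 bytes per iteration, with a scalar tail for the remainder. Byte order works out on little-endian x86, and the buffer is assumed to start at mask offset 0, matching the j = i MOD 4 formula.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unmask len bytes in place: 16 bytes per SSE2 XOR, then a scalar tail. */
static void ws_unmask(uint8_t *buf, size_t len, const uint8_t mask[4])
{
    uint32_t m32;
    memcpy(&m32, mask, 4);                        /* mask as one 32-bit word     */
    __m128i vmask = _mm_set1_epi32((int)m32);     /* mask repeated 4x = 16 bytes */

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_xor_si128(v, vmask));
    }
    for (; i < len; i++)                          /* remaining 0..15 bytes       */
        buf[i] ^= mask[i & 3];
}
```

The same idea widens to 32 bytes per iteration with _mm256_set1_epi32 / _mm256_xor_si256 on the Haswell target mentioned in the question.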

Horizontal trailing maximum on AVX or SSE

安稳与你 submitted on 2019-12-07 07:07:03
Question: I have an __m256i register consisting of 16-bit values, and I want each trailing element that is zero to take the value of the nearest preceding non-zero element. To give an example: input: 1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2; output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2. Is there an efficient way of doing this on AVX or AVX2? Maybe with log(16) = 4 iterations? Addendum: any solution for 128-bit vectors holding 8 uint16_t values is appreciated as well. Answer 1: You can do this in log_2(SIMD_width) steps indeed. The idea is to …
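
A sketch of the log2(width) shift-and-blend idea for the 128-bit / 8 × uint16_t case mentioned at the end (my own illustration, not the answer's code; the helper name is made up and SSE4.1 is assumed for _mm_blendv_epi8). At each step, every element that is still zero pulls the value from 1, 2, then 4 elements earlier; elements with no preceding non-zero value stay zero.

```c
#include <immintrin.h>

/* Fill each zero element with the nearest preceding non-zero element. */
static inline __m128i fill_trailing_epu16(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i z;

    z = _mm_cmpeq_epi16(v, zero);                     /* 0xFFFF where element is still 0 */
    v = _mm_blendv_epi8(v, _mm_slli_si128(v, 2), z);  /* pull from 1 element earlier     */

    z = _mm_cmpeq_epi16(v, zero);
    v = _mm_blendv_epi8(v, _mm_slli_si128(v, 4), z);  /* pull from 2 elements earlier    */

    z = _mm_cmpeq_epi16(v, zero);
    v = _mm_blendv_epi8(v, _mm_slli_si128(v, 8), z);  /* pull from 4 elements earlier    */

    return v;
}
```

Blending only where an element is still zero is what keeps the nearest preceding value rather than a running maximum, matching the 4 3 0 2 → 4 3 3 2 tail of the example.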

Does .NET Framework 4.5 provide SSE4/AVX support?

…衆ロ難τιáo~ submitted on 2019-12-07 06:30:35
Question: I think I heard about that, but I don't know where. Update: I was asking about the JIT. Answer 1: It seems that it is coming (I just found out an hour ago). Here are a few links: The JIT finally proposed; JIT and SIMD are getting married; Update to SIMD Support. You need the latest version of RyuJIT and the Microsoft SIMD-enabled Vector Types (NuGet) package. Answer 2: No, there's no scenario in .NET where you can write machine code yourself. Code generation is entirely up to the just-in-time compiler. It is certainly capable of …

Storing individual doubles from a packed double vector using Intel AVX

空扰寡人 submitted on 2019-12-07 05:59:28
Question: I'm writing code using the C intrinsics for Intel's AVX instructions. If I have a packed double vector (a __m256d), what would be the most efficient way (i.e. the fewest operations) to store each of its elements to a different place in memory (i.e. I need to fan them out to different locations so that they are no longer packed)? Pseudocode: __m256d *src; double *dst; int dst_dist; dst[0] = src[0]; dst[dst_dist] = src[1]; dst[2 * dst_dist] = src[2]; dst[3 * dst_dist] = src[3]; Using SSE, I …
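
One common pattern (a sketch of an approach I'd assume here, not necessarily the accepted answer; scatter4_pd is a made-up name): split the 256-bit vector into its two 128-bit halves and store the low/high double of each half separately, four stores plus one extract in total.

```c
#include <immintrin.h>

/* Store the four doubles of v to dst, dst+dst_dist, dst+2*dst_dist, dst+3*dst_dist. */
static inline void scatter4_pd(double *dst, long dst_dist, __m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);    /* elements 0,1 (cast, no instruction) */
    __m128d hi = _mm256_extractf128_pd(v, 1);  /* elements 2,3                        */

    _mm_storel_pd(dst + 0 * dst_dist, lo);     /* element 0 */
    _mm_storeh_pd(dst + 1 * dst_dist, lo);     /* element 1 */
    _mm_storel_pd(dst + 2 * dst_dist, hi);     /* element 2 */
    _mm_storeh_pd(dst + 3 * dst_dist, hi);     /* element 3 */
}
```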

SIMD matmul program gives different numerical results

对着背影说爱祢 submitted on 2019-12-07 05:22:16
Question: I am trying to program matrix multiplication in C using SIMD intrinsics. I was pretty sure of my implementation, but when I execute it, I get some numerical errors starting from the 5th digit of the resulting matrix's coefficients. REAL_T is just a float defined with a typedef. /* This is my matmul version with SIMD, using single-precision floats */ void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){ int i,j,k; __m256 vA, vB, vC, vRes; for (i=0; i<n; i++){ for (j=0; j<n; j++){ for (k=0; k<n; k= k+8){ …
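
The usual explanation for this symptom is that float addition is not associative, so the vectorized k-loop sums the products in a different order than the scalar reference and rounds differently. A tiny standalone demonstration (values chosen arbitrarily):

```c
#include <stdio.h>

int main(void)
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("%.7g\n", (a + b) + c);   /* prints 1: a and b cancel first       */
    printf("%.7g\n", a + (b + c));   /* prints 0: c is absorbed into b's ulp */
    return 0;
}
```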

Math functions take more cycles after running any Intel AVX function [duplicate]

一个人想着一个人 submitted on 2019-12-07 03:19:38
Question: This question already has an answer here: Using AVX instructions disables exp() optimization? (1 answer). Closed 5 years ago. I've noticed that math functions (like ceil, round, ...) take more CPU cycles after running any Intel AVX function. See the following example: #include <stdio.h> #include <math.h> #include <immintrin.h> static unsigned long int get_rdtsc(void) { unsigned int a, d; asm volatile("rdtsc" : "=a" (a), "=d" (d)); return (((unsigned long int)a) | (((unsigned long int)d) << 32)); …
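
Assuming the cause is the AVX-to-SSE transition penalty described in the linked duplicate, the usual fix is to clear the upper YMM halves before returning to SSE-encoded library code. A sketch (the function name is made up):

```c
#include <immintrin.h>
#include <math.h>
#include <stddef.h>

void after_avx_work(double *x, size_t n)
{
    /* ... 256-bit AVX intrinsics on x here ... */

    _mm256_zeroupper();              /* emits VZEROUPPER: dirty upper YMM state is cleared   */

    for (size_t i = 0; i < n; i++)   /* libm's ceil no longer pays a transition penalty here */
        x[i] = ceil(x[i]);
}
```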

Is there, or will there be, a “global” version of the target_clones attribute?

末鹿安然 submitted on 2019-12-07 02:57:39
Question: I've recently played around with the target_clones attribute available from GCC 6.1 onward. It's quite nifty, but, for now, it requires a somewhat clumsy approach: every function that one wants multi-versioned has to have the attribute declared manually. This is less than optimal because it puts compiler-specific stuff in the code, and it requires the developer to identify which functions should receive this treatment. Let's take the example where I want to compile some code that will take …
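
For context, this is the per-function form the question calls clumsy (real GCC ≥ 6.1 syntax; the function itself is just a made-up example): the compiler emits one clone per listed target plus a resolver that picks one at load time.

```c
/* Per-function multi-versioning: one clone per target, chosen via an ifunc resolver. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```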

Matrix-vector-multiplication in AVX not proportionately faster than in SSE

ぐ巨炮叔叔 submitted on 2019-12-07 02:47:07
Question: I was writing a matrix-vector multiplication in both SSE and AVX using the following: for(size_t i=0;i<M;i++) { size_t index = i*N; __m128 a, x, r1; __m128 sum = _mm_setzero_ps(); for(size_t j=0;j<N;j+=4,index+=4) { a = _mm_load_ps(&A[index]); x = _mm_load_ps(&X[j]); r1 = _mm_mul_ps(a,x); sum = _mm_add_ps(r1,sum); } sum = _mm_hadd_ps(sum,sum); sum = _mm_hadd_ps(sum,sum); _mm_store_ss(&C[i],sum); } I used a similar method for AVX; however, at the end, since AVX doesn't have an equivalent …
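
For the AVX version, the missing piece is the final horizontal sum of a __m256. One common pattern (my own sketch, not code from the question or answers): add the two 128-bit lanes first, then reduce the 128-bit result exactly as in the SSE version.

```c
#include <immintrin.h>

static inline float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);     /* low lane (cast, no instruction) */
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* high lane                       */
    __m128 s  = _mm_add_ps(lo, hi);            /* 4 partial sums                  */
    s = _mm_hadd_ps(s, s);                     /* 2 partial sums                  */
    s = _mm_hadd_ps(s, s);                     /* final sum in element 0          */
    return _mm_cvtss_f32(s);
}
```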

Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

徘徊边缘 submitted on 2019-12-06 22:02:18
Question: I understand it's important to use VZEROUPPER when mixing SSE and AVX code, but what if I only use AVX (and ordinary x86-64 code) without using any legacy SSE instructions? If I never use a single SSE instruction in my code, is there any performance reason why I would ever need to use VZEROUPPER? This is assuming I'm not calling into any external libraries (that might be using SSE). Answer 1: You're correct that if your whole program doesn't use any non-VEX instructions that write xmm registers, …

Fastest way to expand bits in a field to all (overlapping + adjacent) set bits in a mask?

天涯浪子 submitted on 2019-12-06 19:03:03
Question: Say I have two binary inputs named IN and MASK. The actual field size could be 32 to 256 bits, depending on what instruction set is used to accomplish the task. Both inputs change every call. Inputs: IN = ...1100010010010100...; MASK = ...0001111010111011...; Output: OUT = ...0001111010111000... Edit: another example (a result from some comment discussion): IN = ...11111110011010110...; MASK = ...01011011001111110...; Output: OUT = ...01011011001111110... I want to get the contiguous adjacent 1 bits of MASK …
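
A plain scalar reference of the operation as described (my own correctness baseline, not the thread's vectorized answer; the function name is made up): every contiguous run of 1-bits in MASK that overlaps at least one 1-bit of IN is kept whole, every other run is cleared.

```c
#include <stdint.h>

static uint64_t expand_to_mask_groups(uint64_t in, uint64_t mask)
{
    uint64_t out = 0;
    while (mask) {
        uint64_t low   = mask & (0 - mask);     /* lowest set bit of the remaining mask  */
        uint64_t group = mask & ~(mask + low);  /* lowest contiguous run of 1s in mask   */
        if (group & in)                         /* does IN touch this run anywhere?      */
            out |= group;                       /* keep the whole run                    */
        mask ^= group;                          /* drop the run, move to the next one    */
    }
    return out;
}
```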