avx

WebSocket data unmasking / multi-byte XOR

泪湿孤枕 submitted on 2019-12-07 13:38:26
Question: The WebSocket spec defines unmasking data as j = i MOD 4; transformed-octet-i = original-octet-i XOR masking-key-octet-j, where the mask is 4 bytes long and unmasking has to be applied per byte. Is there a way to do this more efficiently than just looping over the bytes? The server running the code can be assumed to be a Haswell CPU, and the OS is Linux with a kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can do asm as well if necessary. I'd tried to look up the solution myself, but was unable to figure …
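
A minimal sketch of the SIMD approach (my own illustration, not code from the thread; the name ws_unmask is made up): broadcast the 4-byte mask into a 128-bit register and XOR 16 bytes per iteration, with a scalar tail for the remainder. Byte order works out on little-endian x86, and the buffer is assumed to start at mask offset 0, matching the j = i MOD 4 formula.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unmask len bytes in place: 16 bytes per SSE2 XOR, then a scalar tail. */
static void ws_unmask(uint8_t *buf, size_t len, const uint8_t mask[4])
{
    uint32_t m32;
    memcpy(&m32, mask, 4);                        /* mask as one 32-bit word     */
    __m128i vmask = _mm_set1_epi32((int)m32);     /* mask repeated 4x = 16 bytes */

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_xor_si128(v, vmask));
    }
    for (; i < len; i++)                          /* remaining 0..15 bytes       */
        buf[i] ^= mask[i & 3];
}
```

The same idea widens to 32 bytes per iteration with _mm256_set1_epi32 / _mm256_xor_si256 on the Haswell target mentioned in the question.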

Horizontal trailing maximum on AVX or SSE

安稳与你 submitted on 2019-12-07 07:07:03
Question: I have an __m256i register consisting of 16-bit values, and I want each trailing element that is zero to take the value of the nearest preceding non-zero element. To give an example: input: 1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2; output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2. Is there an efficient way of doing this on AVX or AVX2? Maybe with log(16) = 4 iterations? Addendum: any solution for 128-bit vectors holding 8 uint16_t values is appreciated as well. Answer 1: You can do this in log_2(SIMD_width) steps indeed. The idea is to …
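
A sketch of the log2(width) shift-and-blend idea for the 128-bit / 8 × uint16_t case mentioned at the end (my own illustration, not the answer's code; the helper name is made up and SSE4.1 is assumed for _mm_blendv_epi8). At each step, every element that is still zero pulls the value from 1, 2, then 4 elements earlier; elements with no preceding non-zero value stay zero.

```c
#include <immintrin.h>

/* Fill each zero element with the nearest preceding non-zero element. */
static inline __m128i fill_trailing_epu16(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i z;

    z = _mm_cmpeq_epi16(v, zero);                     /* 0xFFFF where element is still 0 */
    v = _mm_blendv_epi8(v, _mm_slli_si128(v, 2), z);  /* pull from 1 element earlier     */

    z = _mm_cmpeq_epi16(v, zero);
    v = _mm_blendv_epi8(v, _mm_slli_si128(v, 4), z);  /* pull from 2 elements earlier    */

    z = _mm_cmpeq_epi16(v, zero);
    v = _mm_blendv_epi8(v, _mm_slli_si128(v, 8), z);  /* pull from 4 elements earlier    */

    return v;
}
```

Blending only where an element is still zero is what keeps the nearest preceding value rather than a running maximum, matching the 4 3 0 2 → 4 3 3 2 tail of the example.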

Does .NET Framework 4.5 provide SSE4/AVX support?

…衆ロ難τιáo~ submitted on 2019-12-07 06:30:35
Question: I think I heard about that, but I don't know where. Update: I was asking about the JIT. Answer 1: It seems that it is coming (I just found out an hour ago). Here are a few links: The JIT finally proposed; JIT and SIMD are getting married; Update to SIMD Support. You need the latest version of RyuJIT and the Microsoft SIMD-enabled Vector Types (NuGet) package. Answer 2: No, there's no scenario in .NET where you can write machine code yourself. Code generation is entirely up to the just-in-time compiler. It is certainly capable of …

Storing individual doubles from a packed double vector using Intel AVX

空扰寡人 submitted on 2019-12-07 05:59:28
Question: I'm writing code using the C intrinsics for Intel's AVX instructions. If I have a packed double vector (a __m256d), what would be the most efficient way (i.e. the fewest operations) to store each of its elements to a different place in memory (i.e. I need to fan them out to different locations so that they are no longer packed)? Pseudocode: __m256d *src; double *dst; int dst_dist; dst[0] = src[0]; dst[dst_dist] = src[1]; dst[2 * dst_dist] = src[2]; dst[3 * dst_dist] = src[3]; Using SSE, I …
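
One common pattern (a sketch of an approach I'd assume here, not necessarily the accepted answer; scatter4_pd is a made-up name): split the 256-bit vector into its two 128-bit halves and store the low/high double of each half separately, four stores plus one extract in total.

```c
#include <immintrin.h>

/* Store the four doubles of v to dst, dst+dst_dist, dst+2*dst_dist, dst+3*dst_dist. */
static inline void scatter4_pd(double *dst, long dst_dist, __m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);    /* elements 0,1 (cast, no instruction) */
    __m128d hi = _mm256_extractf128_pd(v, 1);  /* elements 2,3                        */

    _mm_storel_pd(dst + 0 * dst_dist, lo);     /* element 0 */
    _mm_storeh_pd(dst + 1 * dst_dist, lo);     /* element 1 */
    _mm_storel_pd(dst + 2 * dst_dist, hi);     /* element 2 */
    _mm_storeh_pd(dst + 3 * dst_dist, hi);     /* element 3 */
}
```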

SIMD matmul program gives different numerical results

对着背影说爱祢 submitted on 2019-12-07 05:22:16
Question: I am trying to program matrix multiplication in C using SIMD intrinsics. I was pretty sure of my implementation, but when I execute it, I get some numerical errors starting from the 5th digit of the resulting matrix's coefficients. REAL_T is just a float defined with a typedef. /* This is my matmul version with SIMD, using single-precision floats */ void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){ int i,j,k; __m256 vA, vB, vC, vRes; for (i=0; i<n; i++){ for (j=0; j<n; j++){ for (k=0; k<n; k= k+8){ …
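
The usual explanation for this symptom is that float addition is not associative, so the vectorized k-loop sums the products in a different order than the scalar reference and rounds differently. A tiny standalone demonstration (values chosen arbitrarily):

```c
#include <stdio.h>

int main(void)
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("%.7g\n", (a + b) + c);   /* prints 1: a and b cancel first       */
    printf("%.7g\n", a + (b + c));   /* prints 0: c is absorbed into b's ulp */
    return 0;
}
```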

Math functions take more cycles after running any Intel AVX function [duplicate]

一个人想着一个人 submitted on 2019-12-07 03:19:38
Question: This question already has an answer here: Using AVX instructions disables exp() optimization? (1 answer). Closed 5 years ago. I've noticed that math functions (like ceil, round, ...) take more CPU cycles after running any Intel AVX function. See the following example: #include <stdio.h> #include <math.h> #include <immintrin.h> static unsigned long int get_rdtsc(void) { unsigned int a, d; asm volatile("rdtsc" : "=a" (a), "=d" (d)); return (((unsigned long int)a) | (((unsigned long int)d) << 32)); …
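
Assuming the cause is the AVX-to-SSE transition penalty described in the linked duplicate, the usual fix is to clear the upper YMM halves before returning to SSE-encoded library code. A sketch (the function name is made up):

```c
#include <immintrin.h>
#include <math.h>
#include <stddef.h>

void after_avx_work(double *x, size_t n)
{
    /* ... 256-bit AVX intrinsics on x here ... */

    _mm256_zeroupper();              /* emits VZEROUPPER: dirty upper YMM state is cleared   */

    for (size_t i = 0; i < n; i++)   /* libm's ceil no longer pays a transition penalty here */
        x[i] = ceil(x[i]);
}
```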

Is there, or will there be, a “global” version of the target_clones attribute?

末鹿安然 submitted on 2019-12-07 02:57:39
Question: I've recently played around with the target_clones attribute available from GCC 6.1 onward. It's quite nifty, but, for now, it requires a somewhat clumsy approach: every function that one wants multi-versioned has to have the attribute declared manually. This is less than optimal because it puts compiler-specific stuff in the code, and it requires the developer to identify which functions should receive this treatment. Let's take the example where I want to compile some code that will take …
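
For context, this is the per-function form the question calls clumsy (real GCC ≥ 6.1 syntax; the function itself is just a made-up example): the compiler emits one clone per listed target plus a resolver that picks one at load time.

```c
/* Per-function multi-versioning: one clone per target, chosen via an ifunc resolver. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```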

Matrix-vector-multiplication in AVX not proportionately faster than in SSE

ぐ巨炮叔叔 submitted on 2019-12-07 02:47:07
Question: I was writing a matrix-vector multiplication in both SSE and AVX using the following: for(size_t i=0;i<M;i++) { size_t index = i*N; __m128 a, x, r1; __m128 sum = _mm_setzero_ps(); for(size_t j=0;j<N;j+=4,index+=4) { a = _mm_load_ps(&A[index]); x = _mm_load_ps(&X[j]); r1 = _mm_mul_ps(a,x); sum = _mm_add_ps(r1,sum); } sum = _mm_hadd_ps(sum,sum); sum = _mm_hadd_ps(sum,sum); _mm_store_ss(&C[i],sum); } I used a similar method for AVX; however, at the end, since AVX doesn't have an equivalent …
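
For the AVX version, the missing piece is the final horizontal sum of a __m256. One common pattern (my own sketch, not code from the question or answers): add the two 128-bit lanes first, then reduce the 128-bit result exactly as in the SSE version.

```c
#include <immintrin.h>

static inline float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);     /* low lane (cast, no instruction) */
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* high lane                       */
    __m128 s  = _mm_add_ps(lo, hi);            /* 4 partial sums                  */
    s = _mm_hadd_ps(s, s);                     /* 2 partial sums                  */
    s = _mm_hadd_ps(s, s);                     /* final sum in element 0          */
    return _mm_cvtss_f32(s);
}
```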

Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

徘徊边缘 submitted on 2019-12-06 22:02:18
Question: I understand it's important to use VZEROUPPER when mixing SSE and AVX code, but what if I only use AVX (and ordinary x86-64 code) without using any legacy SSE instructions? If I never use a single SSE instruction in my code, is there any performance reason why I would ever need to use VZEROUPPER? This is assuming I'm not calling into any external libraries (that might be using SSE). Answer 1: You're correct that if your whole program doesn't use any non-VEX instructions that write xmm registers, …

Fastest way to expand bits in a field to all (overlapping + adjacent) set bits in a mask?

天涯浪子 submitted on 2019-12-06 19:03:03
Question: Say I have two binary inputs named IN and MASK. The actual field size could be 32 to 256 bits, depending on what instruction set is used to accomplish the task. Both inputs change every call. Inputs: IN = ...1100010010010100...; MASK = ...0001111010111011...; Output: OUT = ...0001111010111000... Edit: another example (a result from some comment discussion): IN = ...11111110011010110...; MASK = ...01011011001111110...; Output: OUT = ...01011011001111110... I want to get the contiguous adjacent 1 bits of MASK …
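
A plain scalar reference of the operation as described (my own correctness baseline, not the thread's vectorized answer; the function name is made up): every contiguous run of 1-bits in MASK that overlaps at least one 1-bit of IN is kept whole, every other run is cleared.

```c
#include <stdint.h>

static uint64_t expand_to_mask_groups(uint64_t in, uint64_t mask)
{
    uint64_t out = 0;
    while (mask) {
        uint64_t low   = mask & (0 - mask);     /* lowest set bit of the remaining mask  */
        uint64_t group = mask & ~(mask + low);  /* lowest contiguous run of 1s in mask   */
        if (group & in)                         /* does IN touch this run anywhere?      */
            out |= group;                       /* keep the whole run                    */
        mask ^= group;                          /* drop the run, move to the next one    */
    }
    return out;
}
```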