intrinsics | 易学教程

AVX2 byte gather with uint16 indices, into a __m256i

Question: I am trying to pack a __m256i variable with 32 chars from an array, selected by indices. Here is my code: char array[]; // different array every time. uint16_t offset[32]; // same offset reused many times _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]], array[offset[7]], array[offset[8]], array[offset[9]], array[offset[10]], array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]],
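As a baseline for the gather described in the question, the 32 bytes can be collected with scalar loads into a small buffer and then loaded into a vector in one step. This is only a sketch, not the answer from the original thread: the function names `gather32_bytes` and `gather32_epi8` are hypothetical, and the vector load is compiled only when AVX is enabled.

```c
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Scalar gather: out[i] = array[offset[i]] for i = 0..31.
   Hypothetical helper; the caller guarantees every offset is in bounds. */
static void gather32_bytes(const char *array, const uint16_t offset[32],
                           uint8_t out[32])
{
    for (int i = 0; i < 32; ++i)
        out[i] = (uint8_t)array[offset[i]];
}

#if defined(__AVX__)
/* Same gather, returned as a vector. Byte i of the result (counting from
   the least significant byte) holds array[offset[i]]. */
static __m256i gather32_epi8(const char *array, const uint16_t offset[32])
{
    uint8_t tmp[32];
    gather32_bytes(array, offset, tmp);
    return _mm256_loadu_si256((const __m256i *)tmp);
}
#endif
```

Note that `_mm256_set_epi8` takes its arguments from the most significant byte down, so the code in the question places `array[offset[0]]` in the highest byte, while this sketch places it in the lowest. Reverse the offset table once if the original order matters.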

AVX/SSE round floats down and return vector of ints?

Question: Is there a way, using AVX/SSE, to take a vector of floats, round down, and produce a vector of ints? All the floor intrinsics seem to produce a final vector of floating point, which is odd because rounding produces an integer! Answer 1: SSE has conversion from FP to integer with your choice of truncation (towards zero) or the current rounding mode (normally the IEEE default mode: round to nearest, with ties rounding to even, like nearbyint() and unlike round(), where ties round away from 0). If
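One way to get a floor-to-integer result from the truncating conversion mentioned in the answer is to truncate and then subtract 1 in the lanes where truncation rounded up (negative non-integers). The sketch below uses only SSE2; the names `floor_ps_epi32` and `floor4` are hypothetical, and inputs are assumed to fit in a 32-bit signed integer.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* floor(x) as packed int32, built from the truncating conversion.
   Assumes every lane of x is finite and within int32 range. */
static __m128i floor_ps_epi32(__m128 x)
{
    __m128i t = _mm_cvttps_epi32(x);        /* truncate toward zero */
    __m128 tf = _mm_cvtepi32_ps(t);         /* back to float for the compare */
    __m128 too_big = _mm_cmpgt_ps(tf, x);   /* all-ones where trunc(x) > x */
    /* An all-ones lane is the integer -1, so adding the mask subtracts 1. */
    return _mm_add_epi32(t, _mm_castps_si128(too_big));
}

/* Array wrapper around the vector function, for convenience. */
static void floor4(const float in[4], int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, floor_ps_epi32(_mm_loadu_ps(in)));
}
```

With SSE4.1 available, `_mm_floor_ps` followed by `_mm_cvttps_epi32` is a shorter alternative.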

Matrix-Vector and Matrix-Matrix multiplication using SSE

Question: I need to write matrix-vector and matrix-matrix multiplication functions, but I cannot wrap my head around the SSE instructions. The dimensions of the matrices and vectors are always multiples of 4. I managed to write the vector-vector multiplication function, which looks like this: void vector_multiplication_SSE(float* m, float* n, float* result, unsigned const int size) { int i; __declspec(align(16))__m128 *p_m = (__m128*)m; __declspec(align(16))__m128 *p_n = (__m128*)n; __declspec(align(16))__m128 *p
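A matrix-vector product can be built from the same multiply-and-accumulate pattern: each output element is the dot product of one matrix row with the vector. The following is a minimal sketch, assuming a row-major matrix and a column count that is a multiple of 4 as stated in the question; the name `matvec_sse` and its parameters are hypothetical, and unaligned loads are used so no alignment is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* y = A * x, with A stored row-major as rows x cols floats. */
static void matvec_sse(const float *a, const float *x, float *y,
                       size_t rows, size_t cols)
{
    assert(cols % 4 == 0);
    for (size_t r = 0; r < rows; ++r) {
        __m128 acc = _mm_setzero_ps();
        for (size_t c = 0; c < cols; c += 4) {
            __m128 va = _mm_loadu_ps(a + r * cols + c);
            __m128 vx = _mm_loadu_ps(x + c);
            acc = _mm_add_ps(acc, _mm_mul_ps(va, vx));  /* 4 partial sums */
        }
        /* Horizontal sum of the 4 partial sums. */
        __m128 hi = _mm_movehl_ps(acc, acc);    /* lanes 2,3 moved to 0,1 */
        __m128 sum2 = _mm_add_ps(acc, hi);      /* lane0 = p0+p2, lane1 = p1+p3 */
        __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
        y[r] = _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
    }
}
```

A matrix-matrix product can reuse this by treating each column of the right-hand matrix as a vector, although a tuned version would keep the right-hand operand in a layout that allows contiguous loads.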

Use both SSE2 intrinsics and gcc inline assembler

Question: I have tried to mix SSE2 intrinsics and inline assembler in gcc. But if I specify a variable as an xmm0 register input, then in some cases I get a compiler error. Example: #include <emmintrin.h> int main() { __m128i test = _mm_setzero_si128(); asm ("pxor %%xmm0, %%xmm0" : : "xmm0" (test) : ); } When compiled with gcc version 4.6.1 I get: >gcc asm_xmm.c asm_xmm.c: In function ‘main’: asm_xmm.c:10:3: error: matching constraint references invalid operand number asm_xmm.c:7:5: error: matching
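In GCC extended asm, register names such as "xmm0" are not operand constraints; the x86 machine constraint for an SSE register is "x", and the compiler then chooses the register. A hedged sketch of the same pxor written that way follows. It assumes GCC or Clang on x86 with the default AT&T syntax, and the helper names `xor_via_asm` and `xor_low32` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* a ^= b using an explicit pxor.
   "+x": a is read and written and lives in any XMM register.
   "x" : b is an input in any XMM register. */
static __m128i xor_via_asm(__m128i a, __m128i b)
{
    __asm__("pxor %1, %0" : "+x"(a) : "x"(b));
    return a;
}

/* Small scalar wrapper so the result is easy to inspect. */
static uint32_t xor_low32(uint32_t p, uint32_t q)
{
    __m128i r = xor_via_asm(_mm_cvtsi32_si128((int)p),
                            _mm_cvtsi32_si128((int)q));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

Letting the compiler allocate the registers avoids clobbering a value it already keeps in xmm0. If a specific register is really required, list it in the clobber section instead of the constraint string.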

Truth-table reduction to ternary logic operations, vpternlog

Question: I have many truth tables of many variables (7 or more), and I use a tool (e.g. Logic Friday 1) to simplify the logic formulas. I could do that by hand, but that is much too error-prone. I then translate these formulas to compiler intrinsics (e.g. _mm_xor_epi32), which works fine. Question: with vpternlog I can make ternary logic operations. But I'm not aware of a method to simplify my truth tables to sequences of vpternlog instructions that are (somewhat) efficient. I'm not asking if someone knows a
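The reduction of a 7-variable table to a sequence of ternary operations is the hard part and is not attempted here. The last step, though, is mechanical: once a sub-expression has three inputs, its 8-entry truth table is exactly the 8-bit immediate of vpternlog, with bit index `(a << 2) | (b << 1) | c`. A small portable sketch of that step, with hypothetical names (`ternlog_imm8`, `xor3`, `select_bc`, `majority`):

```c
#include <stdint.h>

/* A three-input boolean function; each argument is 0 or 1. */
typedef int (*bool3_fn)(int a, int b, int c);

/* Build the vpternlog immediate for f: bit number (a<<2 | b<<1 | c)
   of the result is f(a, b, c). */
static uint8_t ternlog_imm8(bool3_fn f)
{
    uint8_t imm = 0;
    for (int idx = 0; idx < 8; ++idx) {
        int a = (idx >> 2) & 1;
        int b = (idx >> 1) & 1;
        int c = idx & 1;
        if (f(a, b, c))
            imm |= (uint8_t)(1u << idx);
    }
    return imm;
}

/* Example functions. */
static int xor3(int a, int b, int c)      { return a ^ b ^ c; }
static int select_bc(int a, int b, int c) { return a ? b : c; }
static int majority(int a, int b, int c)  { return (a & b) | (a & c) | (b & c); }
```

Here `ternlog_imm8(xor3)` yields 0x96, `ternlog_imm8(select_bc)` yields 0xCA, and `ternlog_imm8(majority)` yields 0xE8; each value would be passed as the last argument of an intrinsic such as `_mm512_ternarylogic_epi32`.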

Conditional SSE/AVX add or zero elements based on compare

Question: I have the following __m128 vectors: v_weight and v_entropy. I need to add v_entropy to v_weight only where the elements of v_weight are not 0f. Obviously, _mm_add_ps() adds all elements regardless. I can compile up to AVX, but not AVX2. EDIT: I do know beforehand how many elements in v_weight will be 0 (it will always be either none or the last 1, 2, or 3 elements). If it's easier, how do I zero out the corresponding elements in v_entropy? Answer 1: The cmpeq/cmpgt instructions create a mask, all ones or
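The mask approach the answer starts to describe can be sketched as follows: compare v_weight against zero, AND the resulting all-ones/all-zeros mask with v_entropy, then add. This needs only SSE. The function names `add_where_nonzero` and `add_where_nonzero4` are hypothetical; the variable names come from the question.

```c
#include <xmmintrin.h>  /* SSE */

/* Returns v_weight + v_entropy in lanes where v_weight != 0,
   and v_weight unchanged (that is, 0) in the other lanes. */
static __m128 add_where_nonzero(__m128 v_weight, __m128 v_entropy)
{
    /* All-ones bits where the lane is non-zero, all-zeros otherwise. */
    __m128 nonzero = _mm_cmpneq_ps(v_weight, _mm_setzero_ps());
    /* ANDing with the mask zeroes v_entropy in the masked-out lanes. */
    return _mm_add_ps(v_weight, _mm_and_ps(nonzero, v_entropy));
}

/* Array wrapper around the vector function, for convenience. */
static void add_where_nonzero4(const float w[4], const float e[4],
                               float out[4])
{
    _mm_storeu_ps(out, add_where_nonzero(_mm_loadu_ps(w), _mm_loadu_ps(e)));
}
```

Because -0.0f compares equal to zero, such lanes are also left untouched. When the number of trailing zero lanes is known in advance, as in the EDIT, a precomputed constant mask could replace the compare.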

SIMD: Accumulate Adjacent Pairs

Question: I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but it is still tricky for a newbie like me. I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel. template <typename T, typename U> inline void accumulate_2x2_x_pass( T* channel, U* accum, const size_t sx, const size_t sy, const size_t
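For the horizontal pass described above, one possible SSE2 formulation treats each pair of adjacent 8-bit pixels as a 16-bit lane, then separates the even and odd bytes with a mask and a shift. This is a sketch under assumptions not stated in the truncated question: 8-bit input, 16-bit accumulators, accumulation with `+=`, and an output count that is a multiple of 8. The name `accumulate_pairs_u8` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* accum[i] += channel[2*i] + channel[2*i + 1] for i in [0, n_out). */
static void accumulate_pairs_u8(const uint8_t *channel, uint16_t *accum,
                                size_t n_out)
{
    assert(n_out % 8 == 0);
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    for (size_t i = 0; i < n_out; i += 8) {
        /* 16 input pixels = 8 adjacent pairs, one pair per 16-bit lane. */
        __m128i px = _mm_loadu_si128((const __m128i *)(channel + 2 * i));
        __m128i even = _mm_and_si128(px, low_byte);  /* pixels 0, 2, 4, ... */
        __m128i odd = _mm_srli_epi16(px, 8);         /* pixels 1, 3, 5, ... */
        __m128i acc = _mm_loadu_si128((const __m128i *)(accum + i));
        acc = _mm_add_epi16(acc, _mm_add_epi16(even, odd));
        _mm_storeu_si128((__m128i *)(accum + i), acc);
    }
}
```

Calling the same routine on two consecutive image rows with the same `accum` pointer leaves the 2x2 sums in `accum`, ready to be divided by 4.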

AVX intrinsics for tiled matrix multiplication [closed]

Question: I was trying to use AVX512 intrinsics to vectorize my loop of matrix multiplication (tiled). I used __mm256d variables to hold intermediate results and then stored them into my results. However, somehow this triggers memory corruption. I've got no hint why this is the case, as the non