intrinsics

AVX2 byte gather with uint16 indices, into a __m256i

冷暖自知 posted on 2021-02-07 13:30:20
Question: I am trying to pack a __m256i variable with 32 chars taken from an array at specified indices. Here is my code: char array[]; // different array every time. uint16_t offset[32]; // same offset reused many times _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]], array[offset[7]], array[offset[8]], array[offset[9]], array[offset[10]], array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]],

AVX/SSE round floats down and return vector of ints?

拟墨画扇 posted on 2021-02-07 08:20:53
Question: Is there a way using AVX/SSE to take a vector of floats, round down, and produce a vector of ints? All the floor intrinsics seem to produce a final vector of floating point, which is odd because rounding produces an integer! Answer 1: SSE has conversion from FP to integer with your choice of truncation (towards zero) or the current rounding mode (normally the IEEE default mode, nearest with tiebreaks rounding to even. Like nearbyint(), unlike round() where the tiebreak is away-from-0. If
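
A minimal sketch of one way to get floor-to-int with SSE2 only, building on the truncating conversion the answer mentions: truncate, then subtract 1 in every lane where truncation rounded up (which happens for negative non-integers). The function name floor_ps_to_epi32 is made up for illustration, and the sketch assumes every result fits in a 32-bit int.

```c
#include <emmintrin.h>  /* SSE2 */

/* Floor four floats and return them as four 32-bit ints.
   Assumption: all inputs are finite and their floors fit in int32. */
static __m128i floor_ps_to_epi32(__m128 x)
{
    __m128i t  = _mm_cvttps_epi32(x);   /* truncate toward zero */
    __m128  tf = _mm_cvtepi32_ps(t);    /* back to float for the compare */
    __m128  gt = _mm_cmpgt_ps(tf, x);   /* all-ones where truncation rounded up */
    /* An all-ones lane is -1 as an integer, so adding the mask subtracts 1. */
    return _mm_add_epi32(t, _mm_castps_si128(gt));
}
```

With SSE4.1 available, the same result can instead be obtained by flooring in floating point first and converting afterwards; the version above only avoids that dependency.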

Matrix-Vector and Matrix-Matrix multiplication using SSE

懵懂的女人 posted on 2021-02-07 04:28:19
Question: I need to write matrix-vector and matrix-matrix multiplication functions, but I cannot wrap my head around SSE commands. The dimensions of the matrices and vectors are always multiples of 4. I managed to write a vector-vector multiplication function that looks like this: void vector_multiplication_SSE(float* m, float* n, float* result, unsigned const int size) { int i; __declspec(align(16))__m128 *p_m = (__m128*)m; __declspec(align(16))__m128 *p_n = (__m128*)n; __declspec(align(16))__m128 *p
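
As a rough illustration of the matrix-vector case, here is a sketch that is not the asker's code. It assumes a row-major matrix, a column count that is a multiple of 4 (as stated in the question), and uses unaligned loads so that no alignment attribute is needed. The name matvec_sse and its parameter layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* y = A * x, where A holds rows x cols floats in row-major order.
   Assumption: cols is a multiple of 4. */
static void matvec_sse(const float *A, const float *x, float *y,
                       size_t rows, size_t cols)
{
    assert(cols % 4 == 0);
    for (size_t r = 0; r < rows; ++r) {
        const float *row = A + r * cols;
        __m128 acc = _mm_setzero_ps();
        for (size_t c = 0; c < cols; c += 4) {
            __m128 a = _mm_loadu_ps(row + c);
            __m128 v = _mm_loadu_ps(x + c);
            acc = _mm_add_ps(acc, _mm_mul_ps(a, v));  /* four partial sums */
        }
        /* Horizontal sum of the four partial sums. */
        __m128 hi = _mm_movehl_ps(acc, acc);                    /* lanes 2,3 moved to 0,1 */
        __m128 s2 = _mm_add_ps(acc, hi);                        /* lane0 = 0+2, lane1 = 1+3 */
        __m128 s1 = _mm_add_ss(s2, _mm_shuffle_ps(s2, s2, 1));  /* lane0 + lane1 */
        y[r] = _mm_cvtss_f32(s1);
    }
}
```

A matrix-matrix product can be built on the same inner loop by treating each column of the right-hand matrix as a vector, although a layout-aware kernel would normally be faster.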

Use both SSE2 intrinsics and gcc inline assembler

北城以北 posted on 2021-02-07 02:49:32
Question: I have tried to mix SSE2 intrinsics and inline assembler in gcc. But if I specify a variable as an xmm0 register input, then in some cases I get a compiler error. Example: #include <emmintrin.h> int main() { __m128i test = _mm_setzero_si128(); asm ("pxor %%xmm0, %%xmm0" : : "xmm0" (test) : ); } When compiled with gcc version 4.6.1 I get: >gcc asm_xmm.c asm_xmm.c: In function ‘main’: asm_xmm.c:10:3: error: matching constraint references invalid operand number asm_xmm.c:7:5: error: matching
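
For comparison, a hedged sketch of how an SSE operand is usually passed to GCC extended asm on x86: the machine constraint "x" asks for any SSE register, and the "+" modifier marks the operand as both read and written, so the compiler picks the register and the template refers to it as %0. The wrapper name zero_via_asm is made up for this example, and it does not pin the value to xmm0 specifically.

```c
#include <emmintrin.h>  /* SSE2 */

/* Clear a vector with PXOR through inline asm (GCC/Clang, x86 only). */
static __m128i zero_via_asm(__m128i v)
{
    /* "+x": v is an input and an output, held in an SSE register
       chosen by the compiler; %0 expands to that register. */
    __asm__ ("pxor %0, %0" : "+x"(v));
    return v;
}
```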

Truth-table reduction to ternary logic operations, vpternlog

对着背影说爱祢 posted on 2021-02-06 10:52:12
Question: I have many truth tables of many variables (7 or more), and I use a tool (e.g. Logic Friday 1) to simplify the logic formulas. I could do that by hand, but that is much too error-prone. I then translate these formulas to compiler intrinsics (e.g. _mm_xor_epi32), which works fine. Question: with vpternlog I can make ternary logic operations. But I'm not aware of a method to simplify my truth tables to sequences of vpternlog instructions that are (somewhat) efficient. I'm not asking if someone knows a
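
This does not answer the reduction question, but as background, here is a small sketch of how the 8-bit immediate for a single vpternlog is commonly derived: evaluate the desired 3-input function on the constants 0xF0, 0xCC and 0xAA, whose bits enumerate all eight input combinations (the first operand supplies the most significant index bit). The helper names ternlog_imm8, xor3 and majority are hypothetical.

```c
#include <stdint.h>

/* A 3-input boolean function applied bitwise to bytes. */
typedef uint8_t (*ternary_fn)(uint8_t a, uint8_t b, uint8_t c);

/* Bit i of the result is f(a, b, c) for i = (a << 2) | (b << 1) | c. */
static uint8_t ternlog_imm8(ternary_fn f)
{
    return f(0xF0, 0xCC, 0xAA);
}

/* Two example functions to encode. */
static uint8_t xor3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)(a ^ b ^ c);
}

static uint8_t majority(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}
```

Under these assumptions, ternlog_imm8(xor3) yields 0x96 and ternlog_imm8(majority) yields 0xE8. Truth tables with more than three inputs still have to be decomposed into several such 3-input steps, which is the hard part the question is about.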

Conditional SSE/AVX add or zero elements based on compare

此生再无相见时 posted on 2021-02-04 21:40:18
Question: I have the following __m128 vectors: v_weight and v_entropy. I need to add v_entropy to v_weight only where elements in v_weight are not 0f. Obviously _mm_add_ps() adds all elements regardless. I can compile up to AVX, but not AVX2. EDIT: I do know beforehand how many elements in v_weight will be 0 (there will always be either 0, or the last 1, 2, or 3 elements). If it's easier, how do I zero out the corresponding elements in v_entropy? Answer 1: The cmpeq/cmpgt instructions create a mask, all ones or
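
A minimal sketch of the mask approach the answer starts to describe, using SSE only: compare v_weight against zero to get an all-ones/all-zeros mask per element, AND the mask with v_entropy to zero the unwanted elements, then add. The function name add_where_nonzero is made up for illustration; the variable names follow the question.

```c
#include <xmmintrin.h>  /* SSE */

/* Returns v_weight + v_entropy in lanes where v_weight != 0,
   and v_weight unchanged (that is, 0) in the other lanes. */
static __m128 add_where_nonzero(__m128 v_weight, __m128 v_entropy)
{
    __m128 mask   = _mm_cmpneq_ps(v_weight, _mm_setzero_ps()); /* all-ones where weight != 0 */
    __m128 masked = _mm_and_ps(mask, v_entropy);               /* entropy zeroed elsewhere */
    return _mm_add_ps(v_weight, masked);
}
```

Because the number of trailing zero elements is known beforehand, a precomputed constant mask could replace the compare; the compare version is shown because it needs no such assumption.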

SIMD: Accumulate Adjacent Pairs

青春壹個敷衍的年華 posted on 2021-02-02 09:29:36
Question: I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me. I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel. template <typename T, typename U> inline void accumulate_2x2_x_pass( T* channel, U* accum, const size_t sx, const size_t sy, const size_t
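
To make the "sum of two adjacent pixels" step concrete, here is a sketch in plain C with SSE2 rather than the asker's template. It assumes 8-bit input pixels, 16-bit accumulators, and a little-endian x86 target; the name sum_adjacent_pairs_u8 and its signature are hypothetical. Each 16-bit lane of a loaded vector holds one pixel pair, so masking gives the even pixel and a right shift gives the odd one.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* dst[i] = src[2*i] + src[2*i + 1] for i in [0, n_pairs). */
static void sum_adjacent_pairs_u8(const uint8_t *src, uint16_t *dst, size_t n_pairs)
{
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    size_t i = 0;
    for (; i + 8 <= n_pairs; i += 8) {           /* 16 pixels -> 8 sums per iteration */
        __m128i v    = _mm_loadu_si128((const __m128i *)(src + 2 * i));
        __m128i even = _mm_and_si128(v, low_byte);   /* src[2i] in each 16-bit lane */
        __m128i odd  = _mm_srli_epi16(v, 8);         /* src[2i+1] in each 16-bit lane */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(even, odd));
    }
    for (; i < n_pairs; ++i)                     /* scalar tail */
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
}
```

Running the same routine on two consecutive rows and adding the results would give the 2x2 sums; other pixel types would need a different widening step.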

AVX intrinsics for tiled matrix multiplication [closed]

我怕爱的太早我们不能终老 posted on 2021-01-29 13:18:23
Question: Closed. This question needs debugging details. It is not currently accepting answers. Closed 1 year ago. I was trying to use AVX512 intrinsics to vectorize my loop of matrix multiplication (tiled). I used __mm256d as variables to store intermediate results and store them in my results. However, somehow this triggers memory corruption. I've got no hint why this is the case, as the non