avx2

Ubuntu - how to tell if AVX or SSE is currently being used by a CPU app?

与世无争的帅哥 submitted on 2021-02-16 15:42:45
Question: I currently run BOINC across a number of servers that have GPUs. The servers run both GPU and CPU BOINC apps. Because AVX and SSE lower the CPU frequency when they are used within a CPU app, I have to be selective about which CPU and GPU apps I run together: some GPU apps get bottlenecked (slower run-time completion) whereas others do not. At present some CPU apps are named so that it is clear whether they use AVX, but most are not. Therefore is there any command I can run, and some way of viewing, to see if any of

Keep only the 10 useful bits in 16-bit words

烂漫一生 submitted on 2021-02-15 06:11:50
Question: I have __m256i vectors that contain 10-bit words inside 16-bit integers (so 16*16-bit containing only 16*10 useful bits). What is the best/fastest way to extract only those 10 bits and pack them to produce an output bitstream of 10-bit values? Answer 1: Here’s my attempt. I have not benchmarked it, but I think it should be pretty fast overall: not too many instructions, all of which have 1 cycle of latency on modern processors. The stores are also efficient: 2 store instructions for 20 bytes of data.
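When checking a vectorized packer like the one the answer describes, a scalar reference that defines the expected bitstream can be useful. The sketch below is not the answer's AVX2 code. It assumes an LSB-first (little-endian) bit order, which the question does not specify, and the name `pack10_scalar` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (hypothetical name): packs `count` 10-bit values, each
 * stored in a 16-bit word, into a contiguous bitstream.
 * Assumption: bits are emitted LSB-first. Returns the bytes written. */
size_t pack10_scalar(const uint16_t *src, size_t count, uint8_t *dst)
{
    assert(count % 4 == 0);      /* 4 values fill exactly 5 bytes */

    uint32_t acc = 0;            /* bit accumulator, holds at most 17 bits */
    unsigned bits = 0;           /* number of valid bits in acc */
    size_t out = 0;

    for (size_t i = 0; i < count; i++) {
        /* Drop the 6 unused upper bits, append the value above pending bits. */
        acc |= (uint32_t)(src[i] & 0x3FF) << bits;
        bits += 10;
        /* Flush every complete byte. */
        while (bits >= 8) {
            dst[out++] = (uint8_t)acc;
            acc >>= 8;
            bits -= 8;
        }
    }
    return out;
}
```

For one vector's worth of input (16 words) this writes 20 bytes, which matches the "20 bytes of data" mentioned in the answer.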

Keep only the 10 useful bits in 16-bit words

半腔热情 submitted on 2021-02-15 06:11:49
Question: I have __m256i vectors that contain 10-bit words inside 16-bit integers (so 16*16-bit containing only 16*10 useful bits). What is the best/fastest way to extract only those 10 bits and pack them to produce an output bitstream of 10-bit values? Answer 1: Here’s my attempt. I have not benchmarked it, but I think it should be pretty fast overall: not too many instructions, all of which have 1 cycle of latency on modern processors. The stores are also efficient: 2 store instructions for 20 bytes of data.

Keep only the 10 useful bits in 16-bit words

六月ゝ 毕业季﹏ submitted on 2021-02-15 06:11:18
Question: I have __m256i vectors that contain 10-bit words inside 16-bit integers (so 16*16-bit containing only 16*10 useful bits). What is the best/fastest way to extract only those 10 bits and pack them to produce an output bitstream of 10-bit values? Answer 1: Here’s my attempt. I have not benchmarked it, but I think it should be pretty fast overall: not too many instructions, all of which have 1 cycle of latency on modern processors. The stores are also efficient: 2 store instructions for 20 bytes of data.

Gather AVX2&512 intrinsic for 16-bit integers?

五迷三道 submitted on 2021-02-11 12:15:20
Question: Imagine this piece of code: void Function(int16 *src, int *indices, float *dst, int cnt, float mul) { for (int i=0; i<cnt; i++) dst[i] = float(src[indices[i]]) * mul; }; This really asks for gather intrinsics, e.g. _mm_i32gather_epi32. I had great success with these when loading floats, but are there any for 16-bit ints? Another problem here is that I need to transition from 16 bits on the input to 32 bits (float) on the output. Answer 1: There is indeed no instruction to gather 16-bit integers, but
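One common workaround can be sketched as follows: gather 32-bit elements with a byte scale of 2, then sign-extend the low 16 bits of each lane. This is not necessarily what the truncated answer goes on to propose. The function name is hypothetical, and the sketch assumes GCC or Clang on little-endian x86-64, a CPU with AVX2, and at least 2 readable bytes after the last element of `src`, because every gather lane loads 4 bytes.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name. Computes dst[i] = (float)src[indices[i]] * mul.
 * Assumption: src has 2 readable padding bytes after its last element. */
__attribute__((target("avx2")))
void gather16_to_float(const int16_t *src, const int32_t *indices,
                       float *dst, int cnt, float mul)
{
    assert(cnt >= 0);
    const __m256 vmul = _mm256_set1_ps(mul);
    int i = 0;

    for (; i + 8 <= cnt; i += 8) {
        __m256i idx = _mm256_loadu_si256((const __m256i *)(indices + i));
        /* Scale 2: lane k loads the 4 bytes at (char *)src + 2 * idx[k],
         * so the wanted int16 sits in the low half of the lane. */
        __m256i raw = _mm256_i32gather_epi32((const int *)src, idx, 2);
        /* Shift left then arithmetic right to sign-extend the low 16 bits. */
        __m256i val = _mm256_srai_epi32(_mm256_slli_epi32(raw, 16), 16);
        _mm256_storeu_ps(dst + i,
                         _mm256_mul_ps(_mm256_cvtepi32_ps(val), vmul));
    }

    /* Scalar tail for the last cnt % 8 elements. */
    for (; i < cnt; i++)
        dst[i] = (float)src[indices[i]] * mul;
}
```

The shift pair also handles the 16-bit to 32-bit transition the question mentions, since the sign-extended lanes can be converted to float directly.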

Convert “__m256 with random-bits” into float values of [0, 1] range

…衆ロ難τιáo~ submitted on 2021-02-08 19:53:53
Question: I have a __m256 value that holds random bits. I would like to "interpret" it, to obtain another __m256 that holds float values in a uniform [0.0f, 1.0f] range. Planning to do it using: __m256 randomBits = /* generated random bits, uniformly distributed */; __m256 invFloatRange = _mm256_set1_ps( numeric_limits<float>::min() ); //min is a smallest increment of float precision __m256 float01 = _mm256_mul(randomBits, invFloatRange); //float01 is now ready to be used Question 1: However, will
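A widely used alternative to multiplying the raw bit patterns is the mantissa trick, sketched here under assumptions: keep 23 random bits as the mantissa, OR in the exponent of 1.0f to get a float in [1, 2), then subtract 1.0f. The result is uniform over multiples of 2^-23 in the half-open range [0, 1), not the closed range the question asks for. The function name is hypothetical, pointers are used instead of `__m256` arguments to keep the interface simple, and GCC or Clang on an AVX2-capable x86-64 CPU is assumed.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name. Converts 8 words of random bits into 8 floats
 * in [0, 1). */
__attribute__((target("avx2")))
void random_bits_to_unit_floats(const uint32_t *bits, float *out)
{
    assert(bits && out);
    __m256i r = _mm256_loadu_si256((const __m256i *)bits);
    /* Keep the top 23 random bits as the mantissa. */
    __m256i mantissa = _mm256_srli_epi32(r, 9);
    /* 0x3F800000 is the bit pattern of 1.0f: sign 0, exponent 127. */
    __m256i one_bits = _mm256_set1_epi32(0x3F800000);
    /* Reinterpret as floats in [1, 2), then shift the range to [0, 1). */
    __m256 in_1_2 = _mm256_castsi256_ps(_mm256_or_si256(mantissa, one_bits));
    _mm256_storeu_ps(out, _mm256_sub_ps(in_1_2, _mm256_set1_ps(1.0f)));
}
```

If the bits already live in a `__m256`, `_mm256_castps_si256` reinterprets them as integers without generating an instruction.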

left shift of 128 bit number using AVX2 instruction

我的梦境 submitted on 2021-02-08 07:21:22
Question: I am trying to do a left rotation of a 128-bit number in AVX2. Since there is no direct method of doing this, I have tried using left shift and right shift to accomplish my task. Here is a snippet of my code to do the same. l = 4; r = 4; targetrotate = _mm_set_epi64x (l, r); targetleftrotate = _mm_sllv_epi64 (target, targetrotate); The above code snippet rotates target by 4 to the left. When I tested the above code with a sample input, I could see the result is not rotated correctly. Here is
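The snippet in the question shifts each 64-bit lane independently, so bits shifted out of one lane are lost instead of being carried into the other. A minimal sketch of one way to carry them, assuming a rotate count strictly between 0 and 64 and using only SSE2 intrinsics (the name `rotl128` is hypothetical):

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name. Rotates the full 128-bit value in v left by n bits.
 * Assumption: 0 < n < 64. */
__m128i rotl128(__m128i v, int n)
{
    assert(n > 0 && n < 64);
    /* Shift both 64-bit lanes left; the top n bits of each lane fall off. */
    __m128i left = _mm_sll_epi64(v, _mm_cvtsi32_si128(n));
    /* Recover the bits that fell off, moved to the bottom of each lane. */
    __m128i carry = _mm_srl_epi64(v, _mm_cvtsi32_si128(64 - n));
    /* Swap the two lanes so each carry lands in the neighbouring lane. */
    carry = _mm_shuffle_epi32(carry, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_or_si128(left, carry);
}
```

For example, with the high lane set to 0x8000000000000001 and the low lane set to 0xF0, rotating by 4 moves the top bit of the high lane into bit 3 of the low lane.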

left shift of 128 bit number using AVX2 instruction

血红的双手。 submitted on 2021-02-08 07:21:14
Question: I am trying to do a left rotation of a 128-bit number in AVX2. Since there is no direct method of doing this, I have tried using left shift and right shift to accomplish my task. Here is a snippet of my code to do the same. l = 4; r = 4; targetrotate = _mm_set_epi64x (l, r); targetleftrotate = _mm_sllv_epi64 (target, targetrotate); The above code snippet rotates target by 4 to the left. When I tested the above code with a sample input, I could see the result is not rotated correctly. Here is

AVX2 byte gather with uint16 indices, into a __m256i

冷暖自知 submitted on 2021-02-07 13:30:20
Question: I am trying to pack a __m256i variable with 32 chars from an array, selected by indices. Here is my code: char array[]; // different array every time. uint16_t offset[32]; // same offset reused many times _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]], array[offset[7]], array[offset[8]],array[offset[9]],array[offset[10]],array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]],
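AVX2 has no byte-granularity gather, so one possible approach, sketched here and not benchmarked, is to issue four 32-bit gathers, keep the low byte of each lane, and pack the results down to bytes. The function name is hypothetical. The sketch assumes GCC or Clang on an AVX2-capable x86-64 CPU and at least 3 readable bytes after the last element of `array`, since each gather lane loads 4 bytes. It also stores byte i as `array[offset[i]]`, whereas `_mm256_set_epi8` in the question places its first argument in the highest byte.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name. Writes out[i] = array[offset[i]] for i = 0..31.
 * Assumption: array has 3 readable padding bytes after its last element. */
__attribute__((target("avx2")))
void gather_bytes_u16(const char *array, const uint16_t *offset, uint8_t *out)
{
    assert(array && offset && out);
    const __m256i low_byte = _mm256_set1_epi32(0xFF);
    __m256i g[4];

    for (int k = 0; k < 4; k++) {
        /* Widen 8 uint16 offsets to the 32-bit indices the gather needs. */
        __m128i off16 = _mm_loadu_si128((const __m128i *)(offset + 8 * k));
        __m256i off32 = _mm256_cvtepu16_epi32(off16);
        /* Scale 1: each lane loads the 4 bytes starting at array + offset. */
        __m256i dwords = _mm256_i32gather_epi32((const int *)array, off32, 1);
        g[k] = _mm256_and_si256(dwords, low_byte);
    }

    /* Narrow 32 -> 16 -> 8 bits. The pack instructions work per 128-bit
     * lane, which leaves the 4-byte groups interleaved. */
    __m256i p01 = _mm256_packus_epi32(g[0], g[1]);
    __m256i p23 = _mm256_packus_epi32(g[2], g[3]);
    __m256i packed = _mm256_packus_epi16(p01, p23);

    /* Restore the natural order of the eight 4-byte groups. */
    const __m256i fix = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    packed = _mm256_permutevar8x32_epi32(packed, fix);
    _mm256_storeu_si256((__m256i *)out, packed);
}
```

Because the question reuses the same offsets many times, the widened index vectors could be computed once and kept, leaving only the gathers and packs per array.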

AVX2 byte gather with uint16 indices, into a __m256i

走远了吗. submitted on 2021-02-07 13:28:26
Question: I am trying to pack a __m256i variable with 32 chars from an array, selected by indices. Here is my code: char array[]; // different array every time. uint16_t offset[32]; // same offset reused many times _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]], array[offset[7]], array[offset[8]],array[offset[9]],array[offset[10]],array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]],