simd

Keep only the 10 useful bits in 16-bit words

烂漫一生 submitted on 2021-02-15 06:11:50
Question: I have __m256i vectors that contain 10-bit words inside 16-bit integers (so 16 x 16-bit lanes holding only 16 x 10 useful bits). What is the best/fastest way to extract only those 10 bits and pack them to produce an output bitstream of 10-bit values?
Answer 1: Here's my attempt. I have not benchmarked it, but I think it should be fast overall: it needs few instructions, and all of them have 1 cycle of latency on modern processors. The stores are also efficient: 2 store instructions for 20 bytes of data.
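
The answer's SIMD code is not included in this excerpt. As a baseline for checking any vectorized version, here is a minimal scalar sketch of the packing itself. The function name `pack10` and the bit order (least-significant bit first, values laid end to end) are assumptions for illustration, not something the original post specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference (hypothetical helper): pack the low 10 bits of each
// 16-bit word into a contiguous byte stream, LSB first (assumed layout).
std::vector<std::uint8_t> pack10(const std::uint16_t* src, std::size_t count) {
    std::vector<std::uint8_t> out;
    out.reserve((count * 10 + 7) / 8);
    std::uint32_t acc = 0;  // bit accumulator
    unsigned bits = 0;      // number of valid bits currently in acc
    for (std::size_t i = 0; i < count; ++i) {
        // Drop the 6 unused high bits, then append the 10 useful ones.
        acc |= static_cast<std::uint32_t>(src[i] & 0x3FFu) << bits;
        bits += 10;
        // Flush every complete byte.
        while (bits >= 8) {
            out.push_back(static_cast<std::uint8_t>(acc & 0xFFu));
            acc >>= 8;
            bits -= 8;
        }
    }
    // Flush a trailing partial byte, if any (zero-padded).
    if (bits > 0) out.push_back(static_cast<std::uint8_t>(acc & 0xFFu));
    return out;
}
```

Sixteen input words produce exactly 20 output bytes, which matches the "20 bytes of data" per vector mentioned in the answer.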

assignment with intel Intrinsics - horizontal add

不羁的心 submitted on 2021-02-11 15:14:19
Question: I want to sum up all elements of a big vector ary. My idea was to do it with a horizontal sum.

    const int simd_width = 16/sizeof(float);
    float helper[simd_width];
    //take the first 4 elements
    const __m128 a4 = _mm_load_ps(ary);
    for(int i=0; i<N-simd_width; i+=simd_width){
        const __m128 b4 = _mm_load_ps(ary+i+simd_width);
        //save temporary result in helper array
        _mm_store_ps(helper, _mm_hadd_ps(a4,b4)); //C
        const __m128 a4 = _mm_load_ps(helper);
    }

I looked for a method with which I can assign the
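
Note that the loop above declares a second `a4` inside the loop body, so the running value is never carried into the next iteration. A common alternative, sketched below, keeps a vector accumulator, uses plain vertical adds inside the loop, and does a single horizontal reduction at the end. The name `sum_sse` is hypothetical, and unaligned loads are used here as an assumption so the sketch works for any pointer; this is not the accepted answer from the original thread.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE

// Sketch: sum n floats with one vector accumulator (hypothetical helper).
float sum_sse(const float* ary, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    // Vertical adds: four independent partial sums, one per lane.
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(ary + i));

    // Horizontal reduction of the four lanes, done once.
    __m128 hi   = _mm_movehl_ps(acc, acc);   // [a2, a3, a2, a3]
    __m128 sum2 = _mm_add_ps(acc, hi);       // lane0 = a0+a2, lane1 = a1+a3
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float total = _mm_cvtss_f32(_mm_add_ss(sum2, odd));

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) total += ary[i];
    return total;
}
```

Because the partial sums are accumulated per lane, the result can differ in the last bits from a strictly sequential scalar sum for general floating-point data.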

Efficient SSE FP `floor()` / `ceil()` / `round()` Rounding Functions Without SSE4.1?

我怕爱的太早我们不能终老 submitted on 2021-02-10 04:13:39
Question: How can I round a __m128 vector of floats up/down or to the nearest integer, like these functions?

Round - roundf()
Ceil - ceilf() or SSE4.1 _mm_ceil_ps.
Floor - floorf() or SSE4.1 _mm_floor_ps.

I need to do this without SSE4.1 roundps (_mm_floor_ps / _mm_ceil_ps / _mm_round_ps(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)). roundps can also truncate toward zero, but I don't need that for this application. I can use SSE3 and earlier (no SSSE3 or SSE4). So the function declaration would
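
One well-known way to get floor and ceil with only SSE2 is to truncate through an integer conversion and then correct by 1 where the truncation went the wrong way. The sketch below covers floor and ceil only (not round-to-nearest); the names `floor_sse2` and `ceil_sse2` are hypothetical, and it is not taken from the answers to the original question. Inputs with magnitude of 2^23 or more (already integral) and NaN are passed through unchanged. The sign of a zero result is not preserved (for example, -0.25 rounds up to +0.0 rather than -0.0), and out-of-range inputs may still raise the invalid-operation flag in the conversion.

```cpp
#include <emmintrin.h>  // SSE2

// Select r where |x| < 2^23, otherwise keep x (large values and NaN).
static inline __m128 keep_large_or_nan(__m128 x, __m128 r) {
    __m128 absx  = _mm_andnot_ps(_mm_set1_ps(-0.0f), x);        // clear sign bit
    __m128 small = _mm_cmplt_ps(absx, _mm_set1_ps(8388608.0f)); // false for NaN
    return _mm_or_ps(_mm_and_ps(small, r), _mm_andnot_ps(small, x));
}

static inline __m128 floor_sse2(__m128 x) {
    __m128 t   = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));  // truncate toward zero
    // Truncation rounds negatives up, so subtract 1 where t > x.
    __m128 adj = _mm_and_ps(_mm_cmpgt_ps(t, x), _mm_set1_ps(1.0f));
    return keep_large_or_nan(x, _mm_sub_ps(t, adj));
}

static inline __m128 ceil_sse2(__m128 x) {
    __m128 t   = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));  // truncate toward zero
    // Truncation rounds positives down, so add 1 where t < x.
    __m128 adj = _mm_and_ps(_mm_cmplt_ps(t, x), _mm_set1_ps(1.0f));
    return keep_large_or_nan(x, _mm_add_ps(t, adj));
}
```

All intrinsics used here are SSE or SSE2, so the sketch stays within the "SSE3 and earlier" constraint stated in the question.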

Convert “__m256 with random-bits” into float values of [0, 1] range

…衆ロ難τιáo~ submitted on 2021-02-08 19:53:53
Question: I have a __m256 value that holds random bits. I would like to "interpret" it to obtain another __m256 that holds float values in a uniform [0.0f, 1.0f] range. I am planning to do it using:

    __m256 randomBits = /* generated random bits, uniformly distributed */;
    __m256 invFloatRange = _mm256_set1_ps( numeric_limits<float>::min() ); //min is a smallest increment of float precision
    __m256 float01 = _mm256_mul(randomBits, invFloatRange); //float01 is now ready to be used

Question 1: However, will
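
Random bits reinterpreted directly as floats can be any value, including NaN and infinity, so a plain multiply does not give a uniform result. A widely used alternative is to keep 23 of the random bits as the mantissa, force the exponent of 1.0f to get a value in [1, 2), and subtract 1. The sketch below uses 128-bit SSE2 intrinsics as an assumption so it builds without AVX2 flags; the same steps exist for 256-bit vectors (`_mm256_srli_epi32`, `_mm256_or_si256`, `_mm256_castsi256_ps`, `_mm256_sub_ps`). The name `bits_to_unit_float` is hypothetical, and this is not the answer from the original thread.

```cpp
#include <emmintrin.h>  // SSE2

// Sketch: map 32 random bits per lane to a float in [0.0, 1.0).
static inline __m128 bits_to_unit_float(__m128i randomBits) {
    // Keep the top 23 bits of each lane as the mantissa field.
    __m128i mant = _mm_srli_epi32(randomBits, 9);
    // OR in the sign/exponent bits of 1.0f: value is now in [1.0, 2.0).
    __m128i one  = _mm_set1_epi32(0x3F800000);
    __m128  f    = _mm_castsi128_ps(_mm_or_si128(mant, one));
    // Shift the range down to [0.0, 1.0).
    return _mm_sub_ps(f, _mm_set1_ps(1.0f));
}
```

The result is a multiple of 2^-23 and the upper bound 1.0f is never produced, so the range is half-open rather than the closed [0, 1] asked for in the question.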