simd

SIMD optimization of a curve computed from the second derivative

妖精的绣舞 submitted on 2021-02-08 19:21:51
Question: This question is really a curiosity. I was converting a routine into SIMD instructions (and I am quite new to SIMD programming), and had trouble with the following bit of code:

    // args:
    uint32_t phase_current;
    uint32_t phase_increment;
    uint32_t phase_increment_step;

    for (int i = 0; i < blockSize; ++i) {
        USEFUL_FUNC(phase_current);
        phase_increment += phase_increment_step;
        phase_current += phase_increment;
    }

The Question: Assuming that USEFUL_FUNC has a SIMD implementation and I am simply
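
The loop above has a loop-carried dependency, but unrolling the recurrence gives a closed form: the phase at step k is phase_current + k * phase_increment + phase_increment_step * k(k+1)/2 (mod 2^32). That suggests one way to vectorize it: seed four lanes from the closed form, then advance all lanes four steps at a time with two vector additions. The sketch below is an illustration under assumptions, not the asker's or any answerer's code: `useful_func_x4`, `phases_scalar` and `phases_sse2` are hypothetical names, the stand-in for USEFUL_FUNC merely stores the phases, and the block size is assumed to be a multiple of 4.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a SIMD USEFUL_FUNC: it just stores the four phases.
static inline void useful_func_x4(__m128i phases, uint32_t* out) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), phases);
}

// Scalar reference with the same order of operations as the loop in the question.
void phases_scalar(uint32_t phase, uint32_t inc, uint32_t step,
                   uint32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = phase;
        inc += step;
        phase += inc;
    }
}

// SSE2 sketch. Assumption: n is a multiple of 4.
void phases_sse2(uint32_t phase, uint32_t inc, uint32_t step,
                 uint32_t* out, std::size_t n) {
    uint32_t p[4], d[4];
    for (uint32_t j = 0; j < 4; ++j) {
        // Closed form for the phase at step j (unsigned arithmetic wraps mod 2^32).
        p[j] = phase + j * inc + step * (j * (j + 1) / 2);
        // Distance from step j to step j + 4.
        d[j] = 4 * inc + step * (4 * j + 10);
    }
    __m128i vp = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    __m128i vd = _mm_loadu_si128(reinterpret_cast<const __m128i*>(d));
    // The four-step distance itself grows by 16 * step per iteration.
    const __m128i vdd = _mm_set1_epi32(static_cast<int>(16u * step));

    for (std::size_t i = 0; i < n; i += 4) {
        useful_func_x4(vp, out + i);
        vp = _mm_add_epi32(vp, vd);
        vd = _mm_add_epi32(vd, vdd);
    }
}
```

Because every operation is unsigned 32-bit addition, the vector version wraps exactly as the scalar loop does, so the two should produce identical sequences.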

left shift of 128 bit number using AVX2 instruction

我的梦境 submitted on 2021-02-08 07:21:22
Question: I am trying to do a left rotation of a 128-bit number in AVX2. Since there is no direct method of doing this, I have tried using a left shift and a right shift to accomplish my task. Here is a snippet of my code:

    l = 4;
    r = 4;
    targetrotate = _mm_set_epi64x(l, r);
    targetleftrotate = _mm_sllv_epi64(target, targetrotate);

The above code snippet is meant to rotate target by 4 to the left. When I tested the above code with a sample input, I could see that the result is not rotated correctly. Here is
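
For context, `_mm_sllv_epi64` shifts each 64-bit lane independently, so bits shifted out of one lane never reach the other. A 128-bit rotate therefore has to carry the top bits of each half into the other half explicitly. The following is a minimal sketch of that idea using only SSE2, assuming a rotate count strictly between 0 and 64; `rotl128_small` and `rotl128_u64` are hypothetical helper names, not part of any library.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Rotate the 128-bit value held in x left by n bits. Assumption: 0 < n < 64.
static inline __m128i rotl128_small(__m128i x, int n) {
    // Swap the two 64-bit halves so each lane can see the other lane's bits.
    const __m128i swapped = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
    // Per-lane shift left by n.
    const __m128i left = _mm_sll_epi64(x, _mm_cvtsi32_si128(n));
    // The top n bits of the other half become the low n bits of this half.
    const __m128i carry = _mm_srl_epi64(swapped, _mm_cvtsi32_si128(64 - n));
    return _mm_or_si128(left, carry);
}

// Convenience wrapper: in[0] is the low 64 bits, in[1] the high 64 bits.
void rotl128_u64(const uint64_t in[2], uint64_t out[2], int n) {
    const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), rotl128_small(x, n));
}
```

Counts of 64 or more could be handled by swapping the halves first and rotating by the remainder; that case is left out here to keep the sketch short.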

How to make premultiplied alpha function faster using SIMD instructions?

↘锁芯ラ submitted on 2021-02-07 06:38:12
Question: I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (the original alpha channel is kept).

    for (int i = 0, max = width * height * 4; i < max; i += 4) {
        data[i]     = static_cast<uint16_t>(data[i]     * data[i+3]) / 255;
        data[i + 1] = static_cast<uint16_t>(data[i + 1] * data[i+3]) / 255;
        data[i + 2] = static_cast<uint16_t>(data[i + 2] * data[i+3]) / 255;
    }

You will find below my current implementation but I think it could be much
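
As an illustration only (the asker's own implementation is cut off above), here is one way the scalar loop could be expressed with SSE2. It assumes `data` is an interleaved RGBA buffer of `uint8_t`, widens bytes to 16 bits, broadcasts each pixel's alpha across its channels, and replaces the division with the exact identity floor(x / 255) = (x + 1 + (x >> 8)) >> 8, which holds for 0 <= x <= 65025. The names `div255_epu16`, `premultiply_2px` and `premultiply_rgba` are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// floor(x / 255) in eight unsigned 16-bit lanes, exact for x <= 65025.
static inline __m128i div255_epu16(__m128i x) {
    __m128i t = _mm_add_epi16(x, _mm_srli_epi16(x, 8));
    t = _mm_add_epi16(t, _mm_set1_epi16(1));
    return _mm_srli_epi16(t, 8);
}

// px16 holds two pixels as 16-bit words: R0 G0 B0 A0 R1 G1 B1 A1.
static inline __m128i premultiply_2px(__m128i px16) {
    // Broadcast each pixel's alpha to all four of its word slots.
    __m128i a = _mm_shufflelo_epi16(px16, _MM_SHUFFLE(3, 3, 3, 3));
    a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 3, 3, 3));
    // Force the multiplier in the alpha slots to 255 so alpha passes through.
    const __m128i alpha_slots = _mm_set_epi16(0xFF, 0, 0, 0, 0xFF, 0, 0, 0);
    a = _mm_or_si128(a, alpha_slots);
    return div255_epu16(_mm_mullo_epi16(px16, a));
}

// Premultiply pixel_count RGBA pixels in place (4 bytes per pixel).
void premultiply_rgba(uint8_t* data, std::size_t pixel_count) {
    const __m128i zero = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= pixel_count; i += 4) {
        __m128i* ptr = reinterpret_cast<__m128i*>(data + 4 * i);
        const __m128i px = _mm_loadu_si128(ptr);
        const __m128i lo = premultiply_2px(_mm_unpacklo_epi8(px, zero));
        const __m128i hi = premultiply_2px(_mm_unpackhi_epi8(px, zero));
        _mm_storeu_si128(ptr, _mm_packus_epi16(lo, hi));
    }
    // Scalar tail for the last 0 to 3 pixels.
    for (; i < pixel_count; ++i) {
        uint8_t* p = data + 4 * i;
        for (int c = 0; c < 3; ++c) {
            p[c] = static_cast<uint8_t>((p[c] * p[3]) / 255);
        }
    }
}
```

The results are truncated toward zero, matching the integer division in the scalar loop. An AVX2 version would follow the same structure with 256-bit registers, processing eight pixels per iteration.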

AVX2 integer multiply of signed 8-bit elements, producing signed 16-bit results?

隐身守侯 submitted on 2021-02-07 03:44:08
Question: I have two __m256i vectors, each filled with 32 8-bit integers. Something like this:

    __int8 *a0 = new __int8[32] {2};
    __int8 *a1 = new __int8[32] {3};
    __m256i v0 = _mm256_loadu_si256((__m256i*)a0);
    __m256i v1 = _mm256_loadu_si256((__m256i*)a1);

How can I multiply these vectors, using something like _mm256_mul_epi8(v0, v1) (which does not exist) or any other way? I want 2 vectors of results, because the output element width is twice the input element width. Or something that works similarly to
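
One way to get the two result vectors is to sign-extend each 128-bit half to sixteen 16-bit elements and multiply with `_mm256_mullo_epi16`. The low 16 bits are the full product here because |a * b| is at most 128 * 128 = 16384. The sketch below is an illustration under assumptions: `mul_i8_to_i16` is a hypothetical name, `int8_t` stands in for `__int8`, and the `AVX2_FN` macro uses the GCC/Clang `target` attribute so the file builds without `-mavx2` (it is not needed when compiling with that flag, and the function must only be called on a CPU that supports AVX2).

```cpp
#include <immintrin.h>  // AVX2
#include <cstdint>

// Assumption: GCC or Clang. With -mavx2 (or on MSVC with /arch:AVX2) this can be empty.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Multiplies 32 signed bytes element-wise and writes 32 signed 16-bit
// products to out, in the same element order as the inputs.
AVX2_FN void mul_i8_to_i16(const int8_t* a, const int8_t* b, int16_t* out) {
    const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));

    // Sign-extend the low and high 16 bytes of each input to 16-bit lanes.
    const __m256i a_lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(va));
    const __m256i a_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(va, 1));
    const __m256i b_lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(vb));
    const __m256i b_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vb, 1));

    // Each product fits in 16 bits, so the low half of the product is exact.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                        _mm256_mullo_epi16(a_lo, b_lo));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + 16),
                        _mm256_mullo_epi16(a_hi, b_hi));
}
```

Note that `new __int8[32] {2}` in the question sets only the first element to 2 and zero-initializes the rest, so test inputs are better filled explicitly.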

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

倾然丶 夕夏残阳落幕 submitted on 2021-02-05 11:51:07
Question: I'm trying to pack 16-bit data to 8 bits by using _mm256_shuffle_epi8, but the result I get is not what I'm expecting.

    auto srcData = _mm256_setr_epi8(
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32);
    __m256i vperm = _mm256_setr_epi8(
        0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    auto result = _mm256_shuffle_epi8(srcData, vperm);

I'm expecting
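
For context, `_mm256_shuffle_epi8` shuffles within each 128-bit lane separately: only the low 4 bits of each index select a byte (from the same lane), and an index with its high bit set produces zero. Indices 16 to 30 in the low lane therefore wrap around to 0 to 14, and the upper source lane is never read by the low half. Assuming the goal is to gather the low byte of every 16-bit element into the low 128 bits, one common pattern is an in-lane shuffle followed by a cross-lane `_mm256_permute4x64_epi64`. This is a sketch under that assumption: `pack_even_bytes` is a hypothetical name, and `AVX2_FN` uses the GCC/Clang `target` attribute so the file builds without `-mavx2`.

```cpp
#include <immintrin.h>  // AVX2
#include <cstdint>

// Assumption: GCC or Clang. With -mavx2 (or on MSVC with /arch:AVX2) this can be empty.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Reads 32 bytes from src and writes the 16 even-indexed bytes
// (the low byte of each little-endian 16-bit element) to dst.
AVX2_FN void pack_even_bytes(const uint8_t* src, uint8_t* dst) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));

    // Same index pattern in both lanes: indices are relative to their own lane.
    const __m256i idx = _mm256_setr_epi8(
        0, 2, 4, 6, 8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 2, 4, 6, 8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1);
    const __m256i in_lane = _mm256_shuffle_epi8(v, idx);

    // in_lane as 64-bit quads is [lo_packed, 0, hi_packed, 0];
    // reorder to [lo_packed, hi_packed, 0, 0].
    const __m256i packed =
        _mm256_permute4x64_epi64(in_lane, _MM_SHUFFLE(3, 1, 2, 0));

    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),
                     _mm256_castsi256_si128(packed));
}
```

With the `srcData` values from the question (1 through 32) stored in memory, `dst` would receive the odd values 1, 3, 5, and so on up to 31.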
