simd | 易学教程

SIMD optimization of a curve computed from the second derivative

Question: This question is really a curiosity. I was converting a routine into SIMD instructions (and I am quite new to SIMD programming), and had trouble with the following bit of code:

    // args:
    uint32_t phase_current;
    uint32_t phase_increment;
    uint32_t phase_increment_step;

    for (int i = 0; i < blockSize; ++i) {
        USEFUL_FUNC(phase_current);
        phase_increment += phase_increment_step;
        phase_current += phase_increment;
    }

The Question: Assuming that USEFUL_FUNC has a SIMD implementation and I am simply
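One way to remove the loop-carried dependency in the code above is to note that the phase after k iterations has a closed form: phase_current + k * phase_increment + k(k+1)/2 * phase_increment_step, with the usual uint32_t wraparound. The following is a minimal scalar sketch of that idea, not the asker's or any answer's implementation. The function names and the lane count of 4 are assumptions chosen for illustration; each inner-loop iteration is independent, so it corresponds to one SIMD lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: the loop from the question, recording each phase value
   that would be handed to USEFUL_FUNC. */
static void phases_scalar(uint32_t phase, uint32_t inc, uint32_t step,
                          uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = phase;
        inc += step;
        phase += inc;
    }
}

#define LANES 4u /* assumed vector width, e.g. four 32-bit lanes */

/* Blocked version: lane k of a block is computed directly from the block's
   starting phase and increment, so the lanes do not depend on each other. */
static void phases_blocked(uint32_t phase, uint32_t inc, uint32_t step,
                           uint32_t *out, size_t n)
{
    assert(n % LANES == 0); /* sketch only: no tail handling */
    for (size_t i = 0; i < n; i += LANES) {
        for (uint32_t k = 0; k < LANES; ++k) {
            uint32_t tri = k * (k + 1u) / 2u;          /* 0, 1, 3, 6 */
            out[i + k] = phase + k * inc + tri * step; /* wraps mod 2^32 */
        }
        /* Advance the block start by LANES iterations of the original loop. */
        phase += LANES * inc + (LANES * (LANES + 1u) / 2u) * step;
        inc += LANES * step;
    }
}
```

In a vectorized version the per-lane constants k and k(k+1)/2 would be loaded once, leaving two multiplies and two adds per block before the SIMD USEFUL_FUNC is applied.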

left shift of 128 bit number using AVX2 instruction

Question: I am trying to do a left rotation of a 128-bit number in AVX2. Since there is no direct method of doing this, I have tried using left shift and right shift to accomplish my task. Here is a snippet of my code to do the same.

    l = 4; r = 4;
    targetrotate = _mm_set_epi64x (l, r);
    targetleftrotate = _mm_sllv_epi64 (target, targetrotate);

The above code snippet rotates target by 4 to the left. When I tested the above code with a sample input, I could see the result is not rotated correctly. Here is
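For context, `_mm_sllv_epi64` shifts each 64-bit element independently, so bits shifted out of the low half never reach the high half and nothing wraps around. The scalar sketch below shows what a true 128-bit left rotation has to compute. The `u128` struct and `rotl128` name are hypothetical, introduced only for this illustration, and the rotation count is restricted to 1..63 to keep the shifts well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value split into two 64-bit halves. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} u128;

/* Rotate a 128-bit value left by n bits, 0 < n < 64.
   Each half keeps its own shifted bits and receives the top n bits
   of the other half; that cross-half transfer is the part a per-lane
   shift alone cannot provide. */
static u128 rotl128(u128 x, unsigned n)
{
    assert(n > 0 && n < 64);
    u128 r;
    r.hi = (x.hi << n) | (x.lo >> (64 - n));
    r.lo = (x.lo << n) | (x.hi >> (64 - n));
    return r;
}
```

A vector version would plausibly follow the same shape: one left shift of both halves, one right shift by 64 - n, a swap of the two 64-bit halves of the right-shifted value, and an OR of the two results.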

How to make premultiplied alpha function faster using SIMD instructions?

Question: I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (we keep the original alpha channel).

    for (int i = 0, max = width * height * 4; i < max; i+=4) {
        data[i] = static_cast<uint16_t>(data[i] * data[i+3]) / 255;
        data[i+1] = static_cast<uint16_t>(data[i+1] * data[i+3]) / 255;
        data[i+2] = static_cast<uint16_t>(data[i+2] * data[i+3]) / 255;
    }

You will find below my current implementation but I think it could be much
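The division by 255 is usually the awkward part to vectorize, because there is no packed integer divide. A common workaround is the identity floor(x / 255) = (x + 1 + (x >> 8)) >> 8, which holds for every product of two 8-bit values and uses only adds and shifts. The sketch below is a scalar illustration of that identity, not the asker's implementation; `div255` and `premultiply_rgba` are names assumed here, and the pixel layout (alpha in the fourth byte) follows the loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Truncating division by 255 for x in [0, 255 * 255],
   using only an add and two shifts. */
static inline uint8_t div255(uint16_t x)
{
    assert(x <= 255u * 255u);
    return (uint8_t)((x + 1u + (x >> 8)) >> 8);
}

/* Premultiply R, G and B by alpha in place; alpha itself is unchanged.
   data holds pixel_count pixels of 4 bytes each, alpha last. */
static void premultiply_rgba(uint8_t *data, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count * 4; i += 4) {
        uint8_t a = data[i + 3];
        data[i]     = div255((uint16_t)(data[i]     * a));
        data[i + 1] = div255((uint16_t)(data[i + 1] * a));
        data[i + 2] = div255((uint16_t)(data[i + 2] * a));
    }
}
```

Because the same add-and-shift sequence applies to every 16-bit product, it maps onto packed 16-bit add and shift instructions once the bytes have been widened.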

AVX2 integer multiply of signed 8-bit elements, producing signed 16-bit results?

Question: I have two __m256i vectors, each filled with 32 8-bit integers. Something like this:

    __int8 *a0 = new __int8[32] {2};
    __int8 *a1 = new __int8[32] {3};
    __m256i v0 = _mm256_loadu_si256((__m256i*)a0);
    __m256i v1 = _mm256_loadu_si256((__m256i*)a1);

How can I multiply these vectors, using something like _mm256_mul_epi8(v0, v1) (which does not exist) or any other way? I want 2 vectors of results, because the output element width is twice the input element width. Or something that works similarly to

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

Question: I'm trying to pack 16-bit data to 8 bits by using _mm256_shuffle_epi8, but the result I have is not what I'm expecting.

    auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                                    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32);
    __m256i vperm = _mm256_setr_epi8( 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30,
                                     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    auto result = _mm256_shuffle_epi8(srcData, vperm);

I'm expecting
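The behaviour that tends to surprise here is that `_mm256_shuffle_epi8` shuffles within each 128-bit lane separately: only the low 4 bits of each control byte select a source byte, that byte always comes from the same lane as the destination, and a control byte with its top bit set produces zero. The following scalar model of those rules is a sketch for reasoning about results; the function name is an assumption and it is not a replacement for the intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 256-bit byte shuffle that operates per 128-bit lane. */
static void shuffle_epi8_256_model(const uint8_t src[32],
                                   const int8_t ctrl[32],
                                   uint8_t dst[32])
{
    assert(src != dst); /* the model does not support in-place operation */
    for (int i = 0; i < 32; ++i) {
        int lane_base = i & 16;  /* 0 for bytes 0..15, 16 for bytes 16..31 */
        if (ctrl[i] < 0) {
            dst[i] = 0;          /* top bit of control byte set: zero */
        } else {
            dst[i] = src[lane_base + (ctrl[i] & 0x0F)]; /* never crosses lanes */
        }
    }
}
```

Applied to the srcData and vperm values from the question, the model gives 1, 3, 5, ..., 15 in bytes 0..7, the same eight values again in bytes 8..15 (indices 16..30 wrap to 0..14 within the low lane), and zeros in the upper lane.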
