I would like to make some vector computation faster, and I believe that SIMD instructions for float comparison and manipulation could help. Here is the operation:
Here you go (untested). I’ve tried to explain in the comments what each step does.
#include <assert.h>    // assert
#include <stddef.h>    // size_t
#include <smmintrin.h> // SSE 4.1 intrinsics, pulls in the SSE2 ones as well

void func_sse41( const double* left, const double* right, double* res,
    const size_t size, double th, double drop )
{
    // Verify the size is even.
    // If it's not, you'll need extra code at the end to process the last value the old way.
    assert( 0 == ( size % 2 ) );
    // Load scalar values into 2 registers.
    const __m128d threshold = _mm_set1_pd( th );
    const __m128d dropVec = _mm_set1_pd( drop );
    for( size_t i = 0; i < size; i += 2 )
    {
        // Load 4 double values into registers, 2 from right, 2 from left
        const __m128d r = _mm_loadu_pd( right + i );
        const __m128d l = _mm_loadu_pd( left + i );
        // Compare ( r >= threshold ) for 2 values at once
        const __m128d comp = _mm_cmpge_pd( r, threshold );
        // Compute ( left[ i ] - drop ), for 2 values at once
        const __m128d dropped = _mm_sub_pd( l, dropVec );
        // Select either left or ( left - drop ) based on the comparison.
        // This is the only instruction here that requires SSE 4.1.
        const __m128d result = _mm_blendv_pd( l, dropped, comp );
        // Store the 2 result values at the matching position in the output
        _mm_storeu_pd( res + i, result );
    }
}
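The assert rejects odd sizes. One way to lift that restriction, as the comment suggests, is to run the vector loop over the even prefix and process the last value with scalar code. Below is a minimal sketch, not a tested implementation. It assumes the scalar operation is res[ i ] = right[ i ] >= th ? left[ i ] - drop : left[ i ], which is inferred from the comments rather than confirmed by the question. The names func_sse41_any_size and TARGET_SSE41 are made up for illustration; the macro is only there so GCC and Clang accept the SSE 4.1 intrinsic without -msse4.1.

```c
#include <assert.h>
#include <stddef.h>
#include <smmintrin.h> // SSE 4.1

// Hypothetical helper macro: compile just this function for SSE 4.1 on GCC/Clang
#if defined( __GNUC__ ) || defined( __clang__ )
#define TARGET_SSE41 __attribute__( ( target( "sse4.1" ) ) )
#else
#define TARGET_SSE41
#endif

TARGET_SSE41
void func_sse41_any_size( const double* left, const double* right, double* res,
    size_t size, double th, double drop )
{
    const __m128d threshold = _mm_set1_pd( th );
    const __m128d dropVec = _mm_set1_pd( drop );

    // Largest even count that does not exceed size
    const size_t sizeEven = size & ~(size_t)1;
    size_t i = 0;
    for( ; i < sizeEven; i += 2 )
    {
        const __m128d r = _mm_loadu_pd( right + i );
        const __m128d l = _mm_loadu_pd( left + i );
        const __m128d comp = _mm_cmpge_pd( r, threshold );
        const __m128d dropped = _mm_sub_pd( l, dropVec );
        _mm_storeu_pd( res + i, _mm_blendv_pd( l, dropped, comp ) );
    }

    // At most one value is left, process it the old way.
    // Assumed scalar operation, inferred from the comments in the answer.
    if( i < size )
        res[ i ] = ( right[ i ] >= th ) ? ( left[ i ] - drop ) : left[ i ];
}
```

The call looks the same as for func_sse41, only the size no longer needs to be even.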
The code will crash with an “invalid instruction” runtime error if the CPU doesn’t support SSE 4.1. For best results, check CPUID at startup and fail gracefully. I think that now, in 2019, it’s quite reasonable to assume support: Intel added it in 2008, AMD in 2011, and the Steam survey says “96.3%”. If you want to support older CPUs, it’s possible to emulate _mm_blendv_pd with 3 other instructions: _mm_and_pd, _mm_andnot_pd and _mm_or_pd.
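Both suggestions can be sketched roughly as follows. The helper names cpu_has_sse41 and blendv_sse2 are made up for illustration. The detection uses __builtin_cpu_supports on GCC/Clang and __cpuid on MSVC, and the select assumes every lane of the mask is either all ones or all zeros, which is what _mm_cmpge_pd produces. Treat it as a sketch under those assumptions.

```c
#include <assert.h>
#include <emmintrin.h> // SSE2 only
#if defined( _MSC_VER )
#include <intrin.h>    // __cpuid
#endif

// Hypothetical helper: returns 1 when the CPU reports SSE 4.1, 0 otherwise.
static int cpu_has_sse41( void )
{
#if defined( _MSC_VER )
    int info[ 4 ];
    __cpuid( info, 1 );
    // CPUID leaf 1: SSE 4.1 is bit 19 of ECX, which __cpuid stores in info[ 2 ]
    return ( info[ 2 ] >> 19 ) & 1;
#elif defined( __GNUC__ ) || defined( __clang__ )
    return __builtin_cpu_supports( "sse4.1" ) ? 1 : 0;
#else
    return 0; // unknown compiler, assume the worst
#endif
}

// Hypothetical SSE2 replacement for _mm_blendv_pd( a, b, mask ).
// Takes lanes from b where mask is set and from a where it is clear.
static __m128d blendv_sse2( __m128d a, __m128d b, __m128d mask )
{
    const __m128d fromB = _mm_and_pd( mask, b );    // mask & b
    const __m128d fromA = _mm_andnot_pd( mask, a ); // ~mask & a
    return _mm_or_pd( fromA, fromB );
}
```

With the SSE2 select in place of _mm_blendv_pd the loop needs nothing beyond SSE2, which every x86-64 CPU has, so the runtime check only matters if you keep the SSE 4.1 path as well.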
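If you can guarantee the data is aligned on 16 bytes, replacing the loads with _mm_load_pd will be slightly faster. _mm_cmpge_pd compiles into CMPPD https://www.felixcloutier.com/x86/cmppd which can take one of its arguments directly from RAM, so with aligned data the compiler can fold that load into the comparison instead of issuing a separate instruction.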
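An aligned variant could look roughly like this. It is a sketch: func_sse2_aligned is a made-up name, and it uses the and/andnot/or select instead of _mm_blendv_pd only to keep the example within SSE2. The extra asserts are an assumption about how you would want to catch misaligned pointers in debug builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>    // uintptr_t
#include <emmintrin.h> // SSE2

// Hypothetical aligned variant.
// Requires an even size, and left, right and res on 16-byte boundaries.
void func_sse2_aligned( const double* left, const double* right, double* res,
    size_t size, double th, double drop )
{
    assert( 0 == ( size % 2 ) );
    // _mm_load_pd and _mm_store_pd fault on addresses that are not 16-byte aligned
    assert( 0 == ( (uintptr_t)left % 16 ) );
    assert( 0 == ( (uintptr_t)right % 16 ) );
    assert( 0 == ( (uintptr_t)res % 16 ) );

    const __m128d threshold = _mm_set1_pd( th );
    const __m128d dropVec = _mm_set1_pd( drop );
    for( size_t i = 0; i < size; i += 2 )
    {
        const __m128d l = _mm_load_pd( left + i );
        // The compiler is free to fold this aligned load into the compare instruction
        const __m128d comp = _mm_cmpge_pd( _mm_load_pd( right + i ), threshold );
        const __m128d dropped = _mm_sub_pd( l, dropVec );
        // SSE2 select: ( comp & dropped ) | ( ~comp & l )
        const __m128d result = _mm_or_pd( _mm_and_pd( comp, dropped ), _mm_andnot_pd( comp, l ) );
        _mm_store_pd( res + i, result );
    }
}
```

Whether the compiler actually folds the load depends on the compiler and its flags, so check the disassembly if it matters.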
Potentially, you can get a further 2x improvement by writing an AVX version. But I hope even the SSE version is faster than your code: it handles 2 values per iteration and doesn’t have conditions inside the loop. If you’re unlucky, AVX will be slower. Many CPUs need some time to power on their AVX units, which takes many thousands of cycles. Until the units are powered, AVX code runs very slowly.
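For reference, an AVX version might look like the sketch below, also untested on real workloads. The names func_avx and TARGET_AVX are made up, and the macro only exists so GCC and Clang accept the AVX intrinsics without -mavx. The scalar tail assumes the operation is res[ i ] = right[ i ] >= th ? left[ i ] - drop : left[ i ], inferred from the comments in the SSE code.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h> // AVX

// Hypothetical helper macro: compile just this function for AVX on GCC/Clang
#if defined( __GNUC__ ) || defined( __clang__ )
#define TARGET_AVX __attribute__( ( target( "avx" ) ) )
#else
#define TARGET_AVX
#endif

// Hypothetical AVX variant: 4 doubles per iteration, scalar code for the remainder.
// Crashes on CPUs without AVX, same as the SSE 4.1 version does without SSE 4.1.
TARGET_AVX
void func_avx( const double* left, const double* right, double* res,
    size_t size, double th, double drop )
{
    const __m256d threshold = _mm256_set1_pd( th );
    const __m256d dropVec = _mm256_set1_pd( drop );

    // Largest multiple of 4 that does not exceed size
    const size_t sizeVec = size & ~(size_t)3;
    size_t i = 0;
    for( ; i < sizeVec; i += 4 )
    {
        const __m256d r = _mm256_loadu_pd( right + i );
        const __m256d l = _mm256_loadu_pd( left + i );
        // _CMP_GE_OS is the ordered, signaling predicate, same as _mm_cmpge_pd
        const __m256d comp = _mm256_cmp_pd( r, threshold, _CMP_GE_OS );
        const __m256d dropped = _mm256_sub_pd( l, dropVec );
        _mm256_storeu_pd( res + i, _mm256_blendv_pd( l, dropped, comp ) );
    }

    // Up to 3 values are left. Assumed scalar operation, inferred from the answer's comments.
    for( ; i < size; i++ )
        res[ i ] = ( right[ i ] >= th ) ? ( left[ i ] - drop ) : left[ i ];
}
```

Because of the warm-up cost, it’s worth benchmarking this against the SSE version on your real array sizes before committing to it.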