I have a short to float cast in C++ that is bottlenecking my code.
The code translates from a hardware device buffer which is natively shorts, this represents the in
You can use OpenMP to put every core of your CPU to work, and it is simple. Just do the following:
#include <omp.h>
float factor = 1.0f / value;
#pragma omp parallel for
for (int i = 0; i < W*H; i++) // 25% of time is spent doing this
{
    int value = source[i];           // ushort -> int
    destination[i] = value * factor; // int * float -> float
}
Here is the result based on the previous program; just add the pragma like this:
#pragma omp parallel for
for (int it = 0; it < iterations; it++){
...
}
And here is the result:
beta@beta-PC ~
$ g++ -o opt.exe opt.c -msse4.1 -fopenmp
beta@beta-PC ~
$ opt
0.748
2.90873e+007
0.484
2.90873e+007
0.796
2.90873e+007
beta@beta-PC ~
$ g++ -o opt.exe opt.c -msse4.1 -O3
beta@beta-PC ~
$ opt
1.404
2.90873e+007
1.404
2.90873e+007
1.404
2.90873e+007
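The results show about a 100% improvement with OpenMP (roughly 2x faster: 0.484 to 0.796 versus 1.404 in the timings above). Visual C++ supports OpenMP too, via the /openmp compiler option.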