Fast percentile in C++ | 易学教程

问题

My program calculates a Monte Carlo simulation for the value-at-risk metric. To simplify as much as possible, I have:

1/ simulated daily cashflows
2/ to get a sample of a possible 1-year cashflow, 
   I need to draw 365 random daily cashflows and sum them

Hence, the daily cashflows are an empirically given distrobution function to be sampled 365 times. For this, I

 1/ sort the daily cashflows into an array called *this->distro*
 2/ calculate 365 percentiles corresponding to random probabilities

I need to do this simulation of a yearly cashflow, say, 10K times to get a population of simulated yearly cashflows to work with. Having the distribution function of daily cashflows prepared, I do the sampling like...

for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
    {
        prob = (FLT_TYPE)fastrand();         // prob [0,1]
        dIdx = prob * dMaxDistroIndex;       // scale prob to distro function size
                                             // to get an index into distro array
        _floor = ((FLT_TYPE)(long)dIdx);     // fast version of floor
        _ceil  = _floor + 1.0f;              // 'fast' ceil:)
        iIdx1  = (unsigned int)( _floor );
        iIdx2  = iIdx1 + 1;

        // interpolation per se
        generatedVal += this->distro[iIdx1]*(_ceil - dIdx  );
        generatedVal += this->distro[iIdx2]*(dIdx  - _floor);
    }
    this->yearlyCashflows[idxSim] = generatedVal ;
}

The code inside of both for cycles does linear interpolation. If, say USD 1000 corresponds to prob=0.01, USD 10000 corresponds to prob=0.1 then if I don't have an empipirical number for p=0.05 I want to get USD 5000 by interpolation.

The question: this code runs correctly, though the profiler says that the program spends cca 60% of its runtime on the interpolation per se. So my question is, how can I make this task faster? Sample runtimes reported by VTune for specific lines are as follows:

prob = (FLT_TYPE)fastrand();         //  0.727s
dIdx = prob * dMaxDistroIndex;       //  1.435s
_floor = ((FLT_TYPE)(long)dIdx);     //  0.718s
_ceil  = _floor + 1.0f;              //    -

iIdx1  = (unsigned int)( _floor );   // 4.949s
iIdx2  = iIdx1 + 1;                  //    -

// interpolation per se
generatedVal += this->distro[iIdx1]*(_ceil - dIdx  );  //    -
generatedVal += this->distro[iIdx2]*(dIdx  - _floor);  // 12.704s

Dashes mean the profiler reports no runtimes for those lines.

Any hint will be greatly appreciated. Daniel

EDIT: Both c.fogelklou and MSalters have pointed out great enhancements. The best code in line with what c.fogelklou said is

converter = distroDimension / (FLT_TYPE)(RAND_MAX + 1)
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
    {
        dIdx  = (FLT_TYPE)fastrand() * converter;
        iIdx1 = (unsigned long)dIdx);
        _floor = (FLT_TYPE)iIdx1;
        generatedVal += this->distro[iIdx1] + this->diffs[iIdx1] *(dIdx  - _floor);
    }
}

and the best I have along MSalter's lines is

normalizer = 1.0/(FLT_TYPE)(RAND_MAX + 1);
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
    {
        dIdx  = (FLT_TYPE)fastrand()* normalizer ;
        iIdx1 = fastrand() % _g.xDayCount;
        generatedVal += this->distro[iIdx1];
        generatedVal += this->diffs[iIdx1]*dIdx;
    }
}

The second code is approx. 30 percent faster. Now, of 95s of total runtime, the last line consumes 68s. The last but one line consumes only 3.2s hence the double*double multiplication must be the devil. I thought of SSE - saving the last three operands into an array and then carry out a vector multiplication of this->diffs[i]*dIdx[i] and add this to this->distro[i] but this code ran 50 percent slower. Hence, I think I hit the wall.

Many thanks to all. D.

回答1:

This is a proposal for a small optimization, removing the need for ceil, two casts, and one of the multiplies. If you are running on a fixed point processor, that would explain why the multiplies and casts between float and int are taking so long. In that case, try using fixed point optimizations or turning on floating point in your compiler if the CPU supports it!

for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
    {
        prob = (FLT_TYPE)fastrand();         // prob [0,1]
        dIdx = prob * dMaxDistroIndex;       // scale prob to distro function size
                                             // to get an index into distro array
        iIdx1  = (long)dIdx;
        _floor = (FLT_TYPE)iIdx1;     // fast version of floor
        iIdx2  = iIdx1 + 1;

        // interpolation per se
        {
           const FLT_TYPE diff = this->distro[iIdx2] - this->distro[iIdx1];
           const FLT_TYPE interp = this->distro[iIdx1] + diff * (dIdx - _floor);
           generatedVal += interp;
        }
    }
    this->yearlyCashflows[idxSim] = generatedVal ;
}

回答2:

I would recommend to fix fastrand. Floating-point code isn't the fastest in the world, but what is especially slow is the switching between floating point and integer code. Since you need an integer index, use an integer random function.

It may even be advantageous to pre-generate all 365 random values in a loop. Since you need only log2(dMaxDistroIndex) bits of randomness per value, you may be able to reduce the number of RNG calls.

You would subsequently pick a random number between 0 and 1 for the interpolation fraction.

来源：https://stackoverflow.com/questions/14890079/fast-percentile-in-c

标签

c++

performance

percentile