Caching in a high-performance financial application

Question

I am writing an application whose purpose is to optimize a trading strategy. For the sake of simplicity, assume we have one trading strategy that says "enter here" and another that says "exit here if in a trade". Let's also have two models: one says how much risk we should take (how much we lose if we're on the wrong side of the market), and the other says how much profit we should take (i.e. how much we gain if the market agrees).

For simplicity's sake, I will refer to historical realized trades as ticks. If I "enter on tick 28", this means I would have entered a trade at the time of the 28th trade in my dataset, at the price of that trade. Ticks are stored chronologically in my dataset.

Now, imagine the entry strategy on the whole dataset comes up with 500 entries. For each entry, I can precalculate the exact entry tick. I can also calculate the exit points determined by the exit strategy for each entry point (again as tick numbers). For each entry, I can also precalculate the modeled loss and profit and the ticks where these losses or profits would have been hit. The last thing that remains is to calculate what would have happened first, i.e. exit on strategy, exit on a loss or exit on a profit.

Hence, I iterate through the array of trades and calculate exitTick[i] = min(exitTickByStrat[i], exitTickByLoss[i], exitTickByProfit[i]). And the whole process is bloody slow (let's say I do this 100M times). I suspect cache misses are the main culprit. And the question is: can this be made faster somehow? I have to iterate through 4 arrays of some non-trivial length. One suggestion I have come up with would be to group the data in tuples of four, i.e. have one array of structures like (entryTick, exitOnStrat, exitOnLoss, exitOnProfit). This might be faster due to better cache locality, but I cannot say for sure. I haven't tested it so far because instrumenting profilers somehow don't work for release binaries of my app, while sampling profilers seem unreliable to me (I have tried Intel's profiler).
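The tuple-of-four idea can be sketched as follows. This is a minimal illustration rather than the asker's actual code: the struct name TradeExits, both function names, and the choice of 32-bit tick indices are assumptions. It places the array-of-structures layout next to the separate-arrays layout so that both can be timed on real data. Note that with about 500 entries, four arrays of 32-bit indices occupy roughly 8 KB in total, so either layout may well stay cache-resident; only a measurement can tell.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: all precalculated tick indices for one entry,
// stored contiguously so one evaluation touches one small block of memory.
struct TradeExits {
    std::int32_t entryTick;
    std::int32_t exitOnStrat;
    std::int32_t exitOnLoss;
    std::int32_t exitOnProfit;
};

// Array-of-structures variant: a single sequential pass over one array.
inline void evaluateExitsAoS(const std::vector<TradeExits>& trades,
                             std::vector<std::int32_t>& exitTick) {
    exitTick.resize(trades.size());
    for (std::size_t i = 0; i < trades.size(); ++i) {
        const TradeExits& t = trades[i];
        exitTick[i] = std::min(t.exitOnStrat,
                               std::min(t.exitOnLoss, t.exitOnProfit));
    }
}

// Separate-arrays variant (the layout described in the question),
// kept for a like-for-like timing comparison.
inline void evaluateExitsSoA(const std::vector<std::int32_t>& exitOnStrat,
                             const std::vector<std::int32_t>& exitOnLoss,
                             const std::vector<std::int32_t>& exitOnProfit,
                             std::vector<std::int32_t>& exitTick) {
    assert(exitOnStrat.size() == exitOnLoss.size());
    assert(exitOnStrat.size() == exitOnProfit.size());
    exitTick.resize(exitOnStrat.size());
    for (std::size_t i = 0; i < exitOnStrat.size(); ++i) {
        exitTick[i] = std::min(exitOnStrat[i],
                               std::min(exitOnLoss[i], exitOnProfit[i]));
    }
}
```

Both functions produce the same exitTick values, so they can be swapped freely while timing a release build.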

So the final questions are: can this problem be made faster? What is the best profiler to use for memory profiling with release binaries? I work on Win7, VS2010.

Edit: Many thanks to all. I tried to simplify my original question as much as possible, hence the confusion. Just to make sure it's readable - target means an envisaged/realized profit, stop means an envisaged/realized loss.

The optimizer is a brute-force one. So, I have some strategy settings (e.g. indicator periods, whatever), then min/max breakEvenAfter/breakEvenBy, and then formulas that give you stop/target values in ticks. These formulas are also objects of optimization. Hence, I have an optimization structure like:

for each in params
{
    calculateEntries()
    for each in beSettings
    {
        precalculateBeData()
        for each in targetFormulaSettings
        {
            precalculateTargetsAndRespectiveExitTicks()
            for each in stopFormulaSettings
            {
                precalculateStopsAndRespectiveExitTicks()
                evaluateExitsAndDetermineImprovement()
            }
        }
    }
}

So I precalculate stuff as much as possible and only calculate something when I need it. Out of 30 seconds, the calculation spends 25 seconds in the evaluateExitsAndDetermineImprovement() function, which does just what I described in the original question, i.e. picks min(exitOnPattern, exitOnStop, exitOnTarget). The reason I need to call the function 100M times is that I have 100M combinations of all params combined. But within the innermost for loop, only the exitOnStops array changes. I can post some code if that helps. I'm grateful for all the comments!
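Because only the exitOnStops array changes inside the innermost loop, one possible refinement is to fold exitOnPattern and exitOnTarget into a single array once per target setting, so the hot function reads two arrays instead of three. The sketch below is only an illustration under that assumption: TickArray, combinePatternAndTarget and evaluateExitsWithStops are hypothetical names, and the real function also "determines improvement", which is not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::int32_t> TickArray;  // hypothetical alias

// Run once per target-formula setting (second-innermost loop):
// combined[i] = min(exitOnPattern[i], exitOnTarget[i]).
inline void combinePatternAndTarget(const TickArray& exitOnPattern,
                                    const TickArray& exitOnTarget,
                                    TickArray& combined) {
    assert(exitOnPattern.size() == exitOnTarget.size());
    combined.resize(exitOnPattern.size());
    for (std::size_t i = 0; i < exitOnPattern.size(); ++i) {
        combined[i] = std::min(exitOnPattern[i], exitOnTarget[i]);
    }
}

// Run once per stop-formula setting (innermost loop):
// only the stop array is new, so one min per entry remains.
inline void evaluateExitsWithStops(const TickArray& combined,
                                   const TickArray& exitOnStop,
                                   TickArray& exitTick) {
    assert(combined.size() == exitOnStop.size());
    exitTick.resize(combined.size());
    for (std::size_t i = 0; i < combined.size(); ++i) {
        exitTick[i] = std::min(combined[i], exitOnStop[i]);
    }
}
```

The result is identical to taking the three-way minimum each time, since min is associative; whether it saves measurable time depends on how much of the 25 seconds is spent on the minimum itself rather than on the improvement calculation.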

Answer 1:

I don't know much about trading strategies, but I do optimisation work regularly. There are many optimisation methods: choosing a different type of container, using a different min function (I think Boost has a somewhat faster one than the standard library), avoiding repeated calculations, and so on. You can also gain speed by using faster functions or by redesigning your algorithm.

For profiling I use GlowCode under Win7 x64, and it's ok for release builds too.

Answer 2:

Maybe I misunderstand your system completely, but:
what is it that you "pre-calculate" and when and WHY 100M times???

I don't know if it will help you, but it may simplify your system significantly. There are 2 common trading strategies (the descriptions are mine, not official):
1) "fixed point exit" - when the trade happens all exit points are calculated once and they are checked against market conditions/price periodically.
2) "variable point exit" - when the market moves the exit points are recalculated (usually to lock in more profit/reduce loss).

In case 1) the actual calculation happens only once, so it should be VERY fast.
In case 2) the calculations will happen every time, but it can be optimised in many different ways - one of them being that you may store your trades indexed by exit points and only get and re-calculate those close to the actual market situation.
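The "indexed by exit points" idea for case 2) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the ExitIndex alias, the tradesNearPrice function, the use of integer price units as keys, and plain int trade ids. It keeps trades in a std::multimap ordered by exit price and returns only those whose exit level lies within a tolerance of the current market price.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical index: exit price (in integer price units) -> trade id.
// A multimap is used because several trades may share an exit level.
typedef std::multimap<std::int64_t, int> ExitIndex;

// Collect ids of trades whose exit price lies in
// [price - tolerance, price + tolerance], in ascending exit-price order.
inline std::vector<int> tradesNearPrice(const ExitIndex& index,
                                        std::int64_t price,
                                        std::int64_t tolerance) {
    std::vector<int> ids;
    ExitIndex::const_iterator first = index.lower_bound(price - tolerance);
    ExitIndex::const_iterator last = index.upper_bound(price + tolerance);
    for (; first != last; ++first) {
        ids.push_back(first->second);
    }
    return ids;
}
```

Only the returned trades would then need their exit points recalculated; the rest of the book is skipped.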

I am not sure which cache misses you are referring to. Your data cache? The CPU cache?

Answer 3:

So, after some work, I understood the advice by Alexandre C. When I ran cache-miss profiling, I found that out of 15M calls of the evaluateExits() function I have only 30K cache misses (roughly one miss per 500 calls), hence the performance of this function cannot be hindered by the cache. So I had to "start believing" that VTune is actually producing valid results, albeit weird ones. Since the analysis of the VTune output does not match the current thread's title, I decided to start a new thread. Thank you all for your opinions and recommendations.

Source: https://stackoverflow.com/questions/12724887/caching-in-a-high-performance-financial-application

Tags

c++

caching

memory

memory-profiling