I have a C++ snippet below with a run-time for loop:
for(int i = 0; i < I; i++)
for (int j = 0; j < J; j++)
A( row(i,j), column(i,j)
I'm not a fan of template meta-programming, so you may want to take this answer with a pinch of salt. But before I invested any time in this problem, I'd ask myself the following:
Is the for loop a bottleneck? On many compilers/CPUs, the "looped" version can give better performance due to cache effects.
Remember: Measure first, optimise later - if at all.
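To make "measure first" concrete, here is a minimal timing sketch using std::chrono. The helper name average_ns is hypothetical, and a real benchmark would also need warm-up runs and care that the optimiser does not remove the work being timed.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `repeats` times and returns the
// average wall-clock time per run in nanoseconds.
template <typename F>
double average_ns(F&& work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r) {
        work();
    }
    const auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / repeats;
}
```

You would call it once with the plain loops and once with the unrolled version, for example average_ns([&] { fill(A); }, 1000) where fill is whatever you are comparing, and only keep the complicated version if the difference is real.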
If you are willing to modify the syntax a bit you can do something like this:
template <int i, int ubound>
struct OuterFor {
void operator()() {
InnerFor<i, 0, J>()();
OuterFor<i + 1, ubound>()();
}
};
template <int ubound>
struct OuterFor <ubound, ubound> {
void operator()() {
}
};
In InnerFor, i is the outer loop's counter (a compile-time constant) and j is the inner loop's counter (initially 0, also a compile-time constant), so you can evaluate row as a compile-time template.
It's a bit more complicated, but as you say, row(), col(), and f() are your complicated parts anyway. At least try it and see if the performance is worth it. It may also be worth investigating other options to simplify your row(), etc. functions.
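For completeness, here is one way the missing InnerFor could look, as a self-contained sketch. Everything specific in it is an assumption made for illustration: the bounds I and J, the trivial row/column templates, the stand-in f, and the use of a std::array as the matrix A. Your real row, column and f will differ.

```cpp
#include <array>
#include <cassert>

// Hypothetical compile-time bounds.
constexpr int I = 3;
constexpr int J = 4;

// Hypothetical index mappings, evaluated as compile-time templates.
template <int i, int j> struct row    { static constexpr int value = i; };
template <int i, int j> struct column { static constexpr int value = j; };

// Hypothetical stand-in for f(i, j).
constexpr int f(int i, int j) { return 10 * i + j; }

using Matrix = std::array<std::array<int, J>, I>;

// Inner loop: j counts from its start value up to ubound.
template <int i, int j, int ubound>
struct InnerFor {
    void operator()(Matrix& A) const {
        A[row<i, j>::value][column<i, j>::value] = f(i, j);
        InnerFor<i, j + 1, ubound>()(A);
    }
};

// Ends the inner recursion when j reaches ubound.
template <int i, int ubound>
struct InnerFor<i, ubound, ubound> {
    void operator()(Matrix&) const {}
};

// Outer loop: same shape as above, but passing the matrix along.
template <int i, int ubound>
struct OuterFor {
    void operator()(Matrix& A) const {
        InnerFor<i, 0, J>()(A);
        OuterFor<i + 1, ubound>()(A);
    }
};

// Ends the outer recursion when i reaches ubound.
template <int ubound>
struct OuterFor<ubound, ubound> {
    void operator()(Matrix&) const {}
};

// Fills a matrix through the template-unrolled loops.
Matrix fill_unrolled() {
    Matrix A{};
    OuterFor<0, I>()(A);
    return A;
}

// Cross-checks the unrolled version against the plain run-time loops.
void check_unrolled() {
    Matrix expected{};
    for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++)
            expected[i][j] = f(i, j);
    assert(fill_unrolled() == expected);
}
```

Note that this instantiates one InnerFor per (i, j) pair, so compile time and code size grow with I * J.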
You could use Boost MPL.
An example of loop unrolling is on this mpl::for_each page.
for_each< range_c<int,0,10> >( value_printer() );
It doesn't seem that it's all evaluated at compile time, but it may be a good starting point.
I've never tried to do this so take this idea with a grain of salt...
It seems like you could use Boost.Preprocessor to do the loop unrolling (specifically the BOOST_PP_FOR and BOOST_PP_FOR_r macros) and then use templates to generate the actual constant expression.
You could use Boost.Mpl to implement the whole thing at compile-time, but I'm not sure it'll be faster. (Mpl essentially re-implements all the STL algorithms as compile-time metaprogramming templates)
The problem with that approach is that you end up unrolling and inlining a lot of code, which may thrash the instruction cache and eat up memory bandwidth that could have been saved. That may produce huge, bloated and slow code.
I would probably rather trust the compiler to inline the functions where it makes sense. As long as the row and column function definitions are visible from the loop, the compiler can trivially inline the calls and unroll as many iterations as it deems beneficial.
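As a rough sketch of what "visible from the loop" means: the index functions are defined inline in the same translation unit as the loop, and the bounds are compile-time constants. The names row_index, column_index, value_at, kI and kJ and their bodies are made up for illustration. Whether the compiler actually inlines and unrolls depends on the compiler and optimisation flags, so check the generated code or measure.

```cpp
#include <array>
#include <cassert>

// Hypothetical compile-time bounds.
constexpr int kI = 3;
constexpr int kJ = 4;

// Hypothetical index mappings (a transpose), defined where the loop can see them.
inline int row_index(int i, int j)    { (void)i; return j; }
inline int column_index(int i, int j) { (void)j; return i; }

// Hypothetical stand-in for f(i, j).
inline int value_at(int i, int j) { return 10 * i + j; }

using Transposed = std::array<std::array<int, kI>, kJ>;

// Plain run-time loops; the optimiser is free to inline and unroll them.
Transposed fill_looped() {
    Transposed A{};
    for (int i = 0; i < kI; i++) {
        for (int j = 0; j < kJ; j++) {
            const int r = row_index(i, j);
            const int c = column_index(i, j);
            assert(r >= 0 && r < kJ && c >= 0 && c < kI);
            A[r][c] = value_at(i, j);
        }
    }
    return A;
}
```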
I would say this is a false good idea: it sounds attractive but works against you.
In C++, this:
row<i,j>::value
means you will have as many different row<> instantiations as there are (i, j) combinations, i.e. I * J of them. You don't want this, because it increases the size of the code and causes a lot of instruction cache misses.
I observed this when I was doing template functions to avoid a single boolean check.
If it is a short function, just inline it.