Fusing a triangle loop for parallelization, calculating sub-indices

余生分开走 2020-12-10 06:42

A common technique in parallelization is to fuse nested for loops like this

for(int i=0; i

to

3 Answers
  • 2020-12-10 07:12

    Considering that you're trying to fuse a triangle with the intent of parallelizing, the non-obvious solution is to choose a non-trivial mapping of x to (i,j):

    j |\ i ->
      | \             ____
    | |  \    =>    |\\   |
    V |___\         |_\\__|
    

    After all, you're not processing them in any special order, so the exact mapping does not matter.

    So calculate x -> (i, j) as you would for a rectangle, but if i > j, set i = N-i and j = N-j (mirror about the Y axis, then about the X axis).

       ____
     |\\   |      |\           |\
     |_\\__|  ==> |_\  __  =>  | \
                      / |      |  \
                     /__|      |___\
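
A minimal sketch of this kind of mapping in C, under assumptions the answer does not spell out: the triangle is taken to be the pairs with 0 <= j <= i < n, and the rectangle is n/2 rows of width n+1 for even n, or (n+1)/2 rows of width n for odd n. The names `tri_from_fused` and `covers_triangle` are hypothetical, and the exact mirror offsets below are one possible choice rather than the ones the answer had in mind.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: map a fused index x in [0, n*(n+1)/2) to a pair
 * (i, j) with 0 <= j <= i < n. The rectangle shape is an assumption. */
static void tri_from_fused(int n, int x, int *i, int *j)
{
    if (n % 2 == 0) {
        /* n/2 rectangle rows of width n+1: row r holds triangle row r
         * (r+1 cells), then triangle row n-1-r mirrored (n-r cells). */
        int w = n + 1;
        int r = x / w, c = x % w;
        if (c <= r) { *i = r;         *j = c;     }
        else        { *i = n - 1 - r; *j = n - c; }
    } else {
        /* (n+1)/2 rectangle rows of width n: row r holds triangle row r-1
         * (r cells), then triangle row n-1-r mirrored (n-r cells). */
        int r = x / n, c = x % n;
        if (c < r) { *i = r - 1;     *j = c;         }
        else       { *i = n - 1 - r; *j = n - 1 - c; }
    }
}

/* Returns 1 if every (i, j) with j <= i is produced exactly once. */
static int covers_triangle(int n)
{
    int total = n * (n + 1) / 2;
    unsigned char *seen = calloc((size_t)n * n, 1);
    int ok = (seen != NULL);
    for (int x = 0; ok && x < total; x++) {
        int i, j;
        tri_from_fused(n, x, &i, &j);
        if (i < 0 || i >= n || j < 0 || j > i || seen[i * n + j])
            ok = 0;                 /* out of the triangle, or a repeat */
        else
            seen[i * n + j] = 1;
    }
    free(seen);
    return ok;
}
```

Each call derives (i, j) from x alone, with one division, one remainder and one comparison, so fused iterations do not depend on one another. The pairs come out in a different order than in the nested loops, which is acceptable only when the order of processing does not matter.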
    
  • 2020-12-10 07:13

    The most sane form is of course the first form.

    That said, the fused form is better done with conditionals:

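    int i = 0; int j = 0;   // (i, j) walks the triangle row by row, with j <= i
    for(int x=0; x<(n*(n+1)/2); x++) {
      // ... loop body using i and j ...
      ++j;
      if (j>i)              // past the end of row i: start the next row
      {
        j = 0;
        ++i;
      }
    }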
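
As a sanity check, the counter updates above can be compared against the plain nested loops. This is a sketch only: the nested form is assumed to be `i` from 0 to n-1 with `j` from 0 to `i`, which is what the bound n*(n+1)/2 and the `j>i` reset imply, and `fused_matches_nested` is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical check: returns 1 if the fused loop with carried counters
 * visits the same (i, j) pairs, in the same order, as the nested loops. */
static int fused_matches_nested(int n)
{
    int total = n * (n + 1) / 2;
    int *pairs = malloc(sizeof *pairs * 2 * (size_t)(total > 0 ? total : 1));
    if (pairs == NULL)
        return 0;

    /* Fused loop: record (i, j), then advance the counters. */
    int i = 0, j = 0;
    for (int x = 0; x < total; x++) {
        pairs[2 * x]     = i;
        pairs[2 * x + 1] = j;
        ++j;
        if (j > i) { j = 0; ++i; }
    }

    /* Reference: nested triangle loop (assumed form, j <= i). */
    int ok = 1, x = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b <= a; b++, x++)
            if (pairs[2 * x] != a || pairs[2 * x + 1] != b)
                ok = 0;

    free(pairs);
    return ok && x == total;
}
```

Note that `i` and `j` are carried from one iteration to the next, so a chunk of the fused range cannot start at an arbitrary x without first working out the (i, j) that corresponds to it.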
    
  • 2020-12-10 07:14

    "I'm wondering if there is a simpler or more efficient way of doing this?"

    Yes, the code you had to begin with. Please keep the following in mind:

    • Floating point arithmetic is rarely, if ever, faster than plain integer arithmetic.
    • There are, however, plenty of cases where floating point is far slower than plain integers, FPU or no FPU.
    • Float variables are generally larger than plain integers on most systems and therefore slower for that reason alone.
    • The first version of the code is likely the most cache-friendly. As with any manual optimization, this depends entirely on the CPU you are using.
    • Division is generally slow on most systems, whether it is done on plain integers or on floats.
    • Any form of complex arithmetic is going to be slower than simple counting.

    So your second example is pretty much guaranteed to be far slower than the first example, for any given CPU in the world. In addition, it is also completely unreadable.
