g++, range-based for and vectorization

半腔热情 submitted on 2020-03-24 04:56:42

Question


Consider the following range-based for loop in C++11:

for ( T k : j )
{
  ...
}

Are there g++ or clang++ optimization flags that can speed up the compiled code?

I'm not asking about for loops in general; I'm only considering this new C++11 construct.


Answer 1:


Optimizing loops is very rarely about optimizing the actual loop iteration code (for ( T k : j ) in this case), but very much about optimizing what is IN the loop.

Now, since the loop body is just ... in this case, it's impossible to say whether it would help to, for example, unroll the loop, declare functions inline [or simply move them so that the compiler can see them and inline them], use auto-vectorization, or perhaps use a completely different algorithm inside the loop.

The examples in the paragraph above in a bit more detail:

  1. Unrolling the loop - essentially do several of the loop iterations without going back to the start of the loop. This is most helpful when the loop content is very small. There is automatic unrolling, where the compiler does the unrolling, or you can unroll the code manually, by simply doing, say, four items in each loop iteration and then stepping four items forward in each loop variable update or updating the iterator multiple times during the loop itself [but this of course means not using the range-based for-loop].
  2. Inline functions - the compiler will take (usually small) functions and place them into the loop itself, rather than having the call. This saves on the time it takes for the processor to call out to another place in the code and return back. Most compilers only do this for functions that are "visible" to the compiler during compilation - so the source has to be either in the same source file, or in a header file that is included in the source file that is compiled.
  3. Auto-vectorisation - using SSE, MMX or AVX instructions to process multiple data items in one instruction (e.g. one SSE instruction can add four float values to another four float in one instruction). This is faster than operating on a single data item at a time (most of the time, sometimes it's no benefit because of additional complications with trying to combine the different data items and then sorting out what goes where when the calculation is finished).
  4. Choose different algorithm - there are often several ways to solve a particular problem. Depending on what you are trying to achieve, a for-loop [of whatever kind] may not be the right solution in the first place, or the code inside the loop could perhaps use a more clever way to calculate/rearrange/whatever-it-does to achieve the result you need.
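As a rough illustration of item 1, here is a minimal sketch that sums a vector once with a range-based for loop and once with a loop unrolled by hand. The function names are hypothetical and chosen only for this example. In practice the compiler may already unroll or vectorize the first version on its own at higher optimization levels, so the manual version is not guaranteed to be faster.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain range-based for: the compiler is free to unroll or vectorize this.
float sum_range_for(const std::vector<float>& values) {
    float total = 0.0f;
    for (float v : values) {
        total += v;
    }
    return total;
}

// Manual unrolling by four (hypothetical helper for illustration).
// This needs an index-based loop, so range-based for cannot be used.
float sum_unrolled_by_four(const std::vector<float>& values) {
    const std::size_t n = values.size();
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    // Main loop: four elements per iteration, four independent accumulators.
    for (; i + 4 <= n; i += 4) {
        s0 += values[i];
        s1 += values[i + 1];
        s2 += values[i + 2];
        s3 += values[i + 3];
    }
    assert(n - i < 4);  // at most three elements are left for the tail loop
    float total = (s0 + s1) + (s2 + s3);
    // Tail loop: handle the leftover elements one at a time.
    for (; i < n; ++i) {
        total += values[i];
    }
    return total;
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point input the two results can differ in the last bits. That reordering is also the reason compilers will not always vectorize a float reduction unless they are told that it is acceptable.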

But ... is far too vague to say which, if any, of the above solutions will work to improve your code.




Answer 2:


GCC documentation about auto-vectorization doesn't mention anything about the range-based for loop. Also, by definition, a range-based for loop boils down to:

{
    auto && __range = range_expression ;
    for (auto __begin = begin_expr,
                __end = end_expr;
            __begin != __end; ++__begin) {
        range_declaration = *__begin;
        loop_statement
    }
}

So, technically speaking, any flag that helps auto-vectorize this kind of regular for loop should equally auto-vectorize the equivalent range-based for loop. I really do think compilers only translate range-based for loops to regular for loops, then let the auto-vectorizer do its job on those regular loops. This means there is no need for a special flag to tell your compiler to auto-vectorize range-based for loops in any scenario.
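To make the equivalence concrete, here is a small sketch with the same loop written both ways. The function names are made up for this example, and the local names range, first and last stand in for the compiler's internal __range, __begin and __end. One way to check the claim yourself would be to compile with something like g++ -O3 -fopt-info-vec or clang++ -O3 -Rpass=loop-vectorize and compare what is reported for the two loops; the exact output depends on the compiler version and target.

```cpp
#include <cassert>
#include <vector>

// Range-based for: scale every element in place.
void scale_range_for(std::vector<float>& data, float factor) {
    for (float& x : data) {
        x *= factor;
    }
}

// The same loop written out by hand, following the expansion above.
void scale_expanded(std::vector<float>& data, float factor) {
    auto&& range = data;
    for (auto first = range.begin(), last = range.end(); first != last; ++first) {
        float& x = *first;
        x *= factor;
    }
}

// Small self-check (hypothetical helper): both versions give the same values.
void check_scale_versions_agree() {
    std::vector<float> a{1.0f, 2.0f, 3.0f};
    std::vector<float> b = a;
    scale_range_for(a, 2.0f);
    scale_expanded(b, 2.0f);
    assert(a == b);
}
```

Since both functions describe the same iteration to the compiler, any difference in vectorization between them would be surprising.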


Since GCC's implementation was asked for, here is the relevant comment in the source code describing what is actually done for the range-based for loop (you can check the implementation file parser.c if you want to have a look at the code):

/* Converts a range-based for-statement into a normal
   for-statement, as per the definition.

      for (RANGE_DECL : RANGE_EXPR)
        BLOCK

   should be equivalent to:

      {
        auto &&__range = RANGE_EXPR;
        for (auto __begin = BEGIN_EXPR, end = END_EXPR;
              __begin != __end;
              ++__begin)
          {
              RANGE_DECL = *__begin;
              BLOCK
          }
      }

   If RANGE_EXPR is an array:
    BEGIN_EXPR = __range
    END_EXPR = __range + ARRAY_SIZE(__range)
   Else if RANGE_EXPR has a member 'begin' or 'end':
    BEGIN_EXPR = __range.begin()
    END_EXPR = __range.end()
   Else:
    BEGIN_EXPR = begin(__range)
    END_EXPR = end(__range);

   If __range has a member 'begin' but not 'end', or vice versa, we must
   still use the second alternative (it will surely fail, however).
   When calling begin()/end() in the third alternative we must use
   argument dependent lookup, but always considering 'std' as an associated
   namespace.  */

As you can see, they do nothing more than what the standard is actually describing.
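The three alternatives for BEGIN_EXPR and END_EXPR listed in that comment can be sketched as follows. The demo namespace, the IntSpan type and the function names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <vector>

namespace demo {  // hypothetical namespace and type for illustration

// No begin()/end() members, so the third alternative applies:
// the free functions below are found by argument-dependent lookup.
struct IntSpan {
    const int* first;
    const int* last;
};
inline const int* begin(const IntSpan& s) { return s.first; }
inline const int* end(const IntSpan& s) { return s.last; }

}  // namespace demo

// First alternative: a built-in array, iterated via pointer arithmetic.
int sum_array() {
    int raw[] = {1, 2, 3};
    int total = 0;
    for (int v : raw) total += v;
    return total;
}

// Second alternative: a class with member begin()/end().
int sum_member() {
    std::vector<int> vec{4, 5, 6};
    int total = 0;
    for (int v : vec) total += v;
    return total;
}

// Third alternative: free begin()/end() found through ADL.
int sum_adl() {
    static const int storage[] = {7, 8, 9};
    demo::IntSpan span{storage, storage + 3};
    int total = 0;
    for (int v : span) total += v;
    return total;
}

// Small self-check (hypothetical helper) covering all three alternatives.
void check_lookup_alternatives() {
    assert(sum_array() == 6);
    assert(sum_member() == 15);
    assert(sum_adl() == 24);
}
```

In all three cases the loop ends up as an ordinary iterator or pointer loop, which is what the optimizer sees.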



Source: https://stackoverflow.com/questions/16688419/g-range-based-for-and-vectorization
