gcc auto-vectorisation (unhandled data-ref)

Submitted by 和自甴很熟 on 2019-12-06 04:46:17

GCC cannot vectorise the first version of your loop because it cannot prove that pfTab[iIndex] does not lie somewhere within the memory spanned by pfResult[0] ... pfResult[iSize-1] (pointer aliasing). If pfTab[iIndex] did lie within that memory, the assignment in the loop body would overwrite its value, and the new value would have to be used in the iterations that follow. Use the restrict keyword to tell the compiler that this can never happen, and it should then happily vectorise your code:

$ cat foo.c
int MyFunc(const float *restrict pfTab, float *restrict pfResult,
           int iSize, int iIndex)
{
   for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + pfTab[iIndex];
}
$ gcc -v
...
gcc version 4.6.1 (GCC)
$ gcc -std=c99 -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:3: note: LOOP VECTORIZED.
foo.c:1: note: vectorized 1 loops in function.
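
To see why the compiler has to be conservative without restrict, the following sketch compares a loop that re-reads pfTab[iIndex] on every iteration with one that reads it once before the loop. The function names add_reload and add_hoisted are made up for this illustration; only the loop body is taken from the code above. When pfTab points into pfResult the two produce different results, so the compiler may not silently turn the first into the second:

```c
#include <assert.h>

/* Reads pfTab[iIndex] on every iteration, which is what the C semantics
   of the non-restrict loop require when the two arrays may overlap. */
void add_reload(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    assert(iSize >= 0 && iIndex >= 0);
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}

/* Reads pfTab[iIndex] once before the loop. This is only equivalent to
   add_reload when the loop never modifies pfTab[iIndex]. */
void add_hoisted(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    assert(iSize >= 0 && iIndex >= 0);
    float fTab = pfTab[iIndex];
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
```

Passing the same four-element array {1, 2, 3, 4} as both pfTab and pfResult with iIndex = 1 leaves {3, 4, 7, 8} after add_reload but {3, 4, 5, 6} after add_hoisted, because the store to element 1 changes the value added in the later iterations. With restrict-qualified parameters such an overlapping call would be undefined behaviour, which is exactly what frees the compiler to vectorise.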

The second version vectorises because the value is copied into a variable with automatic storage duration before the loop starts. The general assumption here is that pfResult does not overlap the stack memory where fTab is stored (a cursory read through the C99 language specification doesn't make it clear whether that assumption is weak or whether something in the standard allows it).

The OpenMP version does not vectorise because of the way OpenMP is implemented in GCC, which uses code outlining for parallel regions: the body of the region is moved into a separate function. This code:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  #pragma omp parallel for
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

effectively becomes:

struct omp_data_s
{
  float *pfResult;
  int iSize;
  float fTab;
};

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  struct omp_data_s omp_data_o;

  omp_data_o.pfResult = pfResult;
  omp_data_o.iSize = iSize;
  omp_data_o.fTab = fTab;

  GOMP_parallel_start (MyFunc_omp_fn0, &omp_data_o, 0);
  MyFunc_omp_fn0 (&omp_data_o);
  GOMP_parallel_end ();
  pfResult = omp_data_o.pfResult;
  iSize = omp_data_o.iSize;
  fTab = omp_data_o.fTab;
}

void MyFunc_omp_fn0 (struct omp_data_s *omp_data_i)
{
  int start = ...; // compute starting iteration for current thread
  int end = ...; // compute ending iteration for current thread

  for (int i = start; i < end; i++)
    omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

MyFunc_omp_fn0 contains the outlined loop. The compiler cannot prove that omp_data_i->pfResult does not point to memory that aliases *omp_data_i, and specifically its member fTab, so it has to assume that a store through pfResult may change fTab.
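
The aliasing the compiler has to allow for can be constructed by hand. The sketch below uses a struct modelled on the one above, with fTab stored by value, and a hypothetical outlined_body function that takes the iteration range as explicit start and end parameters (the real outlined function computes them per thread). In it, pfResult is made to point at the fTab member of the same struct, so the store in the loop changes the value that a later load of fTab would see:

```c
#include <assert.h>

/* Simplified stand-in for the data block passed to the outlined function. */
struct omp_data_s
{
    float *pfResult;
    int iSize;
    float fTab;
};

/* Same loop body as the outlined function, with the range passed in. */
void outlined_body(struct omp_data_s *omp_data_i, int start, int end)
{
    assert(start >= 0 && end <= omp_data_i->iSize);
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

/* Builds a struct whose pfResult aliases its own fTab member, runs a single
   iteration, and returns the value of fTab afterwards. */
float run_self_aliasing(float fTab)
{
    struct omp_data_s d;
    d.pfResult = &d.fTab;   /* legal C: a float * may point at a float member */
    d.iSize = 1;
    d.fTab = fTab;
    outlined_body(&d, 0, 1);
    return d.fTab;          /* changed by the store through pfResult */
}
```

Calling run_self_aliasing(1.5f) returns 3.0f: the one store through pfResult doubled fTab. No sane caller does this, but since it is valid C the compiler cannot rule it out from inside the outlined function.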

In order to vectorise that loop, you have to make fTab firstprivate. This will turn it into an automatic variable in the outlined code and that will be equivalent to your second case:

$ cat foo.c
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
   float fTab = pfTab[iIndex];
   #pragma omp parallel for firstprivate(fTab)
   for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}
$ gcc -std=c99 -fopenmp -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: LOOP VECTORIZED.
foo.c:4: note: vectorized 1 loops in function.
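
As a rough sketch of why firstprivate helps, the outlined function can be pictured as copying fTab into a local variable before the loop. The struct layout, the name outlined_firstprivate and the explicit start and end parameters are assumptions made for illustration, not the code GCC actually generates:

```c
#include <assert.h>

/* Simplified stand-in for the data block passed to the outlined function;
   the layout is an assumption made for illustration. */
struct omp_data_s
{
    float *pfResult;
    int iSize;
    float fTab;
};

/* Hypothetical outlined body with firstprivate(fTab): the value is copied
   into a local once, so the loop adds a value that its stores cannot change. */
void outlined_firstprivate(struct omp_data_s *omp_data_i, int start, int end)
{
    assert(start >= 0 && end <= omp_data_i->iSize);
    float fTab = omp_data_i->fTab;   /* private copy, made before the loop */

    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + fTab;
}
```

The loop now has the same shape as the second, non-OpenMP version: a store through a float pointer and a loop-invariant local whose address is never taken.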