Difference between kernels construct and parallel construct

Submitted by 冷暖自知 on 2019-12-31 10:40:24

Question


I have studied many articles and the OpenACC manual, but I still don't understand the main difference between these two constructs.


Answer 1:


The kernels directive is the more general case, and probably the one you might think of if you've written GPU (e.g. CUDA) kernels before. kernels simply directs the compiler to work on a region of code and produce an arbitrary number of "kernels", of arbitrary "dimensions", to be executed in sequence, in order to parallelize/offload that section of code to the accelerator. The parallel construct allows finer-grained control over how the compiler will structure the work on the accelerator, for example by specifying specific dimensions of parallelization. For example, the number of workers and gangs would normally be constant across a parallel directive (since only one underlying "kernel" is usually implied), but perhaps not across a kernels directive (since it may translate to multiple underlying "kernels").

A good treatment of this specific question is contained in this PGI article.

Quoting from the article summary: "The OpenACC kernels and parallel constructs each try to solve the same problem, identifying loop parallelism and mapping it to the machine parallelism. The kernels construct is more implicit, giving the compiler more freedom to find and map parallelism according to the requirements of the target accelerator. The parallel construct is more explicit, and requires more analysis by the programmer to determine when it is legal and appropriate."




Answer 2:


OpenACC directives and GPU kernels are just two ways of representing the same thing -- a section of code that can run in parallel.

OpenACC may be best when retrofitting an existing app to take advantage of a GPU and/or when it is desirable to let the compiler handle more details related to issues such as memory management. This can make it faster to write an app, with a potential cost in performance.

Kernels may be best when writing a GPU app from scratch and/or when more fine-grained control is desired. This can make the app take longer to write, but may increase performance.

I think that people new to GPUs may be tempted to go with OpenACC because it looks more familiar. But I think it's actually better to go the other way: start by writing kernels, and then potentially move to OpenACC to save time in some projects. The reason is that OpenACC is a leaky abstraction. While OpenACC may make it look as if the GPU details are abstracted away, they are still there. So using OpenACC to write GPU code without understanding what is happening in the background is likely to be frustrating, producing odd error messages at compile time and an app with low performance.




Answer 3:


Parallel Construct

  1. Defines the region of the program that should be compiled for parallel execution on the accelerator device.

  2. The parallel loop directive is an assertion by the programmer that it is both safe and desirable to parallelize the affected loop. This relies on the programmer having correctly identified the parallelism in the code and removed anything that may be unsafe to parallelize. If the programmer incorrectly asserts that the loop can be parallelized, the resulting application may produce incorrect results.

  3. The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator. So it does not rely heavily on the compiler’s ability to automatically parallelize the code.

  4. When parallel loop is used on two subsequent loops that access the same data, a compiler may or may not copy the data back and forth between the host and the device between the two loops.

  5. More experienced parallel programmers, who may have already identified parallel loops within their code, will likely find the parallel loop approach more desirable.

Example:

#pragma acc parallel
{
    #pragma acc loop
    for (i = 0; i < n; i++)
        a[i] = 3.0f * (float)(i + 1);
    #pragma acc loop
    for (i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}

Generates one kernel.

There is no barrier between the two loops: the second loop may start before the first loop ends. (This is different from OpenMP.)

Kernels Construct

  1. Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.

  2. An important thing to note about the kernels construct is that the compiler will analyze the code and only parallelize it when it is certain that it is safe to do so. In some cases, the compiler may not have enough information at compile time to determine whether a loop is safe to parallelize, in which case it will not parallelize the loop, even if the programmer can clearly see that the loop is safely parallel.

  3. The kernels construct gives the compiler maximum leeway to parallelize and optimize the code how it sees fit for the target accelerator but also relies most heavily on the compiler’s ability to automatically parallelize the code.

  4. One more notable benefit of the kernels construct is that if multiple loops access the same data, it will only be copied to the accelerator once, which may result in less data motion.

  5. Programmers with less parallel programming experience or whose code contains a large number of loops that need to be analyzed may find the kernels approach much simpler, as it puts more of the burden on the compiler.

Example:

#pragma acc kernels
{
    for (i = 0; i < n; i++)
        a[i] = 3.0f * (float)(i + 1);
    for (i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}

Generates two kernels.

There is an implicit barrier between the two loops: the second loop will start after the first loop ends.



Source: https://stackoverflow.com/questions/13460903/difference-between-kernels-construct-and-parallel-construct
