Confusion with CUDA PTX code and register memory

Submitted by 你。 on 2019-12-03 13:38:00

PTX is an intermediate language designed to be portable across multiple GPU architectures. The compiler component PTXAS compiles it into final machine code, also referred to as SASS, for a particular architecture. The nvcc option -Xptxas -v makes PTXAS report various statistics about the generated machine code, including the number of physical registers it uses. You can inspect the machine code by disassembling it with cuobjdump --dump-sass.

So the number of registers you see in PTX code says nothing about how many physical registers the final machine code uses, since the PTX registers are virtual. The CUDA compiler generates PTX code in what is known as SSA form (static single assignment, see http://en.wikipedia.org/wiki/Static_single_assignment_form). This basically means that each new result written is assigned a new register.
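To make the SSA idea concrete, here is a rough analogy in plain C++ (C++14 or later); it is not compiler output. The function names eval_reuse and eval_ssa are made up for this illustration. The first version overwrites one variable, while the second gives every intermediate result a fresh name that is written exactly once, which is roughly how virtual registers appear in PTX.

```cpp
// Ordinary style: a single variable is overwritten several times.
constexpr int eval_reuse(int a) {
    int x = a * a;   // x holds a*a
    x = x + a;       // x is overwritten with a*a + a
    x = x * 2;       // x is overwritten again
    return x;
}

// SSA style: every result gets its own name and is assigned exactly once.
// Three names appear here, yet only one value is live at any given time,
// so a register allocator could keep all of them in one physical register.
constexpr int eval_ssa(int a) {
    const int t1 = a * a;
    const int t2 = t1 + a;
    const int t3 = t2 * 2;
    return t3;
}
```

Both functions compute the same value. The SSA version merely uses more names, which is why counting virtual registers in PTX overstates the physical register count reported by PTXAS.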

The instruction mul.wide is described in the PTX specification, the current version of which (3.1) you can find here: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html . In your example code, the suffix .u16 means that it multiplies two unsigned 16-bit quantities and returns an unsigned 32-bit result, i.e. it computes the full, double-width product of the source operands.
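As a sketch of what that means arithmetically, the following host-side C++ models the result of mul.wide.u16 and, for contrast, of mul.lo.u16, which keeps only the low 16 bits of the product. The helper names mul_wide_u16 and mul_lo_u16 are hypothetical and chosen for this illustration; this is an emulation of the semantics, not the code the GPU runs.

```cpp
#include <cstdint>

// Models mul.wide.u16: two unsigned 16-bit inputs, full 32-bit product.
// The operands are widened before multiplying so no bits are lost
// (and so the multiplication is not done in signed int).
constexpr std::uint32_t mul_wide_u16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
}

// Models mul.lo.u16 for contrast: only the low 16 bits of the product survive.
constexpr std::uint16_t mul_lo_u16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(
        static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b));
}
```

For the largest inputs, mul_wide_u16(65535, 65535) yields 0xFFFE0001, whereas mul_lo_u16(65535, 65535) yields 0x0001 because the upper half of the product is discarded.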

Virtual registers in PTX are typed, but their names can be chosen freely, independent of type. The CUDA compiler appears to follow certain conventions that are (to my knowledge) not documented, since they are internal implementation artifacts. Looking at a bunch of PTX code, it is clear that the register names currently generated encode type information, possibly for ease of debugging: p<num> is used for predicates, r<num> for 32-bit integers, rd<num> for 64-bit integers, f<num> for 32-bit floats, and fd<num> for 64-bit doubles. You can easily see this for yourself by looking at the .reg directives in the PTX code that create these virtual registers.
