CUDA PTX register declaration and usage

Submitted by 本秂侑毒 on 2019-12-12 01:46:34

Question


I am trying to reduce the number of registers my kernel uses, so I decided to try inline PTX.

This kernel:

#define Feedback(a, b, c, d, e) d^e^(a&c)^(a&e)^(b&c)^(b&e)^(c&d)^(d&e)^(a&d&e)^(a&c&e)^(a&b&d)^(a&b&c)

__global__ void Test(unsigned long a, unsigned long b, unsigned long c, unsigned long d, unsigned long e, unsigned long f, unsigned long j, unsigned long h, unsigned long* res)
{
    res[0] = Feedback( a, b, c, d, e );  
    res[1] = Feedback( b, c, d, e, f );
    res[2] = Feedback( c, d, e, f, j );  
    res[3] = Feedback( d, e, f, j, h );
}  

It uses 14 registers. I think this is more than it needs, so I wrote inline PTX:

__global__ void Feedback_ASM(unsigned long a, unsigned long b, unsigned long c, unsigned long d, unsigned long e, unsigned long f, unsigned long j, unsigned long h, unsigned long* res)
{
asm(".reg .u32 %r<10>;\n");

// 1
asm("ld.param.u32   %r1, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_a];\n"
    "ld.param.u32   %r2, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_b];\n"
    "ld.param.u32   %r3, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_c];\n"
    "ld.param.u32   %r4, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_d];\n"
    "ld.param.u32   %r5, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_e];\n");

asm("and.b32 %r7, %r1, %r3;\n"
    "xor.b32 %r8, %r7, %r4;\n"
    "xor.b32 %r7, %r8, %r5;\n"
    "and.b32 %r8, %r1, %r5;\n"
    "xor.b32 %r9, %r7, %r8;\n"
    .............................
    "xor.b32 %r8, %r7, %r9;\n"
    "and.b32 %r6, %r1, %r2;\n"
    "and.b32 %r7, %r6, %r3;\n"
    "xor.b32 %r9, %r7, %r8;\n");

asm("ld.param.u32   %r8, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_res];\n"
    "st.global.u32  [%r8+0], %r9;");     
// 2
...
// 3
...
// 4
...
}     

But this kernel uses 14 registers too! I am a little confused: I declared only 10 registers, and there are no other variables in the PTX file. How can I solve this?


Answer 1:


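As indicated already, PTX is an intermediate code. PTX "registers" are virtual registers and don't necessarily reflect actual device register usage. The physical registers are allocated later, when ptxas compiles the PTX to machine code, so declaring fewer registers in PTX does not by itself reduce the count.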

To get an idea of actual device register usage, compile using the ptxas verbose option:

nvcc -Xptxas -v ...

or use one of the profilers. You can also inspect the machine code directly using:

cuobjdump -sass myexe

(where myexe is replaced with the name of your executable).
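Register usage can also be read back from inside a program. The sketch below is not part of the original answer: it assumes the CUDA runtime call cudaFuncGetAttributes and its numRegs field, and it reuses the Test kernel from the question with the macro fully parenthesized as a precaution. Treat it as one possible way to check the number, not as the method the answer recommends.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Feedback function from the question, parenthesized so that it expands
// safely inside larger expressions (an editorial precaution).
#define Feedback(a, b, c, d, e)                                           \
    ((d) ^ (e) ^ ((a) & (c)) ^ ((a) & (e)) ^ ((b) & (c)) ^ ((b) & (e)) ^  \
     ((c) & (d)) ^ ((d) & (e)) ^ ((a) & (d) & (e)) ^ ((a) & (c) & (e)) ^  \
     ((a) & (b) & (d)) ^ ((a) & (b) & (c)))

__global__ void Test(unsigned long a, unsigned long b, unsigned long c,
                     unsigned long d, unsigned long e, unsigned long f,
                     unsigned long j, unsigned long h, unsigned long* res)
{
    res[0] = Feedback(a, b, c, d, e);
    res[1] = Feedback(b, c, d, e, f);
    res[2] = Feedback(c, d, e, f, j);
    res[3] = Feedback(d, e, f, j, h);
}

int main()
{
    // Ask the runtime for the attributes of the compiled kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, Test);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // numRegs is the number of physical registers per thread chosen by
    // ptxas, not the number of virtual registers declared in the PTX.
    printf("Test uses %d registers per thread\n", attr.numRegs);
    return 0;
}
```

The value printed here is for the machine code actually loaded on the current device, so it is expected to agree with what nvcc -Xptxas -v reports for the same target architecture.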

To control register usage, you can use the nvcc compile option:

nvcc -maxrregcount 10 ...

(where 10 is replaced with the maximum number of registers per thread that you want every kernel in your code limited to), or you can use the launch bounds directive (__launch_bounds__) in your code, which can control register usage on a kernel-by-kernel basis.
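As a rough illustration of the per-kernel approach, the sketch below applies the __launch_bounds__ qualifier to a single kernel. The kernel name Bounded, the helper scramble, and the values 256 and 4 are hypothetical and chosen only for the example. Note that launch bounds do not name a register count directly: they tell the compiler the maximum block size and a desired minimum number of resident blocks per multiprocessor, and the compiler derives a register limit from those.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical values for illustration only.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM 4

// Hypothetical per-element operation, callable on host and device.
__host__ __device__ inline unsigned int scramble(unsigned int x)
{
    return x ^ (x >> 1);
}

// The launch bounds apply to this kernel only; other kernels in the same
// file are compiled without this constraint.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
Bounded(const unsigned int* in, unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scramble(in[i]);
    }
}

int main()
{
    // Report the register count the compiler settled on for this kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, Bounded);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Bounded uses %d registers per thread\n", attr.numRegs);
    return 0;
}
```

A kernel declared this way must not be launched with more than MAX_THREADS_PER_BLOCK threads per block, otherwise the launch fails. Forcing the register count very low, with either mechanism, can make the compiler spill values to local memory, which may cost more than the registers saved.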



Source: https://stackoverflow.com/questions/29297960/cuda-ptx-registers-declaration-and-using
