Question
I am trying to reduce the number of registers my kernel uses, so I decided to try inline PTX.
This kernel:
#define Feedback(a, b, c, d, e) d^e^(a&c)^(a&e)^(b&c)^(b&e)^(c&d)^(d&e)^(a&d&e)^(a&c&e)^(a&b&d)^(a&b&c)
__global__ void Test(unsigned long a, unsigned long b, unsigned long c, unsigned long d, unsigned long e, unsigned long f, unsigned long j, unsigned long h, unsigned long* res)
{
    res[0] = Feedback( a, b, c, d, e );
    res[1] = Feedback( b, c, d, e, f );
    res[2] = Feedback( c, d, e, f, j );
    res[3] = Feedback( d, e, f, j, h );
}
uses 14 registers. I think this is more than it needs, so I wrote inline PTX:
__global__ void Feedback_ASM(unsigned long a, unsigned long b, unsigned long c, unsigned long d, unsigned long e, unsigned long f, unsigned long j, unsigned long h, unsigned long* res)
{
    asm(".reg .u32 %r<10>;\n");
    // 1
    asm("ld.param.u32 %r1, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_a];\n"
        "ld.param.u32 %r2, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_b];\n"
        "ld.param.u32 %r3, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_c];\n"
        "ld.param.u32 %r4, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_d];\n"
        "ld.param.u32 %r5, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_e];\n");
    asm("and.b32 %r7, %r1, %r3;\n"
        "xor.b32 %r8, %r7, %r4;\n"
        "xor.b32 %r7, %r8, %r5;\n"
        "and.b32 %r8, %r1, %r5;\n"
        "xor.b32 %r9, %r7, %r8;\n"
        .............................
        "xor.b32 %r8, %r7, %r9;\n"
        "and.b32 %r6, %r1, %r2;\n"
        "and.b32 %r7, %r6, %r3;\n"
        "xor.b32 %r9, %r7, %r8;\n");
    asm("ld.param.u32 %r8, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_res];\n"
        "st.global.u32 [%r8+0], %r9;");
    // 2
    ...
    // 3
    ...
    // 4
    ...
}
But this kernel uses 14 registers too! I am a little confused. I declared only 10 registers, and there are no other variables in the PTX file. How can I solve this?
Answer 1:
As indicated already, PTX is an intermediate code. PTX "registers" are virtual registers and don't necessarily reflect actual device register usage; the real register allocation is done by ptxas when it compiles the PTX to machine code.
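To make the "virtual register" point concrete, here is a minimal sketch, assuming a small helper that is not part of the original kernel. The names and_xor, and_xor_ref and AndXorKernel are made up for illustration. The inline PTX declares one temporary register t and passes values through operand constraints instead of hard-coded register names and ld.param. The temporary is only a name in the intermediate code, and ptxas decides which hardware registers (if any) back it.

```cuda
// Hypothetical helper (not from the original post): computes (a & b) ^ c
// with inline PTX. The "r" constraints let the compiler pick the registers.
__device__ unsigned int and_xor(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int out;
    asm("{\n\t"                      // braces limit the scope of t
        ".reg .b32 t;\n\t"           // t is a virtual PTX register
        "and.b32 t, %1, %2;\n\t"     // t = a & b
        "xor.b32 %0, t, %3;\n\t"     // out = t ^ c
        "}"
        : "=r"(out)
        : "r"(a), "r"(b), "r"(c));
    return out;
}

// Plain C++ reference for the same expression, usable on host and device.
__host__ __device__ inline unsigned int and_xor_ref(unsigned int a, unsigned int b, unsigned int c)
{
    return (a & b) ^ c;
}

// Hypothetical kernel that stores both results so they can be compared.
__global__ void AndXorKernel(unsigned int a, unsigned int b, unsigned int c, unsigned int* res)
{
    res[0] = and_xor(a, b, c);      // inline PTX version
    res[1] = and_xor_ref(a, b, c);  // compiler-generated version
}
```

Both versions go through the same ptxas register allocator, so hand-written PTX does not by itself fix the number of device registers the kernel ends up using.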
To get an idea of actual device register usage, compile using the ptxas verbose option:
nvcc -Xptxas -v ...
or use one of the profilers. You can also inspect the machine code directly using:
cuobjdump -sass myexe
(where myexe is replaced with the name of your executable).
To control register usage, you can use the nvcc compile option:
nvcc -maxrregcount 10 ...
(where 10 is replaced with the maximum number of registers per thread that you want all kernels in your code to be limited to), or you can use the __launch_bounds__ qualifier in your code, which can control register usage on a kernel-by-kernel basis.
Source: https://stackoverflow.com/questions/29297960/cuda-ptx-registers-declaration-and-using
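As a rough sketch of the per-kernel approach, the kernel from the question could be annotated with __launch_bounds__ as below. The values of MAX_THREADS and MIN_BLOCKS, the function name feedback and the kernel name TestBounded are assumptions chosen for illustration, not values from the original post. The macro is rewritten as a function only so that it can also be called on the host.

```cuda
// The Feedback expression from the question, written as a function
// (hypothetical name) that compiles for both host and device.
__host__ __device__ inline unsigned long feedback(unsigned long a, unsigned long b,
                                                  unsigned long c, unsigned long d,
                                                  unsigned long e)
{
    return d ^ e ^ (a & c) ^ (a & e) ^ (b & c) ^ (b & e) ^ (c & d) ^ (d & e)
         ^ (a & d & e) ^ (a & c & e) ^ (a & b & d) ^ (a & b & c);
}

// Illustrative assumptions: tune these for your own launch configuration.
#define MAX_THREADS 256  // largest block size the kernel will be launched with
#define MIN_BLOCKS  8    // desired minimum resident blocks per multiprocessor

// __launch_bounds__ gives the compiler a per-kernel hint from which it
// derives a register budget, instead of a global -maxrregcount limit.
__global__ void __launch_bounds__(MAX_THREADS, MIN_BLOCKS)
TestBounded(unsigned long a, unsigned long b, unsigned long c, unsigned long d,
            unsigned long e, unsigned long f, unsigned long j, unsigned long h,
            unsigned long* res)
{
    res[0] = feedback(a, b, c, d, e);
    res[1] = feedback(b, c, d, e, f);
    res[2] = feedback(c, d, e, f, j);
    res[3] = feedback(d, e, f, j, h);
}
```

Compiling this with nvcc -Xptxas -v reports the register count that ptxas actually chose. Keep in mind that a kernel launched with more than MAX_THREADS threads per block will fail to launch, and that a register budget set too low makes the compiler spill to local memory, which can cost more than the registers saved.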