How to create or manipulate GPU assembler?

后端 未结 6 1124
难免孤独
难免孤独 2020-12-13 07:30

Does any one have experience in creating/manipulating GPU machine code, possibly at run-time?

I am interested in modifying GPU assembler code, possibly at run tim

6条回答
  •  既然无缘
    2020-12-13 08:02

    NVIDIA PTX generation and modification

    Not sure how low level it is compared to the hardware (likely undocumented?), but it can be generated from C/C++-like GPU languages, modified and reused in a few ways:

    • OpenCL clGetProgramInfo(program, CL_PROGRAM_BINARIES + clCreateProgramWithBinary: minimal runnable example: How to use clCreateProgramWithBinary in OpenCL?

      These are standardized OpenCL API's, which produce and consume implementation defined formats, which in driver version 375.39 for Linux happens to be human readable PTX.

      So you can dump the PTX, modify it, and reload.

    • nvcc: can compile CUDA GPU-side code to ptx assembly simply with either:

      nvcc --ptx a.cu
      

      nvcc can also compile OpenCL C programs containing both device and host code: Compile and build .cl file using NVIDIA's nvcc Compiler? but I could not find how to get the ptx out with nvcc. Which kind of makes sense since it is just plain C + C strings, and not a magic C superset. This is also suggested by: https://arrayfire.com/generating-ptx-files-from-opencl-code/

      And I'm not sure how to recompile the modified PTX and use it as I did with clCreateProgramWithBinary: How to compile PTX code

    Using clGetProgramInfo, an input CL kernel:

    __kernel void kmain(__global int *out) {
        out[get_global_id(0)]++;
    }
    

    gets compiled to some PTX like:

    //
    // Generated by NVIDIA NVVM Compiler
    //
    // Compiler Build ID: CL-21124049
    // Cuda compilation tools, release 8.0, V8.0.44
    // Based on LLVM 3.4svn
    //
    
    .version 5.0
    .target sm_20
    .address_size 64
    
        // .globl   _Z3incPi
    
    .visible .entry _Z3incPi(
        .param .u64 _Z3incPi_param_0
    )
    {
        .reg .pred  %p<2>;
        .reg .b32   %r<4>;
        .reg .b64   %rd<5>;
    
    
        ld.param.u64    %rd1, [_Z3incPi_param_0];
        mov.u32     %r1, %ctaid.x;
        setp.gt.s32 %p1, %r1, 2;
        @%p1 bra    BB0_2;
    
        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32    %rd3, %r1, 4;
        add.s64     %rd4, %rd2, %rd3;
        ldu.global.u32  %r2, [%rd4];
        add.s32     %r3, %r2, 1;
        st.global.u32   [%rd4], %r3;
    
    BB0_2:
        ret;
    }
    

    Then if for example you modify the line:

    add.s32     %r3, %r2, 1;
    

    to:

    add.s32     %r3, %r2, 2;
    

    and reuse the PTX modified, it actually increments by 2 instead of 1 as expected.

提交回复
热议问题