I need to modify the PTX code and compile it directly. The reason is that I want certain instructions to appear right after each other, and it is difficult to write a CUDA…
I am rather late, but GPU Lynx does exactly that: it takes a CUDA fat binary, parses the PTX, and modifies it before passing the result to the driver for execution on a GPU. You can optionally print out the modified PTX as well.
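If you only need to place a few instructions back to back, a much cruder text-level rewrite of the `.ptx` file may be enough. The sketch below is not GPU Lynx and does not use its API. `insert_after_anchor` is a hypothetical helper that treats PTX as plain text (no real parsing) and splices new instructions directly after the first line matching a regex. The kernel and register names are made up for illustration.

```python
import re


def insert_after_anchor(ptx: str, anchor: str, new_instrs: list[str]) -> str:
    """Insert PTX instructions immediately after the first line matching `anchor`.

    Hypothetical helper: a plain text rewrite, not a PTX parser. It does not
    check that registers are declared or that the result is valid PTX.
    """
    lines = ptx.splitlines()
    pattern = re.compile(anchor)
    for i, line in enumerate(lines):
        if pattern.search(line):
            # Reuse the anchor line's leading whitespace so the output stays readable.
            indent = line[: len(line) - len(line.lstrip())]
            block = [indent + instr for instr in new_instrs]
            return "\n".join(lines[: i + 1] + block + lines[i + 1 :]) + "\n"
    raise ValueError(f"anchor {anchor!r} not found in PTX")


# Made-up kernel: %r<4> declares %r0..%r3, so %r2 and %r3 are available.
kernel = """\
.visible .entry k(.param .u64 p)
{
\t.reg .b32 %r<4>;
\tmov.u32 %r1, %tid.x;
\tret;
}
"""

patched = insert_after_anchor(
    kernel,
    r"mov\.u32\s+%r1",
    ["add.u32 %r2, %r1, 1;", "mul.lo.u32 %r3, %r2, 2;"],
)
print(patched)
```

The patched file can then be assembled with NVIDIA's `ptxas` or JIT-compiled at load time through the CUDA driver API (for example `cuModuleLoadData`). Keep in mind that `ptxas` is an optimizing assembler, so instructions that are adjacent in PTX are not guaranteed to stay adjacent in the final SASS.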