I am writing a simple CUDA program to perform matrix multiplication. The inputs to the CUDA kernel __global__ void multiply() are the matrices A and B, expressed as &q