I need to convert the following code from C++ with OpenMP to C++ with CUDA. As answered in this question: CUDA access matrix stored in RAM and possibility of being implement