cuda

Appropriate usage of cudaGetSymbolAddress and cudaMemcpyToSymbol with global memory?

谁都会走 Submitted on 2021-02-05 07:56:07
Question: I am fairly new to CUDA and am familiar with the normal usage of cudaMalloc and cudaMemcpy, and also with cudaMemcpyToSymbol for copying to constant memory. However, I have just been given some code that makes frequent use of cudaGetSymbolAddress and cudaMemcpyToSymbol to copy to global memory, and I'm not sure why the authors chose this instead of cudaMalloc/cudaMemcpy. Would somebody be able to explain when it is advantageous and appropriate to use cudaGetSymbolAddress and…
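For context, a minimal sketch of the pattern the question describes: a `__device__` variable is a symbol whose storage the runtime sets up automatically, so cudaMemcpyToSymbol writes to it without any prior cudaMalloc, and cudaGetSymbolAddress recovers an ordinary device pointer for use with cudaMemcpy. The names d_data and N below are illustrative, not from the post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 16
__device__ float d_data[N];   // statically declared global-memory symbol

__global__ void scale(float s) {
    int i = threadIdx.x;
    if (i < N) d_data[i] *= s;
}

int main() {
    float h[N];
    for (int i = 0; i < N; ++i) h[i] = (float)i;

    // Copy to the symbol directly: no cudaMalloc needed, the storage
    // already exists because d_data is declared at file scope.
    cudaMemcpyToSymbol(d_data, h, sizeof(h));

    scale<<<1, N>>>(2.0f);

    // To use ordinary cudaMemcpy on the symbol, first resolve its address.
    void *dev_ptr = nullptr;
    cudaGetSymbolAddress(&dev_ptr, d_data);
    cudaMemcpy(h, dev_ptr, sizeof(h), cudaMemcpyDeviceToHost);

    printf("%f\n", h[3]);  // 3.0 scaled by 2 -> 6.0
    return 0;
}
```

One reason code is written this way: the array's size is fixed at compile time and the symbol is visible to every kernel in the translation unit without passing a pointer argument.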

Structure in Texture memory on CUDA

痞子三分冷 Submitted on 2021-02-04 08:28:25
Question: I have an array containing a structure of two elements that I send to CUDA in global memory, and I read the values back from global memory. Having read through some books and posts, and since I am only reading values from the structure, I thought it would be interesting to see whether I could store my array in texture memory. I used the following code outside the kernel: texture<node, cudaTextureType1D, cudaReadModeElementType> textureNode; and the following lines in main(): gpuErrchk(cudaMemcpy(tree_d,…
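A complication the question runs into: texture fetches only have built-in channel formats for 1-, 2-, and 4-component vector types, so an arbitrary two-field struct cannot be bound directly; the usual workaround is to repack it as float2/int2. A hedged sketch using the modern texture-object API (the legacy texture<> reference API is deprecated), with float2 standing in for the post's `node` struct:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void readTex(cudaTextureObject_t tex, float2 *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float2>(tex, i);  // read-only cached fetch
}

int main() {
    const int n = 8;
    float2 h[n];
    for (int i = 0; i < n; ++i) h[i] = make_float2((float)i, (float)-i);

    float2 *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float2));
    cudaMalloc(&d_out, n * sizeof(float2));
    cudaMemcpy(d_in, h, n * sizeof(float2), cudaMemcpyHostToDevice);

    // Describe the linear buffer as a texture resource.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float2>();
    res.res.linear.sizeInBytes = n * sizeof(float2);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    readTex<<<1, n>>>(tex, d_out, n);
    cudaMemcpy(h, d_out, n * sizeof(float2), cudaMemcpyDeviceToHost);
    printf("%f %f\n", h[3].x, h[3].y);

    cudaDestroyTextureObject(tex);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```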

A Summary of Tips for the PyTorch Deep Learning Framework

邮差的信 Submitted on 2021-02-02 10:44:02
1. Specifying GPU indices when training a model. To restrict the current process to device 0 (named "/gpu:0"), set os.environ["CUDA_VISIBLE_DEVICES"]="0"; to use devices 0 and 1 (named "/gpu:0" and "/gpu:1", used in that order of preference), set os.environ["CUDA_VISIBLE_DEVICES"]="0,1". The same selection can be made outside the training script: CUDA_VISIBLE_DEVICES=0,1 python train.py. Note that if you want cards 6 and 7 of an 8-GPU machine, you launch with CUDA_VISIBLE_DEVICES=6,7 python train.py, but when parallelizing the model you still refer to them as 0 and 1: model = nn.DataParallel(model, device_ids=[0,1]). Also note that the GPU-selection commands must come before any operation that touches the model.
2. Viewing each layer's input and output details. 1. Install torchsummary or torchsummaryX (pip install torchsummary); 2. usage example: from torchvision import models; vgg16 = models.vgg16(); vgg16 = vgg16.cuda() # 1…

How to implement device side CUDA virtual functions?

杀马特。学长 韩版系。学妹 Submitted on 2021-02-02 08:16:00
Question: I see that CUDA doesn't allow classes with virtual functions to be passed into kernel functions. Are there any workarounds for this limitation? I would really like to be able to use polymorphism within a kernel function. Thanks! Answer 1: The most important part of Robert Crovella's comment is: the objects simply need to be created on the device. So keeping that in mind, I was dealing with a situation where I had an abstract class Function and then some implementations of it encapsulating…
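The workaround the answer points at can be sketched as follows: objects with virtual functions work inside kernels as long as they are constructed on the device (e.g. with device-side `new`), because then the vtable pointer refers to device memory. The class names here are illustrative, loosely following the answer's `Function` example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Function {
    __device__ virtual float eval(float x) const = 0;
    __device__ virtual ~Function() {}
};

struct Square : Function {
    __device__ float eval(float x) const override { return x * x; }
};

// Construct and destroy the polymorphic object on the device.
__global__ void create(Function **f)  { *f = new Square(); }
__global__ void destroy(Function **f) { delete *f; }

__global__ void apply(Function **f, float *out, float x) {
    *out = (*f)->eval(x);   // virtual dispatch works: the vtable pointer
}                           // is valid because the object lives on device

int main() {
    Function **d_f; float *d_out; float h;
    cudaMalloc(&d_f, sizeof(Function *));
    cudaMalloc(&d_out, sizeof(float));
    create<<<1, 1>>>(d_f);
    apply<<<1, 1>>>(d_f, d_out, 3.0f);
    cudaMemcpy(&h, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h);  // 9.0
    destroy<<<1, 1>>>(d_f);
    cudaFree(d_f); cudaFree(d_out);
    return 0;
}
```

An object built with host-side `new` and cudaMemcpy'd over would carry a host vtable pointer, which is exactly what makes the naive approach crash.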

Day 31 of Failing to Install mxnet!!!

蓝咒 Submitted on 2021-02-01 09:45:55
Installing the deep-learning framework mxnet. Up front: I finally succeeded. C:\ProgramData\Anaconda3>python.exe -c "import mxnet as mx; print(mx.nd.zeros((1,2), ctx=mx.gpu()) + 1)" prints [[1. 1.]] <NDArray 1x2 @gpu(0)>. Following various hints, I first checked the NVIDIA control panel to see which CUDA version I needed; it said I could install CUDA 10.2. After failing to download it from the official site for N days, I learned to search for a kind soul's domestic mirror. With CUDA 10.2 installed and Anaconda already in place, all that remained was installing mxnet, so I followed the tutorial: D://PortableProgram/Anaconda3/python.exe -m pip install --pre mxnet-cu102 -f https://dist.mxnet.io/python/cu102. Sure enough, this failed for many more days: one attempt a day, one failure a day, with every possible explanation floating around online; trying again each day was great for morale... Solution: today I finally gave up on the supposedly best-matching version, uninstalled CUDA 10.2, and installed CUDA 10.1 instead. Back to the tutorial for the install: D://…

memset cuArray for surface memory

≡放荡痞女 Submitted on 2021-01-29 22:42:43
Question: Say you have a cuArray for binding a surface object, something of the form: // These are inputs to a function, really. cudaArray* d_cuArrSurf; cudaSurfaceObject_t* surfImage; const cudaExtent extent = make_cudaExtent(width, height, depth); cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>(); cudaMalloc3DArray(&d_cuArrSurf, &channelDesc, extent); // Bind to surface cudaResourceDesc surfRes; memset(&surfRes, 0, sizeof(cudaResourceDesc)); surfRes.resType = cudaResourceTypeArray;…
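A point worth noting for this question: a cudaArray cannot be cleared with cudaMemset (that only works on linear device memory). One common workaround, sketched here under the assumption that the surface object was created as in the excerpt, is a kernel that writes zeros through the surface:

```cuda
#include <cuda_runtime.h>

// Zero every float element of a 3D surface-backed cudaArray.
__global__ void clearSurface(cudaSurfaceObject_t surf,
                             int width, int height, int depth) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < width && y < height && z < depth)
        surf3Dwrite(0.0f, surf, x * sizeof(float), y, z);  // x is in bytes
}

// Hypothetical usage, assuming width/height/depth and surfImage
// from the question:
//   dim3 block(8, 8, 8);
//   dim3 grid((width + 7) / 8, (height + 7) / 8, (depth + 7) / 8);
//   clearSurface<<<grid, block>>>(*surfImage, width, height, depth);
```

The alternative is cudaMemcpy3D from a zero-filled buffer, which is simpler but costs an extra allocation and a host-device transfer.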

CUDA error identifier “__stcs” is undefined

拜拜、爱过 Submitted on 2021-01-29 16:52:35
Question: I want to use the store function with cache hint __stcs on a Pascal GPU with CUDA 10.0. The CUDA C++ Programming Guide makes no mention of any header for the data type unsigned long long, and the compiler returns the error identifier "__stcs" is undefined. How do I fix this compilation error? Answer 1: These intrinsics require CUDA 11.0; they are new in CUDA 11.0. If you look at the CUDA 10.0 programming guide you will see they are not mentioned. You can also see that they are mentioned in the "changes from…
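A hedged sketch of the two options this implies: upgrade to CUDA 11+, where the intrinsic exists, or on older toolkits approximate the same cache-streaming store with inline PTX (st.global.cs). The wrapper name below is illustrative:

```cuda
// Streaming (evict-first) 64-bit store: intrinsic on CUDA 11+,
// inline PTX fallback on earlier toolkits.
__device__ void store_streaming(unsigned long long *p, unsigned long long v) {
#if __CUDACC_VER_MAJOR__ >= 11
    __stcs(p, v);   // cache hint: data is not expected to be reused
#else
    asm volatile("st.global.cs.u64 [%0], %1;" :: "l"(p), "l"(v) : "memory");
#endif
}
```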

Don't understand why column addition faster than row in CUDA

浪尽此生 Submitted on 2021-01-29 12:36:19
Question: I started with CUDA and wrote two kernels as an experiment. They both accept 3 pointers to arrays of n*n elements (a matrix emulation) and n. __global__ void th_single_row_add(float* a, float* b, float* c, int n) { int idx = blockDim.x * blockIdx.x * n + threadIdx.x * n; for (int i = 0; i < n; i ++) { if (idx + i >= n*n) return; c[idx + i] = a[idx + i] + b[idx + i]; } } __global__ void th_single_col_add(float* a, float* b, float* c, int n) { int idx = blockDim.x * blockIdx.x + threadIdx.x; for (int i = 0;…
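The likely explanation, sketched here, is memory coalescing: in th_single_col_add, consecutive threads of a warp touch consecutive addresses on each iteration, so the 32 loads of a warp coalesce into a few memory transactions, whereas in th_single_row_add neighbouring threads are n floats apart and each load becomes its own transaction. A grid-stride variant that keeps warp neighbours adjacent (an illustrative rewrite, not from the question):

```cuda
// Coalesced element-wise addition over an n*n matrix stored row-major.
__global__ void coalesced_add(const float *a, const float *b,
                              float *c, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    // Stride by the total thread count: threads idx and idx+1 always
    // access adjacent elements, so each warp's accesses coalesce.
    for (int i = idx; i < n * n; i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}
```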