cuda

Using maximum shared memory in CUDA

℡╲_俬逩灬. Submitted on 2021-01-15 11:54:01
Question: I am unable to use more than 48 KB of shared memory (on a V100, CUDA 10.2). I call cudaFuncSetAttribute(my_kernel, cudaFuncAttributePreferredSharedMemoryCarveout, cudaSharedmemCarveoutMaxShared); before launching my_kernel for the first time. I use launch bounds and dynamic shared memory inside my_kernel: __global__ void __launch_bounds__(768, 1) my_kernel(...) { extern __shared__ float2 sh[]; ... } The kernel is called like this: dim3 blk(32, 24); // 768 threads as in launch_bounds. my_kernel<<<grd, blk, 64
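The excerpt cuts off at the launch line, but on Volta the carveout preference alone does not lift the 48 KB default: the carveout attribute is only a hint about the L1/shared-memory split, and a kernel must additionally opt in per function via cudaFuncAttributeMaxDynamicSharedMemorySize before it can be launched with more dynamic shared memory (up to 96 KB per block on a V100). A minimal sketch follows; the 64 KB request and the kernel body are assumptions, since the original code is truncated.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void __launch_bounds__(768, 1) my_kernel(float2* out) {
    extern __shared__ float2 sh[];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    sh[tid] = make_float2(0.0f, 0.0f);   // placeholder body
    __syncthreads();
    if (tid == 0) out[blockIdx.x] = sh[0];
}

int main() {
    size_t shmem_bytes = 64 * 1024;      // more than the 48 KB default
    // Opt in: without this call, any launch requesting > 48 KB fails.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)shmem_bytes);
    float2* out;
    cudaMalloc(&out, sizeof(float2));
    dim3 blk(32, 24);                    // 768 threads, as in __launch_bounds__
    my_kernel<<<1, blk, shmem_bytes>>>(out);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}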

Video card, graphics driver, VRAM, GPU, CUDA, cuDNN

流过昼夜 Submitted on 2021-01-13 05:54:59
Video card: a "video card" or "graphics card", also called a display adapter, is a hardware concept (similar to a network card). Installed on the motherboard, it performs digital-to-analog conversion between the computer and the display, turning the computer's digital signals into signals the monitor can show. A video card is standard equipment: a computer must have one to display images, and on ordinary computers it is usually integrated on the motherboard.

Graphics driver: the driver is the bridge between the video card and the computer; it lets the system recognize the GPU hardware and must be installed correctly. Different vendors and different GPU models require different drivers. Non-developers do not need to install CUDA or cuDNN, but the graphics driver is a must. To check the NVIDIA card and driver version information: nvidia-smi

VRAM: also called the frame buffer, it stores rendering data the GPU has processed or is about to fetch. VRAM is to the GPU what main memory is to the CPU.

GPU: the Graphics Processing Unit is a chip on the video card and its core device; the GPU's relationship to the video card is like the CPU's relationship to the motherboard. Early GPUs were used mainly for graphics rendering, letting computers display more realistic, more detailed images; a powerful GPU could run big 3D games smoothly without stuttering, which is how most people first came to know GPUs and video cards. Later, people found GPUs could do much more, such as matrix and floating-point operations, and in particular could accelerate the training of neural-network models, so GPUs have gone further and further down the road of parallel computing. You could say the GPU has opened up many possibilities for artificial intelligence.
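Since the post distinguishes the GPU chip, the driver, and VRAM, a small CUDA runtime program can make the distinction concrete. This is a sketch using only standard runtime calls; nvidia-smi reports much of the same information from the driver side.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        // p.name identifies the GPU chip; totalGlobalMem is its VRAM.
        printf("GPU %d: %s, %.1f GB VRAM, compute capability %d.%d\n",
               i, p.name, p.totalGlobalMem / 1073741824.0, p.major, p.minor);
    }
    int drv = 0, rt = 0;
    cudaDriverGetVersion(&drv);   // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&rt);   // CUDA runtime version this program was built against
    printf("driver CUDA version %d, runtime CUDA version %d\n", drv, rt);
    return 0;
}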

How to choose a PC configuration for the Redshift renderer!

被刻印的时光 ゝ Submitted on 2021-01-10 03:37:44
Rambling preamble: I recently got an industrial model for a rendering test: 36 million triangles, using close to 16 GB of memory. My home machine has a GTX 960 with only 2 GB of VRAM, nowhere near enough. Add a little more geometry and it overflows VRAM and refuses to render, which was painful; that's what I get for choosing Redshift. But it undeniably produces images fast. To keep the test going, I kept optimizing the scene and deleted everything that couldn't be seen. I finally got the triangle count down to about twenty million and could just barely render a frame, adjusting samples gingerly, not daring to raise a single parameter, not even daring to open the doors. What a miserable test. Here's one of the rendered images as a keepsake! (This was just a test someone sent me; it may not actually be Siemens equipment, I placed that on it myself. If it's inappropriate, contact me and I'll take it down.)

OK, back to the point: to keep up with the times, the home machine needs replacing, so I studied Redshift's hardware requirements. Key points:
- Gaming cards and professional cards show no difference in Redshift rendering performance (though on Windows a professional card will be faster).
- 1 GB of VRAM holds roughly 20 to 33 million triangles; estimate your own needs from that. (At that rate, the 36-million-triangle scene above needs roughly 1.1 to 1.8 GB for geometry alone, before textures, which is why a 2 GB card overflowed.)
- With a single GPU, the more VRAM the better: rendering will be faster.
- VRAM across multiple GPUs cannot be pooled. With several cards working together you can render several frames at once, which saves render time.
- A CPU with a high single-core clock is the better choice.
- Install at least twice as much system RAM as you have VRAM.
- Put the local texture-cache folder on an SSD to speed up reads.
- Network read speed affects render speed; test the difference in render time between local and network paths with large proxy files (e.g., a 30-million-triangle proxy).

YOLOv5 model training

試著忘記壹切 Submitted on 2021-01-07 09:37:02
Software and hardware environment: Ubuntu 18.04 64-bit, Anaconda with Python 3.7, NVIDIA GTX 1070 Ti, CUDA 10.1, PyTorch 1.5, YOLOv5.

YOLOv5 environment setup: see the earlier article, "YOLOv5 object detection".

Using the COCO dataset: YOLOv5's pretrained models are trained on the COCO dataset. If you want to reproduce the training yourself, follow the command below; --batch-size 64 is for yolov5s, with 48 for yolov5m, 32 for yolov5l, and 16 for yolov5x. $ python train.py --data coco.yaml --cfg yolov5s.yaml --weights '' --batch-size 64 The COCO dataset can be downloaded with the get_coco2017.sh script in the data folder, which fetches the images and label files. COCO is just too big, though: the whole archive is 18 GB. Considering my network speed and my machine's compute, I'd better call it a night...

Making your own dataset: if there is no public dataset for your target, you have to collect the images yourself, and once they're in hand the hard labeling work begins. Here I use the tool LabelImg; the download address is https://github.com/tzutalin/labelImg/releases/tag/v1.8.1. LabelImg has a graphical interface built with Qt and is quite convenient to use, which is why I chose it; it provides the default

Problems encountered when first learning BERT with PyTorch: sentence feature extraction

旧城冷巷雨未停 Submitted on 2021-01-06 06:07:29
Reference link: https://blog.csdn.net/weixin_41519463/article/details/100863313

import torch
import torch.nn as nn
from pytorch_transformers import BertModel, BertConfig, BertTokenizer

# use the GPU if available
device0 = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# input processing
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # load the tokenizer of the pretrained model
# text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"  # start/end markers
# tokenized_text = tokenizer.tokenize(text)  # split the sentence into tokens
# indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)  # indices of the tokens in the pretrained vocabulary
# segments_ids = [0,

Thrust scan of just one class member

可紊 Submitted on 2021-01-05 12:00:07
Question: I have a custom class myClass which has members weight and config. I'd like to run an inclusive scan on a bunch of myClass objects, but only on the weights. Basically what I want is to take: [{configA, weightA}, {configB, weightB}, {configC, weightC}, ...] to: [{configA, weightA}, {configB, weightA + weightB}, {configC, weightA + weightB + weightC}, ...] Is there a simple way to do this using Thrust's fancy iterators? Since the binaryOp is required to be associative, I don't see how to do
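The excerpt ends mid-sentence, but note that an op of this shape can in fact be associative. Below is a sketch (not necessarily the original answerer's approach) that scans whole myClass values, summing the weights and keeping the right-hand config: any parenthesization yields the rightmost config and the total weight, so thrust::inclusive_scan accepts it. The member types are assumptions.

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

struct myClass {
    int   config;   // assumed type
    float weight;   // assumed type
};

struct weight_sum {
    __host__ __device__
    myClass operator()(const myClass& a, const myClass& b) const {
        // Keep b's config, accumulate the weights: associative as required.
        return myClass{b.config, a.weight + b.weight};
    }
};

int main() {
    thrust::device_vector<myClass> v(3);
    v[0] = myClass{0, 1.0f};
    v[1] = myClass{1, 2.0f};
    v[2] = myClass{2, 4.0f};
    thrust::inclusive_scan(v.begin(), v.end(), v.begin(), weight_sum());
    for (int i = 0; i < 3; ++i) {
        myClass c = v[i];
        printf("{config=%d, weight=%g}\n", c.config, c.weight);
    }
    return 0;   // weights print as 1, 3, 7; configs are unchanged
}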

CUDA: Atomic operations on unsigned chars

霸气de小男生 Submitted on 2021-01-03 02:13:13
Question: I'm a CUDA beginner. I have a pixel buffer of unsigned chars in global memory that can be, and is, updated by any and all threads. To avoid weirdness in the pixel values, I therefore want to perform an atomicExch when a thread attempts to update one. But the programming guide says that this function only works on 32- or 64-bit words, whereas I just want to atomically exchange one 8-bit byte. Is there a way to do this? Thanks. Answer 1: You might implement a critical section using a mutex variable. So
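The answer excerpt is cut off, but besides a mutex, the standard workaround is to emulate a byte-wide exchange with atomicCAS on the aligned 32-bit word that contains the byte. A sketch follows; the buffer layout and test kernel are assumptions.

#include <cstdio>
#include <cuda_runtime.h>

// Emulate atomicExch for one byte via atomicCAS on the enclosing word.
__device__ unsigned char atomicExchByte(unsigned char* addr, unsigned char val) {
    unsigned int* word  = (unsigned int*)((size_t)addr & ~(size_t)3); // aligned word
    unsigned int  shift = ((size_t)addr & 3) * 8;                     // byte's bit offset
    unsigned int  mask  = 0xFFu << shift;
    unsigned int  old   = *word, assumed;
    do {
        assumed = old;
        unsigned int desired = (assumed & ~mask) | ((unsigned int)val << shift);
        old = atomicCAS(word, assumed, desired);
    } while (old != assumed);                         // retry if another thread raced us
    return (unsigned char)((old & mask) >> shift);    // previous byte value
}

__global__ void poke(unsigned char* buf) {
    atomicExchByte(&buf[0], (unsigned char)threadIdx.x);  // all threads hit one byte
}

int main() {
    unsigned char* buf;
    cudaMalloc(&buf, 4);
    cudaMemset(buf, 0, 4);
    poke<<<1, 64>>>(buf);
    unsigned char h[4];
    cudaMemcpy(h, buf, 4, cudaMemcpyDeviceToHost);
    printf("buf[0] = %u\n", h[0]);   // the index of whichever thread wrote last
    cudaFree(buf);
    return 0;
}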

How to generalize fast matrix multiplication on GPU using numba

梦想与她 Submitted on 2021-01-01 10:20:11
Question: Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been working through the tutorial on their website, and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for fast matrix multiplication a bit (which is of the form A*B=C). When testing, I noticed that matrices with dimensions that are not perfectly divisible by the
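The excerpt is cut off, but the usual pitfall when generalizing the tiled example is that edge tiles overhang the matrices, so every shared-memory load and the final store need bounds checks (out-of-range cells are padded with zeros so they contribute nothing to the dot product). The Numba tutorial's kernel mirrors the classic CUDA shared-memory matmul; below is a boundary-checked sketch of that pattern in CUDA C++, with the tile width and float element type as assumptions.

#include <cuda_runtime.h>

#define TILE 16   // assumed tile width; pairs with a 16x16 thread block

// C (MxN) = A (MxK) * B (KxN), no divisibility requirement on M, N, K.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Guard the loads: overhanging cells become 0.
        sA[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int i = 0; i < TILE; ++i)
            acc += sA[threadIdx.y][i] * sB[i][threadIdx.x];
        __syncthreads();
    }
    // Guard the store: edge blocks contain threads outside C.
    if (row < M && col < N)
        C[row * N + col] = acc;
}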