cuda

Using maximum shared memory in CUDA

℡╲_俬逩灬. Submitted on 2021-01-15 11:54:01
Question: I am unable to use more than 48 KB of shared memory (on a V100, CUDA 10.2). I call cudaFuncSetAttribute(my_kernel, cudaFuncAttributePreferredSharedMemoryCarveout, cudaSharedmemCarveoutMaxShared); before launching my_kernel for the first time. I use launch bounds and dynamic shared memory inside my_kernel: __global__ void __launch_bounds__(768, 1) my_kernel(...) { extern __shared__ float2 sh[]; ... } The kernel is called like this: dim3 blk(32, 24); // 768 threads as in launch_bounds. my_kernel<<<grd, blk, 64
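The excerpt cuts off at the launch line, but on Volta the carveout preference alone does not lift the 48 KB default: the carveout attribute is only a hint about the L1/shared-memory split, and a kernel must additionally opt in per function via cudaFuncAttributeMaxDynamicSharedMemorySize before it can be launched with more dynamic shared memory (up to 96 KB per block on a V100). A minimal sketch follows; the 64 KB request and the kernel body are assumptions, since the original code is truncated.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void __launch_bounds__(768, 1) my_kernel(float2* out) {
    extern __shared__ float2 sh[];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    sh[tid] = make_float2(0.0f, 0.0f);   // placeholder body
    __syncthreads();
    if (tid == 0) out[blockIdx.x] = sh[0];
}

int main() {
    size_t shmem_bytes = 64 * 1024;      // more than the 48 KB default
    // Opt in: without this call, any launch requesting > 48 KB fails.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)shmem_bytes);
    float2* out;
    cudaMalloc(&out, sizeof(float2));
    dim3 blk(32, 24);                    // 768 threads, as in __launch_bounds__
    my_kernel<<<1, blk, shmem_bytes>>>(out);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}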

Video card, graphics driver, VRAM, GPU, CUDA, cuDNN

流过昼夜 Submitted on 2021-01-13 05:54:59
Video card: a "video card" or "graphics card", also called a display adapter, is a hardware concept (similar to a network card). Installed on the motherboard, it performs digital-to-analog conversion between the computer and the display, turning the computer's digital signals into signals the monitor can show. A video card is standard equipment: a computer must have one to display images, and on ordinary computers it is usually integrated on the motherboard.

Graphics driver: the driver is the bridge between the video card and the computer; it lets the system recognize the GPU hardware and must be installed correctly. Different vendors and different GPU models require different drivers. Non-developers do not need to install CUDA or cuDNN, but the graphics driver is a must. To check the NVIDIA card and driver version information: nvidia-smi

VRAM: also called the frame buffer, it stores rendering data the GPU has processed or is about to fetch. VRAM is to the GPU what main memory is to the CPU.

GPU: the Graphics Processing Unit is a chip on the video card and its core device; the GPU's relationship to the video card is like the CPU's relationship to the motherboard. Early GPUs were used mainly for graphics rendering, letting computers display more realistic, more detailed images; a powerful GPU could run big 3D games smoothly without stuttering, which is how most people first came to know GPUs and video cards. Later, people found GPUs could do much more, such as matrix and floating-point operations, and in particular could accelerate the training of neural-network models, so GPUs have gone further and further down the road of parallel computing. You could say the GPU has opened up many possibilities for artificial intelligence.
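Since the post distinguishes the GPU chip, the driver, and VRAM, a small CUDA runtime program can make the distinction concrete. This is a sketch using only standard runtime calls; nvidia-smi reports much of the same information from the driver side.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        // p.name identifies the GPU chip; totalGlobalMem is its VRAM.
        printf("GPU %d: %s, %.1f GB VRAM, compute capability %d.%d\n",
               i, p.name, p.totalGlobalMem / 1073741824.0, p.major, p.minor);
    }
    int drv = 0, rt = 0;
    cudaDriverGetVersion(&drv);   // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&rt);   // CUDA runtime version this program was built against
    printf("driver CUDA version %d, runtime CUDA version %d\n", drv, rt);
    return 0;
}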

How to choose a PC configuration for the Redshift renderer!

被刻印的时光 ゝ Submitted on 2021-01-10 03:37:44
Rambling preamble: I recently got an industrial model for a rendering test: 36 million triangles, using close to 16 GB of memory. My home machine has a GTX 960 with only 2 GB of VRAM, nowhere near enough. Add a little more geometry and it overflows VRAM and refuses to render, which was painful; that's what I get for choosing Redshift. But it undeniably produces images fast. To keep the test going, I kept optimizing the scene and deleted everything that couldn't be seen. I finally got the triangle count down to about twenty million and could just barely render a frame, adjusting samples gingerly, not daring to raise a single parameter, not even daring to open the doors. What a miserable test. Here's one of the rendered images as a keepsake! (This was just a test someone sent me; it may not actually be Siemens equipment, I placed that on it myself. If it's inappropriate, contact me and I'll take it down.)

OK, back to the point: to keep up with the times, the home machine needs replacing, so I studied Redshift's hardware requirements. Key points:
- Gaming cards and professional cards show no difference in Redshift rendering performance (though on Windows a professional card will be faster).
- 1 GB of VRAM holds roughly 20 to 33 million triangles; estimate your own needs from that. (At that rate, the 36-million-triangle scene above needs roughly 1.1 to 1.8 GB for geometry alone, before textures, which is why a 2 GB card overflowed.)
- With a single GPU, the more VRAM the better: rendering will be faster.
- VRAM across multiple GPUs cannot be pooled. With several cards working together you can render several frames at once, which saves render time.
- A CPU with a high single-core clock is the better choice.
- Install at least twice as much system RAM as you have VRAM.
- Put the local texture-cache folder on an SSD to speed up reads.
- Network read speed affects render speed; test the difference in render time between local and network paths with large proxy files (e.g., a 30-million-triangle proxy).

YOLOv5 model training

試著忘記壹切 Submitted on 2021-01-07 09:37:02
Software and hardware environment: Ubuntu 18.04 64-bit, Anaconda with Python 3.7, NVIDIA GTX 1070 Ti, CUDA 10.1, PyTorch 1.5, YOLOv5.

YOLOv5 environment setup: see the earlier article, "YOLOv5 object detection".

Using the COCO dataset: YOLOv5's pretrained models are trained on the COCO dataset. If you want to reproduce the training yourself, follow the command below; --batch-size 64 is for yolov5s, with 48 for yolov5m, 32 for yolov5l, and 16 for yolov5x. $ python train.py --data coco.yaml --cfg yolov5s.yaml --weights '' --batch-size 64 The COCO dataset can be downloaded with the get_coco2017.sh script in the data folder, which fetches the images and label files. COCO is just too big, though: the whole archive is 18 GB. Considering my network speed and my machine's compute, I'd better call it a night...

Making your own dataset: if there is no public dataset for your target, you have to collect the images yourself, and once they're in hand the hard labeling work begins. Here I use the tool LabelImg; the download address is https://github.com/tzutalin/labelImg/releases/tag/v1.8.1. LabelImg has a graphical interface built with Qt and is quite convenient to use, which is why I chose it; it provides the default

Problems encountered when first learning BERT with PyTorch: sentence feature extraction

旧城冷巷雨未停 Submitted on 2021-01-06 06:07:29
Reference link: https://blog.csdn.net/weixin_41519463/article/details/100863313

import torch
import torch.nn as nn
from pytorch_transformers import BertModel, BertConfig, BertTokenizer

# use the GPU if available
device0 = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# input processing
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # load the tokenizer of the pretrained model
# text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"  # start/end markers
# tokenized_text = tokenizer.tokenize(text)  # split the sentence into tokens
# indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)  # indices of the tokens in the pretrained vocabulary
# segments_ids = [0,

Thrust scan of just one class member

可紊 Submitted on 2021-01-05 12:00:07
Question: I have a custom class myClass which has members weight and config. I'd like to run an inclusive scan on a bunch of myClass objects, but only on the weights. Basically what I want is to take: [{configA, weightA}, {configB, weightB}, {configC, weightC}, ...] to: [{configA, weightA}, {configB, weightA + weightB}, {configC, weightA + weightB + weightC}, ...] Is there a simple way to do this using Thrust's fancy iterators? Since the binaryOp is required to be associative, I don't see how to do
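The excerpt ends mid-sentence, but note that an op of this shape can in fact be associative. Below is a sketch (not necessarily the original answerer's approach) that scans whole myClass values, summing the weights and keeping the right-hand config: any parenthesization yields the rightmost config and the total weight, so thrust::inclusive_scan accepts it. The member types are assumptions.

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

struct myClass {
    int   config;   // assumed type
    float weight;   // assumed type
};

struct weight_sum {
    __host__ __device__
    myClass operator()(const myClass& a, const myClass& b) const {
        // Keep b's config, accumulate the weights: associative as required.
        return myClass{b.config, a.weight + b.weight};
    }
};

int main() {
    thrust::device_vector<myClass> v(3);
    v[0] = myClass{0, 1.0f};
    v[1] = myClass{1, 2.0f};
    v[2] = myClass{2, 4.0f};
    thrust::inclusive_scan(v.begin(), v.end(), v.begin(), weight_sum());
    for (int i = 0; i < 3; ++i) {
        myClass c = v[i];
        printf("{config=%d, weight=%g}\n", c.config, c.weight);
    }
    return 0;   // weights print as 1, 3, 7; configs are unchanged
}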

CUDA: Atomic operations on unsigned chars

霸气de小男生 Submitted on 2021-01-03 02:13:13
Question: I'm a CUDA beginner. I have a pixel buffer of unsigned chars in global memory that can be, and is, updated by any and all threads. To avoid weirdness in the pixel values, I therefore want to perform an atomicExch when a thread attempts to update one. But the programming guide says that this function only works on 32- or 64-bit words, whereas I just want to atomically exchange one 8-bit byte. Is there a way to do this? Thanks. Answer 1: You might implement a critical section using a mutex variable. So
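The answer excerpt is cut off, but besides a mutex, the standard workaround is to emulate a byte-wide exchange with atomicCAS on the aligned 32-bit word that contains the byte. A sketch follows; the buffer layout and test kernel are assumptions.

#include <cstdio>
#include <cuda_runtime.h>

// Emulate atomicExch for one byte via atomicCAS on the enclosing word.
__device__ unsigned char atomicExchByte(unsigned char* addr, unsigned char val) {
    unsigned int* word  = (unsigned int*)((size_t)addr & ~(size_t)3); // aligned word
    unsigned int  shift = ((size_t)addr & 3) * 8;                     // byte's bit offset
    unsigned int  mask  = 0xFFu << shift;
    unsigned int  old   = *word, assumed;
    do {
        assumed = old;
        unsigned int desired = (assumed & ~mask) | ((unsigned int)val << shift);
        old = atomicCAS(word, assumed, desired);
    } while (old != assumed);                         // retry if another thread raced us
    return (unsigned char)((old & mask) >> shift);    // previous byte value
}

__global__ void poke(unsigned char* buf) {
    atomicExchByte(&buf[0], (unsigned char)threadIdx.x);  // all threads hit one byte
}

int main() {
    unsigned char* buf;
    cudaMalloc(&buf, 4);
    cudaMemset(buf, 0, 4);
    poke<<<1, 64>>>(buf);
    unsigned char h[4];
    cudaMemcpy(h, buf, 4, cudaMemcpyDeviceToHost);
    printf("buf[0] = %u\n", h[0]);   // the index of whichever thread wrote last
    cudaFree(buf);
    return 0;
}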

How to generalize fast matrix multiplication on GPU using numba

梦想与她 Submitted on 2021-01-01 10:20:11
Question: Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been working through the tutorial on their website, and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for fast matrix multiplication a bit (which is of the form A*B=C). When testing, I noticed that matrices with dimensions that are not perfectly divisible by the
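The excerpt is cut off, but the usual pitfall when generalizing the tiled example is that edge tiles overhang the matrices, so every shared-memory load and the final store need bounds checks (out-of-range cells are padded with zeros so they contribute nothing to the dot product). The Numba tutorial's kernel mirrors the classic CUDA shared-memory matmul; below is a boundary-checked sketch of that pattern in CUDA C++, with the tile width and float element type as assumptions.

#include <cuda_runtime.h>

#define TILE 16   // assumed tile width; pairs with a 16x16 thread block

// C (MxN) = A (MxK) * B (KxN), no divisibility requirement on M, N, K.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Guard the loads: overhanging cells become 0.
        sA[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int i = 0; i < TILE; ++i)
            acc += sA[threadIdx.y][i] * sB[i][threadIdx.x];
        __syncthreads();
    }
    // Guard the store: edge blocks contain threads outside C.
    if (row < M && col < N)
        C[row * N + col] = acc;
}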