cuda | 易学教程

Processing Shared Work Queue Using CUDA Atomic Operations and Grid Synchronization

Question: I'm trying to write a kernel whose threads iteratively process items in a work queue. My understanding is that I should be able to do this by using atomic operations to manipulate the work queue (i.e., grab work items from the queue and insert new work items into the queue), and using grid synchronization via cooperative groups to ensure all threads are at the same iteration (I ensure the number of thread blocks doesn't exceed the device capacity for the kernel). However, sometimes I observe

The CUDA Compilation Process

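Workflow:
1. The code in test.cu is split: cudafe.exe separates the CPU code from the GPU code. The generated intermediate files include test.cudafe1.cpp and test.cudafe1.gpu.
2. cicc.exe compiles the GPU code into a test.ptx file for the architecture given by the compile option -arch=compute_xx.
3. ptxas.exe compiles test.ptx into test.cubin, as determined by the compile option -code=sm_xx, for example test.sm_30.cubin. (This step is called PTX offline compilation. Its main purpose is to compile the code for one specific compute capability and SM version; the corresponding version information is stored in the cubin.)
4. fatbinary.exe packs test.cubin and test.ptx into test.fatbin.c. (The article calls this step PTX online compilation; it stores the version information from the cubin and the PTX in the fatbin.)
5. The system gcc/g++ is invoked to compile the host code (test.cudafe1.cpp) and the fatbin (test.fatbin.c) into the object files test.o and test._dlink.o.
6. The C++ compiler links the object files into an executable.
Experiment: nvcc --cuda test.cu --keep --dryrun
Details: -arch=compute_XX, -code=sm_XX. If both are given they must be written this way, that is, compute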
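To try the experiment command from a script, one possible approach is a small Python wrapper. This is only a sketch: the helper names `nvcc_dryrun_command` and `run_if_available` are made up for illustration, and the example values 30/30 simply mirror the sm_30 file name quoted in the excerpt (recent toolkits may no longer accept that architecture).

```python
import shutil
import subprocess


def nvcc_dryrun_command(source, arch=None, code=None):
    """Build the argv for `nvcc --cuda <source> --keep --dryrun`.

    arch and code are optional strings such as "30"; when given they are
    appended as -arch=compute_<arch> and -code=sm_<code>.
    (Hypothetical helper, not part of any NVIDIA tooling.)
    """
    cmd = ["nvcc", "--cuda", source, "--keep", "--dryrun"]
    if arch is not None:
        cmd.append(f"-arch=compute_{arch}")
    if code is not None:
        cmd.append(f"-code=sm_{code}")
    return cmd


def run_if_available(cmd):
    """Run the command only if its executable is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    # Return both streams so the listed sub-commands are captured either way.
    return result.stdout + result.stderr


if __name__ == "__main__":
    command = nvcc_dryrun_command("test.cu", arch="30", code="30")
    print(" ".join(command))
    output = run_if_available(command)
    print(output if output is not None else "nvcc not found on PATH")
```

With --dryrun, nvcc only lists the sub-commands it would run, which makes it possible to compare them against the steps described above without building anything.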

How to Speed Up NumPy 700x? Use CuPy

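As an extension library for the Python language, NumPy supports a wide range of multidimensional array and matrix operations and has been a great help to the Python community. With NumPy, data scientists, machine learning practitioners, and statisticians can process large amounts of matrix data simply and efficiently. Can NumPy be made even faster? This article shows how to use the CuPy library to accelerate NumPy operations. Selected from towardsdatascience; author: George Seif; compiled by 机器之心, with contributions from 杜伟 and 张倩. On its own, NumPy is already much faster than plain Python. When Python code runs slowly, especially when it contains many for-loops, you can usually move the data processing into NumPy and let its vectorized operations run at full speed. However, this NumPy speedup is achieved only on the CPU. Consumer CPUs typically have 8 cores or fewer, so the degree of parallelism, and with it the achievable speedup, is limited. This gave rise to a new acceleration tool: the CuPy library. What is CuPy? CuPy is a library that implements NumPy arrays on NVIDIA GPUs by means of the CUDA GPU libraries. With this implementation of NumPy arrays, the GPU's many CUDA cores allow for much better parallel speedup. The CuPy interface mirrors NumPy's, and in most cases it can be used as a drop-in replacement for NumPy. You only need to use the compatible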
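The drop-in idea can be sketched as follows. This is a minimal illustration, not code from the article: the function names `pick_array_module` and `scaled_sum` are hypothetical, and the sketch assumes that falling back to NumPy is acceptable when CuPy or a GPU is missing.

```python
def pick_array_module():
    """Return cupy if it imports and a GPU is visible, else numpy, else None."""
    try:
        import cupy
        if cupy.cuda.runtime.getDeviceCount() > 0:
            return cupy
    except Exception:
        # CuPy missing, or no usable CUDA driver/device: fall through to NumPy.
        pass
    try:
        import numpy
        return numpy
    except ImportError:
        return None


def scaled_sum(xp, n=1000):
    """Create an n x n array of ones, multiply by 5, and return the total.

    The same calls work for both modules because CuPy mirrors the NumPy API.
    """
    a = xp.ones((n, n), dtype=xp.float32)
    b = a * 5
    return float(b.sum())


if __name__ == "__main__":
    xp = pick_array_module()
    if xp is None:
        print("neither CuPy nor NumPy is installed")
    else:
        print(xp.__name__, scaled_sum(xp))
```

Passing the module in as `xp` keeps the numerical code identical on CPU and GPU, so the only place that knows about CuPy is the module selection.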

TensorFlow: Pitfalls Encountered When Upgrading to TensorFlow 2.3 (Python)

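Contents: Preface; 1. CUDA version issue; 2. GPU support issue; Summary. Preface: I had been using TensorFlow 2.0 and, while working with time-series data, found a very handy function: tf.keras.preprocessing.timeseries_dataset_from_array. Calling it raised an error saying the function did not exist, and only then did I learn that it requires TensorFlow 2.3 or later. I decided to upgrade to TensorFlow 2.3 and hit a few pitfalls along the way, which I record here. 1. CUDA version issue: I was originally using CUDA 10.0. The official site states: TensorFlow supports CUDA® 10.1 (TensorFlow 2.1.0 and later). Assuming that CUDA is generally backward compatible, I upgraded straight to CUDA 11.0 so that I would not have to upgrade again later. But once the whole environment was configured, this error appeared: Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found. I was baffled. After some searching I found that the CUDA version has to be exactly 10.1, and even a slightly higher version will not work, so I reinstalled. A small tip: CUDA downloads can be very slow, and copying the download link into 迅雷 and downloading it there is much faster. 2. GPU support issue: With the CUDA version issue solved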
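One way to avoid guessing the required CUDA version is to ask the installed TensorFlow which version it was built against. The sketch below is an illustration rather than the author's method: `tensorflow_cuda_info` is a hypothetical helper, and it assumes `tf.sysconfig.get_build_info()` is present, which holds for TensorFlow 2.3 and later (older versions return no build details here).

```python
def tensorflow_cuda_info():
    """Report the CUDA/cuDNN versions TensorFlow was built for, or None if TF is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # get_build_info exists in TensorFlow 2.3+; older releases lack it.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    build = dict(get_build_info()) if get_build_info is not None else {}

    return {
        "tf_version": tf.__version__,
        # These keys may be missing on CPU-only builds, hence .get().
        "built_for_cuda": build.get("cuda_version"),
        "built_for_cudnn": build.get("cudnn_version"),
        "visible_gpus": len(tf.config.list_physical_devices("GPU")),
    }


if __name__ == "__main__":
    print(tensorflow_cuda_info())
```

If `built_for_cuda` differs from the installed toolkit version, a missing-library error like the cudart64_101.dll message above is a likely outcome.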

Installing mmdetection with Docker

Installing mmdetection from a Docker image. Problems encountered:
1. No CUDA. Fix: specify CUDA in the environment.
2. GCC version too old. Fix: upgrade GCC to version 5.3.
The Dockerfile is as follows:
    FROM repo.jd.local/public/das:pytorch1.1.0-py3-gpu
    # Maintainer information, or other label information
    LABEL maintainer=rengang@jd.com
    # CUDA environment
    ENV FORCE_CUDA 1
    RUN yum install -y tar
    # RUN yum -y install centos-release-scl
    # RUN yum -y install devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-binutils
    # RUN echo "source /opt/rh/devtoolset-7/enable" >>/etc/profile
    # Upgrade the GCC version
    RUN wget https://copr.fedoraproject.org/coprs/hhorak/devtoolset-4-rebuild-bootstrap/repo/epel-7/hhorak-devtoolset-4-rebuild-bootstrap-epel-7.repo -O /etc/yum.repos.d

How do I know the maximum number of threads per block in python code with either numba or tensorflow installed?

Question: Is there any code in python with either numba or tensorflow installed? For example, if I would like to know the GPU memory info, I can simply use:
    from numba import cuda
    gpus = cuda.gpus.lst
    for gpu in gpus:
        with gpu:
            meminfo = cuda.current_context().get_memory_info()
            print("%s, free: %s bytes, total, %s bytes" % (gpu, meminfo[0], meminfo[1]))
in numba. But I can not find any code that gives me the maximum threads per block info. I would like the code to detect the maximum number of threads
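A possible way to read this value with numba is through the device attributes exposed on the current device. This is a sketch under assumptions: the wrapper `max_threads_per_block` is a hypothetical helper, and it returns None when numba is not installed or no CUDA device is usable.

```python
def max_threads_per_block():
    """Return the current GPU's maximum threads per block, or None if unavailable."""
    try:
        from numba import cuda
    except ImportError:
        return None
    if not cuda.is_available():
        # No usable CUDA driver or device on this machine.
        return None
    device = cuda.get_current_device()
    # Device attributes are exposed as upper-case names on the device object.
    return int(device.MAX_THREADS_PER_BLOCK)


if __name__ == "__main__":
    print("max threads per block:", max_threads_per_block())
```

The returned number can then be used as an upper bound when choosing the block size for a kernel launch.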

CentOS 7 NVIDIA Docker Environment

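I have been working with TensorFlow lately. It is a tricky thing and I do not fully understand it, but after some stumbling I did get the environment set up. I had not originally planned to use Docker, but we have only one well-equipped server, it also has to run the company's other environments, and VMware ESXi was too much trouble, so I gave up on that.
Environment: OS: CentOS7 7.4 1708; GPU: Nvidia 1080Ti
Download everything that is needed:
1. docker-ce yum repo: https://download.docker.com/linux/centos/docker-ce.repo
2. nvidia-docker yum repo: https://nvidia.github.io/nvidia-docker/centos7/x86_64/nvidia-docker.repo
3. nvidia cuda yum repo: http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.1.85-1.x86_64.rpm
4. nvidia cudnn: https://developer.nvidia.com/cudnn (this requires a registered NVIDIA account, so no direct download link is given)
5. nvidia driver: http://www.nvidia.cn/Download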

Unable to create a thrust device vector

Question: So I'm trying to start on GPU programming and using the Thrust library to simplify things. I have created a test program to work with it and see how it works; however, whenever I try to create a thrust::device_vector with non-zero size, the program crashes with "Run-time Check Failure #3 - The variable 'result' is being used without being initialized." (this comes from the allocator_traits.inl file). And... I have no idea how to fix this. The following is all that is needed to cause this error.
