cuda | 易学教程

10 Tips for Avoiding PyTorch Pitfalls

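Reposted from: 机器之心 | Author: Eugene Khvedchenya | Contributors: 小舟, 蛋酱, 魔王. What does a high-performance PyTorch training pipeline look like? Is it the one that produces the most accurate model? The one that runs fastest? The one that is easy to understand and extend? Or the one that is easy to parallelize? The answer is all of the above. How do you get the most efficient PyTorch training with the least effort? A Medium blogger with two years of PyTorch experience recently shared 10 sincere pieces of advice on the subject. In the Efficient PyTorch part, the author offers tips for identifying and eliminating I/O and CPU bottlenecks. The second part covers tips for efficient tensor operations, and the third covers debugging techniques for efficient models. Before reading this article, you need some familiarity with PyTorch. Let's start with the most obvious one. Advice 0: Know where the bottlenecks in your code are. Command-line tools such as nvidia-smi, htop, iotop, nvtop, py-spy, and strace should become your best friends. Is your training pipeline CPU-bound? I/O-bound? GPU-bound? These tools will help you find the answer. You may never have heard of them, or you may have heard of them but never used them. That's fine. It's also fine if you don't start using them right away. Just remember that other people may be using them to train models 5%, 10%, 15% faster than you ……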

What does #pragma unroll do exactly? Does it affect the number of threads?

Question: I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique: __global__ void kernel(float *b, int size) { int tid = blockDim.x * blockIdx.x + threadIdx.x; #pragma unroll for(int i=0;i<size;i++) b[i]=i; } Above is my kernel function. In main I call it as follows: int main() { float * a; //host array float * b; //device array int size=100; a=(float*)malloc(size*sizeof(float)); cudaMalloc((float**)&b,size); cudaMemcpy(b, a, size,

How Is High-Performance PyTorch Forged? 10 Hard-Won Tips for Avoiding Pitfalls

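Selected from towardsdatascience | Author: Eugene Khvedchenya | Compiled by 机器之心 | Contributors: 小舟, 蛋酱, 魔王. What does a high-performance PyTorch training pipeline look like? Is it the one that produces the most accurate model? The one that runs fastest? The one that is easy to understand and extend? Or the one that is easy to parallelize? The answer is all of the above. How do you get the most efficient PyTorch training with the least effort? A Medium blogger with two years of PyTorch experience recently shared 10 sincere pieces of advice on the subject. In the Efficient PyTorch part, the author offers tips for identifying and eliminating I/O and CPU bottlenecks. The second part covers tips for efficient tensor operations, and the third covers debugging techniques for efficient models. Before reading this article, you need some familiarity with PyTorch. Let's start with the most obvious one. Advice 0: Know where the bottlenecks in your code are. Command-line tools such as nvidia-smi, htop, iotop, nvtop, py-spy, and strace should become your best friends. Is your training pipeline CPU-bound? I/O-bound? GPU-bound? These tools will help you find the answer. You may never have heard of them, or you may have heard of them but never used them. That's fine. It's also fine if you don't start using them right away. Just remember that other people may be using them to train models 5%, 10%, 15% faster than you …… which may ultimately lead to different outcomes when it comes to going to market or to job opportunities.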
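As a rough complement to the command-line tools named in Advice 0, the split between data loading and compute can also be estimated from inside the training loop. The sketch below is not from the original article: `profile_loop`, `slow_load`, and `fast_step` are hypothetical names, and the sleep merely stands in for slow disk reads. It uses only the Python standard library.

```python
import time


def profile_loop(load_batch, train_step, num_batches):
    """Return total seconds spent loading batches and running the step."""
    load_seconds = 0.0
    step_seconds = 0.0
    for _ in range(num_batches):
        start = time.perf_counter()
        batch = load_batch()        # stand-in for fetching from a data loader
        loaded = time.perf_counter()
        train_step(batch)           # stand-in for forward/backward/optimizer
        done = time.perf_counter()
        load_seconds += loaded - start
        step_seconds += done - loaded
    return load_seconds, step_seconds


def slow_load():
    # Hypothetical loader: the sleep imitates slow I/O.
    time.sleep(0.02)
    return list(range(8))


def fast_step(batch):
    # Hypothetical compute step: trivially cheap arithmetic.
    return sum(x * x for x in batch)


load_s, step_s = profile_loop(slow_load, fast_step, num_batches=5)
print(f"loading: {load_s:.3f}s, compute: {step_s:.3f}s")
```

If loading dominates, as it does in this toy setup, the pipeline is likely I/O-bound or CPU-bound rather than GPU-bound. When the step runs on a GPU, CUDA work is launched asynchronously, so a wall-clock measurement like this is only meaningful if the device is synchronized (for example with `torch.cuda.synchronize()`) before each timestamp.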

Set CXX-standard to c++17 when combining C++ and CUDA in CMakeLists

Question: According to the CMake documentation, I just have to write project(${PROJECT_NAME} LANGUAGES CUDA CXX) when I want to combine CUDA files and native C++ files in one project. Then I no longer have to call cuda_add_executable() but can use add_executable instead, and CMake should figure out everything on its own. This works fine unless I specify a standard for the C++ code (by using set(CMAKE_CXX_STANDARD 17)). Then I get the error message Target requires the language dialect

How Do You Run Deep Learning on a Private Cloud? ZStack + Docker Supporting GPU Workloads in Practice

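Background: ZStack focuses on IaaS which, as the foundation of cloud computing, provides better isolation of physical resources and unified management of servers and other hardware, giving upper-layer workloads such as big data and deep learning with TensorFlow a stable and reliable base environment. In recent years, cloud computing has also developed PaaS-style services that differ from traditional virtualization and sit closer to the business. These services rely on Docker, as in typical container clouds such as K8S, and let you download images with the business software already packaged directly from an image store, so workloads can be deployed more quickly. GPU workloads are another typical customer scenario: compared with the computational characteristics of CPUs, GPUs have a clear advantage in data analysis and deep learning. How does ZStack combine with containers to support upper-layer workloads with an IaaS + PaaS one-two punch? This article shows how to deploy a CentOS 7.6 virtual machine on ZStack, deploy Docker inside the virtual machine, and use nvidia-docker to call the GPU from inside a container. Environment: VM operating system: CentOS 7.6; VM kernel: Linux 172-18-47-133 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux; Docker version: docker-ce 19.03; nvidia-docker version: nvidia-docker-1.0.11.x86_64; GPU: RTX6000; CUDA version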

Using atomic arithmetic operations in CUDA Unified Memory multi-GPU or multi-processor

Question: I am trying to implement a CUDA program that uses Unified Memory. I have two unified arrays, and sometimes they need to be updated atomically. The question below has an answer for a single-GPU environment, but I am not sure how to extend that answer to multi-GPU platforms. Question: cuda atomicAdd example fails to yield correct output. I have 4 Tesla K20s, if you need this information, and all of them update a part of those arrays, which must be done atomically. I would

PyCUDA: Pow within device code tries to use std::pow, fails

Question: The title more or less says it all. calling a host function("std::pow<int, int> ") from a __device__/__global__ function("_calc_psd") is not allowed. From my understanding, this should be using the CUDA pow function instead, but it isn't. Answer 1: The error is exactly as the compiler reported. You can't use host functions in device code, and that includes the whole host C++ std library. CUDA includes its own standard library, described in the programming guide, but you should use either pow or

Ubuntu: Questions About Downloading NVIDIA Drivers

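Ubuntu: questions about downloading NVIDIA drivers. Searching Baidu for "英伟达驱动下载" (NVIDIA driver download) leads to the NVIDIA driver download page, https://www.nvidia.cn/Download/index.aspx?lang=cn. I wanted to know the difference between GRD and SD, so I consulted several sources. The NVIDIA driver must match the GPU model in your computer. It can be installed on its own, or installed together with CUDA when you install CUDA. Reposted: 时问实答: How many steps does it take to download an NVIDIA graphics driver? @http://m.sohu.com/a/304501427_120099893 Drivers provide the essential software support that a graphics card needs in use. For gamers, new driver versions often add support and optimizations for the latest games, so upgrading the graphics driver is one of their routine tasks. Recently, however, many users have noticed that the driver download page on NVIDIA's official website has gained a number of extra options, such as "Driver Type" and "Download Type". In this installment of 时问实答, we focus on the problems gamers may run into when downloading NVIDIA drivers. Q: What are the ways to download NVIDIA drivers? A: Gamers can currently download NVIDIA drivers from NVIDIA's official website, from third-party mirror sites, and from the GFE software. These channels do not update at the same speed, though. In general, third-party sites update the slowest, because they take NVIDIA's official site as their baseline. And even NVIDIA itself provides two driver search entry points. As shown in the figure above, in the "Drivers" tab on the official home page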