gpu

Memory coalescing and nvprof results on NVIDIA Pascal

Submitted by 北城余情 on 2021-02-08 10:16:31
Question: I am running a memory coalescing experiment on Pascal and getting unexpected nvprof results. I have one kernel that copies 4 GB of floats from one array to another. nvprof reports confusing numbers for gld_transactions_per_request and gst_transactions_per_request. I ran the experiment on a TITAN Xp and a GeForce GTX 1080 Ti, with the same results.

#include <stdio.h>
#include <cstdint>
#include <assert.h>
#define N 1ULL*1024*1024*1024
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__);
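For context, a minimal sketch of the kind of copy kernel being benchmarked (the kernel name and launch configuration are my own illustration, not the asker's code):

#include <cstdint>

// Grid-stride copy: consecutive threads in a warp touch consecutive
// 4-byte addresses, so global loads and stores should fully coalesce.
__global__ void copyKernel(const float* __restrict__ src,
                           float* __restrict__ dst, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        dst[i] = src[i];
    }
}

int main() {
    const size_t n = 1ULL * 1024 * 1024 * 1024;  // 1 Gi floats = 4 GB per array
    float *src, *dst;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    copyKernel<<<1024, 256>>>(src, dst, n);
    cudaDeviceSynchronize();
    cudaFree(src);
    cudaFree(dst);
    return 0;
}

One hedged note on interpreting the metrics: Pascal-class GPUs account for L2/DRAM traffic in 32-byte sectors, so per-request transaction counts need not match Kepler-era intuition; checking nvprof's metric definitions for the specific architecture is usually the first step.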

stdatomic.h not found, for use in Swift & Metal compute shader

Submitted by 青春壹個敷衍的年華 on 2021-02-08 10:10:55
Question: I'm trying to use a struct with an atomic_int in a Metal compute shader. However, it says I need to #include "stdatomic.h", but every time I try, the file cannot be found.

#include "stdatomic.h" // 'stdatomic.h' file not found

I'm trying to build my application for macOS Catalina.

struct Fitness {
    atomic_int weight; // Declaration of 'atomic_int' must be imported from module 'Darwin.C.stdatomic' before it is required
    ...others...
};

I have tried placing a copy of stdatomic.h into

CUDA: Why is Thrust so slow at uploading data to the GPU?

Submitted by 梦想与她 on 2021-02-08 09:33:32
Question: I'm new to the GPU world and have just installed CUDA to write some programs. I played with the Thrust library but found that it is very slow when uploading data to the GPU: only about 35 MB/s for the host-to-device part on my decent desktop. Why is that?

Environment: Visual Studio 2012, CUDA 5.0, GTX 760, Intel i7, Windows 7 x64

GPU bandwidth test: it is supposed to reach at least 11 GB/s of transfer speed from host to device (or vice versa), but it didn't! Here's the test program:

#include <iostream>
#include
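The question is cut off above, but the classic causes of a reading this low are timing a Debug build, including CUDA context creation in the measured interval, or copying from ordinary pageable host memory. A minimal sketch of a pageable-vs-pinned comparison (buffer size and all names are mine, not the asker's benchmark):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ULL * 1024 * 1024;  // 256 MB test buffer
    float* dev;
    cudaMalloc(&dev, bytes);

    std::vector<float> pageable(bytes / sizeof(float));  // ordinary heap memory
    float* pinned;
    cudaMallocHost(&pinned, bytes);  // page-locked memory the driver can DMA from

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Host-to-device copy from pageable memory (extra staging copy in the driver).
    cudaEventRecord(start);
    cudaMemcpy(dev, pageable.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f GB/s\n", bytes / (ms * 1e6));

    // Host-to-device copy from pinned memory (direct DMA, usually much faster).
    cudaEventRecord(start);
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f GB/s\n", bytes / (ms * 1e6));

    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}

Thrust's device_vector constructor copies from whatever host memory you hand it, so using pinned host buffers (and timing only the copy, in a Release build) is the usual fix.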

The behavior of stream 0 (default) and other streams

Submitted by 旧巷老猫 on 2021-02-08 09:15:42
Question: In CUDA, how is stream 0 related to other streams? Does stream 0 (the default stream) execute concurrently with other streams in a context or not? Consider the following example:

cudaMemcpy(Dst, Src, sizeof(float)*datasize, cudaMemcpyHostToDevice); // stream 0
cudaStream_t stream1;
/* ...creating stream1... */
somekernel<<<blocks, threads, 0, stream1>>>(Dst); // stream 1

In the above code, can the compiler ensure somekernel always launches AFTER cudaMemcpy finishes, or will somekernel execute
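Assuming the legacy default-stream semantics (no --default-stream per-thread nvcc flag), stream 0 is a blocking stream: it does not run concurrently with other blocking streams, so somekernel cannot start before the cudaMemcpy has completed, and cudaMemcpy additionally blocks the host until the copy finishes. A hedged sketch contrasting the implicit and explicit orderings (sizes, kernel body and launch configuration are my own):

#include <cuda_runtime.h>

__global__ void somekernel(float* p) { p[threadIdx.x] = 0.0f; }  // placeholder body

int main() {
    const size_t n = 1 << 20;
    float *Dst, *Src;
    cudaMalloc(&Dst, n * sizeof(float));
    cudaMallocHost(&Src, n * sizeof(float));  // pinned, required for truly async copies

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    // Implicit ordering: the legacy default stream blocks the device,
    // so the stream1 kernel waits for this copy to finish.
    cudaMemcpy(Dst, Src, n * sizeof(float), cudaMemcpyHostToDevice);
    somekernel<<<256, 256, 0, stream1>>>(Dst);

    // Explicit ordering: issue the copy in stream1 itself, so the
    // dependency no longer relies on default-stream semantics.
    cudaMemcpyAsync(Dst, Src, n * sizeof(float), cudaMemcpyHostToDevice, stream1);
    somekernel<<<256, 256, 0, stream1>>>(Dst);

    cudaStreamSynchronize(stream1);
    cudaStreamDestroy(stream1);
    cudaFreeHost(Src);
    cudaFree(Dst);
    return 0;
}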

Access GPU hardware specifications in Python?

Submitted by 随声附和 on 2021-02-08 08:31:35
Question: I want to access various NVIDIA GPU specifications using Numba or a similar Python CUDA package: information such as available device memory, L2 cache size, memory clock frequency, etc. From reading this question, I learned I can access some of the information (but not all) through Numba's CUDA device interface.

from numba import cuda

device = cuda.get_current_device()
attribs = [s for s in dir(device) if s.isupper()]
for attr in attribs:
    print(attr, '=', getattr(device, attr))

Output on a
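As far as I know, Numba surfaces these values from CUDA's underlying device-attribute query, so a CUDA C++ sketch of the equivalent lookups shows what is and is not reachable by that route; the attribute enums below are real runtime-API names, but the program itself is only an illustration (device index 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int l2Bytes, memClockKHz, busWidthBits;
    cudaDeviceGetAttribute(&l2Bytes, cudaDevAttrL2CacheSize, 0);
    cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, 0);
    cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0);

    // Free memory is a runtime query, not a device attribute, which is
    // why it does not show up in an attribute listing like Numba's.
    size_t freeB, totalB;
    cudaMemGetInfo(&freeB, &totalB);

    printf("L2 cache: %d bytes\n", l2Bytes);
    printf("memory clock: %d kHz, bus width: %d bits\n", memClockKHz, busWidthBits);
    printf("memory: %zu free / %zu total bytes\n", freeB, totalB);
    return 0;
}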

Cannot import multi_gpu_model from keras.utils

Submitted by 狂风中的少年 on 2021-02-08 03:28:05
Question: I have tensorflow-gpu 1.2.1 and Keras on Ubuntu 16.04. I am not able to perform:

from keras.utils import multi_gpu_model

Has anyone had success with multi_gpu_model as described in their documentation's FAQ section? I have a 4-GPU machine with four GeForce GTX 1080 Ti cards and want to use all of them. Here's the error I get:

import keras.utils.multi_gpu_model
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)

Switch CUDA compute mode to default mode

Submitted by 橙三吉。 on 2021-02-07 23:12:12
Question: I use nvidia-smi to see the status of each GPU on a computing node, but I find one of them is in E. Thread mode. Is there an easy way to switch it back to the default mode?

+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|======================
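E. Thread is the EXCLUSIVE_THREAD compute mode; switching back is done with nvidia-smi and needs administrator rights. For completeness, a small CUDA C++ sketch that checks the current mode from the runtime (device index 0 is an assumption of mine):

#include <cstdio>
#include <cuda_runtime.h>

// The actual switch is a driver-tool command, e.g.:
//   sudo nvidia-smi -i 0 -c DEFAULT
// This program only verifies what mode the runtime currently sees.
int main() {
    int mode;
    cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    // 0 = Default, 1 = Exclusive (thread), 2 = Prohibited, 3 = Exclusive Process
    printf("compute mode of device 0: %d\n", mode);
    return 0;
}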

Login password required to access Jupyter notebook running in nvidia-docker container

Submitted by 為{幸葍}努か on 2021-02-07 21:51:12
Question: I ran the following commands, in order, to run TensorFlow in a Docker container after a successful installation on Ubuntu 16.04 (NVIDIA GeForce 840M GPU):

1. sudo service docker start
2. sudo nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu

Then I try to access Jupyter in the Firefox browser by typing localhost:8888, and I am asked to enter a login password in the browser. What is the solution?

Answer 1: Add the option "-e PASSWORD=password" to set the environment variable.

GPU-based search for all possible paths between two nodes on a graph

Submitted by 牧云@^-^@ on 2021-02-07 17:23:31
Question: My work makes extensive use of the algorithm by Migliore, Martorana and Sciortino for finding all possible simple paths, i.e. ones in which no node is encountered more than once, in a graph, as described in "An Algorithm to Find All Paths between Two Nodes in a Graph". (Although this algorithm is essentially a depth-first search and intuitively recursive in nature, the authors also present a non-recursive, stack-based implementation.) I'd like to know if such an algorithm can be implemented on
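As a reference point, here is a CPU-side C++ sketch of a non-recursive, stack-based enumeration of all simple paths (my own illustration, not the Migliore, Martorana and Sciortino code); the per-level "next neighbor" state it keeps is exactly what any GPU port would need to maintain per worker:

#include <cstdio>
#include <vector>

// Iterative DFS that prints every simple path from src to dst.
void allSimplePaths(const std::vector<std::vector<int>>& adj, int src, int dst) {
    std::vector<int> path{src};                 // current partial path
    std::vector<size_t> next{0};                // next neighbor index per level
    std::vector<bool> onPath(adj.size(), false);
    onPath[src] = true;

    while (!path.empty()) {
        int v = path.back();
        if (next.back() < adj[v].size()) {
            int w = adj[v][next.back()++];
            if (onPath[w]) continue;            // keep the path simple
            if (w == dst) {                     // found a complete path
                for (int u : path) printf("%d -> ", u);
                printf("%d\n", dst);
                continue;
            }
            path.push_back(w);                  // descend one level
            next.push_back(0);
            onPath[w] = true;
        } else {                                // neighbors exhausted: backtrack
            onPath[v] = false;
            path.pop_back();
            next.pop_back();
        }
    }
}

int main() {
    // Small undirected example graph: edges 0-1, 0-2, 1-2, 1-3, 2-3.
    std::vector<std::vector<int>> adj{{1, 2}, {0, 2, 3}, {0, 1, 3}, {1, 2}};
    allSimplePaths(adj, 0, 3);  // prints the four simple paths from 0 to 3
    return 0;
}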

BLAS equivalent of a LAPACK function for GPUs

Submitted by 半城伤御伤魂 on 2021-02-07 15:17:26
Question: In LAPACK there is this subroutine for diagonalization:

SUBROUTINE DSPGVX( ITYPE, JOBZ, RANGE, UPLO, N, AP, BP, VL, VU,
$                  IL, IU, ABSTOL, M, W, Z, LDZ, WORK, IWORK,
$                  IFAIL, INFO )

I am looking for a GPU implementation of it. I am trying to find out whether this function has already been implemented in CUDA (or OpenCL), but have only found CULA, which is not open source. Therefore, and since cuBLAS exists, I wonder how I could know whether a BLAS or cuBLAS equivalent of this subroutine is available.

Answer 1
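The answer above is cut off, so one hedged pointer from my side: cuSOLVER, which ships with CUDA, provides a dense generalized symmetric-definite eigensolver, cusolverDnDsygvd, covering the same A*x = lambda*B*x problem as DSPGVX if you unpack AP/BP into full matrices; packed storage and the VL/VU/IL/IU range selection are not reproduced. A sketch under those assumptions (compile with -lcusolver; the diagonal test matrices are placeholders):

#include <cstdio>
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main() {
    const int n = 4, lda = n, ldb = n;
    double A[n * n] = {0}, B[n * n] = {0}, W[n];
    // Placeholder problem: A diagonal, B = identity (column-major storage).
    for (int i = 0; i < n; ++i) { A[i + i * lda] = i + 1.0; B[i + i * ldb] = 1.0; }

    double *dA, *dB, *dW;
    int *dInfo;
    cudaMalloc(&dA, sizeof(A));
    cudaMalloc(&dB, sizeof(B));
    cudaMalloc(&dW, sizeof(W));
    cudaMalloc(&dInfo, sizeof(int));
    cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(B), cudaMemcpyHostToDevice);

    cusolverDnHandle_t h;
    cusolverDnCreate(&h);

    // Workspace query, then solve A*x = lambda*B*x (ITYPE=1), eigenvectors wanted.
    int lwork;
    cusolverDnDsygvd_bufferSize(h, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_UPPER, n, dA, lda, dB, ldb, dW,
                                &lwork);
    double* dWork;
    cudaMalloc(&dWork, lwork * sizeof(double));
    cusolverDnDsygvd(h, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                     CUBLAS_FILL_MODE_UPPER, n, dA, lda, dB, ldb, dW,
                     dWork, lwork, dInfo);

    cudaMemcpy(W, dW, sizeof(W), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("lambda[%d] = %f\n", i, W[i]);

    cusolverDnDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dW); cudaFree(dInfo); cudaFree(dWork);
    return 0;
}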