gpu

Memory coalescing and nvprof results on NVIDIA Pascal

Submitted by 北城余情 on 2021-02-08 10:16:31
Question: I am running a memory coalescing experiment on Pascal and getting unexpected nvprof results. I have one kernel that copies 4 GB of floats from one array to another. nvprof reports confusing numbers for gld_transactions_per_request and gst_transactions_per_request. I ran the experiment on a TITAN Xp and a GeForce GTX 1080 Ti, with the same results.

#include <stdio.h>
#include <cstdint>
#include <assert.h>
#define N 1ULL*1024*1024*1024
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__);
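For context, a minimal sketch of the kind of copy kernel being benchmarked (the kernel name and launch configuration are my own illustration, not the asker's code):

#include <cstdint>

// Grid-stride copy: consecutive threads in a warp touch consecutive
// 4-byte addresses, so global loads and stores should fully coalesce.
__global__ void copyKernel(const float* __restrict__ src,
                           float* __restrict__ dst, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        dst[i] = src[i];
    }
}

int main() {
    const size_t n = 1ULL * 1024 * 1024 * 1024;  // 1 Gi floats = 4 GB per array
    float *src, *dst;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    copyKernel<<<1024, 256>>>(src, dst, n);
    cudaDeviceSynchronize();
    cudaFree(src);
    cudaFree(dst);
    return 0;
}

One hedged note on interpreting the metrics: Pascal-class GPUs account for L2/DRAM traffic in 32-byte sectors, so per-request transaction counts need not match Kepler-era intuition; checking nvprof's metric definitions for the specific architecture is usually the first step.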

stdatomic.h not found, for use in Swift & Metal compute shader

Submitted by 青春壹個敷衍的年華 on 2021-02-08 10:10:55
Question: I'm trying to use a struct with an atomic_int in a Metal compute shader. However, it says I need to #include "stdatomic.h", but every time I try, the file cannot be found.

#include "stdatomic.h" // 'stdatomic.h' file not found

I'm trying to build my application for macOS Catalina.

struct Fitness {
    atomic_int weight; // Declaration of 'atomic_int' must be imported from module 'Darwin.C.stdatomic' before it is required
    ...others...
};

I have tried placing a copy of stdatomic.h into

CUDA: Why is Thrust so slow at uploading data to the GPU?

Submitted by 梦想与她 on 2021-02-08 09:33:32
Question: I'm new to the GPU world and have just installed CUDA to write some programs. I played with the Thrust library but found that it is very slow when uploading data to the GPU: only about 35 MB/s for the host-to-device part on my decent desktop. Why is that?

Environment: Visual Studio 2012, CUDA 5.0, GTX 760, Intel i7, Windows 7 x64

GPU bandwidth test: it is supposed to reach at least 11 GB/s of transfer speed from host to device (or vice versa), but it didn't! Here's the test program:

#include <iostream>
#include
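The question is cut off above, but the classic causes of a reading this low are timing a Debug build, including CUDA context creation in the measured interval, or copying from ordinary pageable host memory. A minimal sketch of a pageable-vs-pinned comparison (buffer size and all names are mine, not the asker's benchmark):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ULL * 1024 * 1024;  // 256 MB test buffer
    float* dev;
    cudaMalloc(&dev, bytes);

    std::vector<float> pageable(bytes / sizeof(float));  // ordinary heap memory
    float* pinned;
    cudaMallocHost(&pinned, bytes);  // page-locked memory the driver can DMA from

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Host-to-device copy from pageable memory (extra staging copy in the driver).
    cudaEventRecord(start);
    cudaMemcpy(dev, pageable.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f GB/s\n", bytes / (ms * 1e6));

    // Host-to-device copy from pinned memory (direct DMA, usually much faster).
    cudaEventRecord(start);
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f GB/s\n", bytes / (ms * 1e6));

    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}

Thrust's device_vector constructor copies from whatever host memory you hand it, so using pinned host buffers (and timing only the copy, in a Release build) is the usual fix.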

The behavior of stream 0 (default) and other streams

Submitted by 旧巷老猫 on 2021-02-08 09:15:42
Question: In CUDA, how is stream 0 related to other streams? Does stream 0 (the default stream) execute concurrently with other streams in a context or not? Consider the following example:

cudaMemcpy(Dst, Src, sizeof(float)*datasize, cudaMemcpyHostToDevice); // stream 0
cudaStream_t stream1;
/* ...creating stream1... */
somekernel<<<blocks, threads, 0, stream1>>>(Dst); // stream 1

In the above code, can the compiler ensure somekernel always launches AFTER cudaMemcpy finishes, or will somekernel execute
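Assuming the legacy default-stream semantics (no --default-stream per-thread nvcc flag), stream 0 is a blocking stream: it does not run concurrently with other blocking streams, so somekernel cannot start before the cudaMemcpy has completed, and cudaMemcpy additionally blocks the host until the copy finishes. A hedged sketch contrasting the implicit and explicit orderings (sizes, kernel body and launch configuration are my own):

#include <cuda_runtime.h>

__global__ void somekernel(float* p) { p[threadIdx.x] = 0.0f; }  // placeholder body

int main() {
    const size_t n = 1 << 20;
    float *Dst, *Src;
    cudaMalloc(&Dst, n * sizeof(float));
    cudaMallocHost(&Src, n * sizeof(float));  // pinned, required for truly async copies

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    // Implicit ordering: the legacy default stream blocks the device,
    // so the stream1 kernel waits for this copy to finish.
    cudaMemcpy(Dst, Src, n * sizeof(float), cudaMemcpyHostToDevice);
    somekernel<<<256, 256, 0, stream1>>>(Dst);

    // Explicit ordering: issue the copy in stream1 itself, so the
    // dependency no longer relies on default-stream semantics.
    cudaMemcpyAsync(Dst, Src, n * sizeof(float), cudaMemcpyHostToDevice, stream1);
    somekernel<<<256, 256, 0, stream1>>>(Dst);

    cudaStreamSynchronize(stream1);
    cudaStreamDestroy(stream1);
    cudaFreeHost(Src);
    cudaFree(Dst);
    return 0;
}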

Access GPU hardware specifications in Python?

Submitted by 随声附和 on 2021-02-08 08:31:35
Question: I want to access various NVIDIA GPU specifications using Numba or a similar Python CUDA package: information such as available device memory, L2 cache size, memory clock frequency, etc. From reading this question, I learned I can access some of the information (but not all) through Numba's CUDA device interface.

from numba import cuda

device = cuda.get_current_device()
attribs = [s for s in dir(device) if s.isupper()]
for attr in attribs:
    print(attr, '=', getattr(device, attr))

Output on a
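As far as I know, Numba surfaces these values from CUDA's underlying device-attribute query, so a CUDA C++ sketch of the equivalent lookups shows what is and is not reachable by that route; the attribute enums below are real runtime-API names, but the program itself is only an illustration (device index 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int l2Bytes, memClockKHz, busWidthBits;
    cudaDeviceGetAttribute(&l2Bytes, cudaDevAttrL2CacheSize, 0);
    cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, 0);
    cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0);

    // Free memory is a runtime query, not a device attribute, which is
    // why it does not show up in an attribute listing like Numba's.
    size_t freeB, totalB;
    cudaMemGetInfo(&freeB, &totalB);

    printf("L2 cache: %d bytes\n", l2Bytes);
    printf("memory clock: %d kHz, bus width: %d bits\n", memClockKHz, busWidthBits);
    printf("memory: %zu free / %zu total bytes\n", freeB, totalB);
    return 0;
}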

Cannot import multi_gpu_model from keras.utils

Submitted by 狂风中的少年 on 2021-02-08 03:28:05
Question: I have tensorflow-gpu 1.2.1 and Keras on Ubuntu 16.04. I am not able to perform:

from keras.utils import multi_gpu_model

Has anyone had success with multi_gpu_model as described in their documentation's FAQ section? I have a 4-GPU machine with four GeForce GTX 1080 Ti cards and want to use all of them. Here's the error I get:

import keras.utils.multi_gpu_model
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)

Switch CUDA compute mode to default mode

Submitted by 橙三吉。 on 2021-02-07 23:12:12
Question: I use nvidia-smi to see the status of each GPU on a computing node, but I find one of them is in E. Thread mode. Is there an easy way to switch it back to the default mode?

+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|======================
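E. Thread is the EXCLUSIVE_THREAD compute mode; switching back is done with nvidia-smi and needs administrator rights. For completeness, a small CUDA C++ sketch that checks the current mode from the runtime (device index 0 is an assumption of mine):

#include <cstdio>
#include <cuda_runtime.h>

// The actual switch is a driver-tool command, e.g.:
//   sudo nvidia-smi -i 0 -c DEFAULT
// This program only verifies what mode the runtime currently sees.
int main() {
    int mode;
    cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    // 0 = Default, 1 = Exclusive (thread), 2 = Prohibited, 3 = Exclusive Process
    printf("compute mode of device 0: %d\n", mode);
    return 0;
}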

Login password required to access Jupyter notebook running in nvidia-docker container

Submitted by 為{幸葍}努か on 2021-02-07 21:51:12
Question: I ran the following commands, in order, to run TensorFlow in a Docker container after a successful installation on Ubuntu 16.04 (NVIDIA GeForce 840M GPU):

1. sudo service docker start
2. sudo nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu

Then I try to access Jupyter in the Firefox browser by typing localhost:8888, and I am asked to enter a login password in the browser. What is the solution?

Answer 1: Add the option "-e PASSWORD=password" to set the environment variable.

GPU-based search for all possible paths between two nodes on a graph

Submitted by 牧云@^-^@ on 2021-02-07 17:23:31
Question: My work makes extensive use of the algorithm by Migliore, Martorana and Sciortino for finding all possible simple paths, i.e. ones in which no node is encountered more than once, in a graph, as described in "An Algorithm to Find All Paths between Two Nodes in a Graph". (Although this algorithm is essentially a depth-first search and intuitively recursive in nature, the authors also present a non-recursive, stack-based implementation.) I'd like to know if such an algorithm can be implemented on
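As a reference point, here is a CPU-side C++ sketch of a non-recursive, stack-based enumeration of all simple paths (my own illustration, not the Migliore, Martorana and Sciortino code); the per-level "next neighbor" state it keeps is exactly what any GPU port would need to maintain per worker:

#include <cstdio>
#include <vector>

// Iterative DFS that prints every simple path from src to dst.
void allSimplePaths(const std::vector<std::vector<int>>& adj, int src, int dst) {
    std::vector<int> path{src};                 // current partial path
    std::vector<size_t> next{0};                // next neighbor index per level
    std::vector<bool> onPath(adj.size(), false);
    onPath[src] = true;

    while (!path.empty()) {
        int v = path.back();
        if (next.back() < adj[v].size()) {
            int w = adj[v][next.back()++];
            if (onPath[w]) continue;            // keep the path simple
            if (w == dst) {                     // found a complete path
                for (int u : path) printf("%d -> ", u);
                printf("%d\n", dst);
                continue;
            }
            path.push_back(w);                  // descend one level
            next.push_back(0);
            onPath[w] = true;
        } else {                                // neighbors exhausted: backtrack
            onPath[v] = false;
            path.pop_back();
            next.pop_back();
        }
    }
}

int main() {
    // Small undirected example graph: edges 0-1, 0-2, 1-2, 1-3, 2-3.
    std::vector<std::vector<int>> adj{{1, 2}, {0, 2, 3}, {0, 1, 3}, {1, 2}};
    allSimplePaths(adj, 0, 3);  // prints the four simple paths from 0 to 3
    return 0;
}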

BLAS equivalent of a LAPACK function for GPUs

Submitted by 半城伤御伤魂 on 2021-02-07 15:17:26
Question: In LAPACK there is this subroutine for diagonalization:

SUBROUTINE DSPGVX( ITYPE, JOBZ, RANGE, UPLO, N, AP, BP, VL, VU,
$                  IL, IU, ABSTOL, M, W, Z, LDZ, WORK, IWORK,
$                  IFAIL, INFO )

I am looking for a GPU implementation of it. I am trying to find out whether this function has already been implemented in CUDA (or OpenCL), but have only found CULA, which is not open source. Therefore, and since cuBLAS exists, I wonder how I could know whether a BLAS or cuBLAS equivalent of this subroutine is available.

Answer 1
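The answer above is cut off, so one hedged pointer from my side: cuSOLVER, which ships with CUDA, provides a dense generalized symmetric-definite eigensolver, cusolverDnDsygvd, covering the same A*x = lambda*B*x problem as DSPGVX if you unpack AP/BP into full matrices; packed storage and the VL/VU/IL/IU range selection are not reproduced. A sketch under those assumptions (compile with -lcusolver; the diagonal test matrices are placeholders):

#include <cstdio>
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main() {
    const int n = 4, lda = n, ldb = n;
    double A[n * n] = {0}, B[n * n] = {0}, W[n];
    // Placeholder problem: A diagonal, B = identity (column-major storage).
    for (int i = 0; i < n; ++i) { A[i + i * lda] = i + 1.0; B[i + i * ldb] = 1.0; }

    double *dA, *dB, *dW;
    int *dInfo;
    cudaMalloc(&dA, sizeof(A));
    cudaMalloc(&dB, sizeof(B));
    cudaMalloc(&dW, sizeof(W));
    cudaMalloc(&dInfo, sizeof(int));
    cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(B), cudaMemcpyHostToDevice);

    cusolverDnHandle_t h;
    cusolverDnCreate(&h);

    // Workspace query, then solve A*x = lambda*B*x (ITYPE=1), eigenvectors wanted.
    int lwork;
    cusolverDnDsygvd_bufferSize(h, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_UPPER, n, dA, lda, dB, ldb, dW,
                                &lwork);
    double* dWork;
    cudaMalloc(&dWork, lwork * sizeof(double));
    cusolverDnDsygvd(h, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                     CUBLAS_FILL_MODE_UPPER, n, dA, lda, dB, ldb, dW,
                     dWork, lwork, dInfo);

    cudaMemcpy(W, dW, sizeof(W), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("lambda[%d] = %f\n", i, W[i]);

    cusolverDnDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dW); cudaFree(dInfo); cudaFree(dWork);
    return 0;
}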