CUDA

How can I pull/push data between GPU and CPU in TensorFlow?

夙愿已清 submitted on 2021-02-08 11:27:53
Question: I used a temporary tensor to store data in my customized GPU-based op. For debugging purposes, I want to print the data of this tensor with a traditional printf inside C++. How can I pull this GPU-based tensor to the CPU and then print its contents? Thank you very much. Answer 1: If by temporary you mean allocate_temp instead of allocate_output, there is no way of fetching the data on the Python side. I usually return the tensor itself during debugging so that a simple sess.run fetches the result. Otherwise, …
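
For a quick host-side printf you can also copy the device buffer back yourself with cudaMemcpy. A minimal sketch, assuming you can obtain the raw device pointer from the temporary tensor (in TensorFlow's C++ API, e.g. via tensor.flat<float>().data()); the helper name and call site are hypothetical:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    // Hypothetical debug helper: copy `count` floats from device memory to the
    // host and print them. Call it from the op's Compute() after the kernel ran.
    void debug_print_device_floats(const float* d_data, size_t count) {
        std::vector<float> h_data(count);
        // Synchronous copy on the default stream; it waits for prior GPU work.
        cudaMemcpy(h_data.data(), d_data, count * sizeof(float),
                   cudaMemcpyDeviceToHost);
        for (size_t i = 0; i < count; ++i)
            printf("tmp[%zu] = %f\n", i, h_data[i]);
    }

If the op runs on a non-default stream, synchronize that stream first (or use cudaMemcpyAsync followed by cudaStreamSynchronize) so the printed values are the ones the kernel actually wrote.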

Getting started with shared memory on PyCUDA

大城市里の小女人 submitted on 2021-02-08 10:35:59
Question: I'm trying to understand shared memory by playing with the following code:

    import pycuda.driver as drv
    import pycuda.tools
    import pycuda.autoinit
    import numpy
    from pycuda.compiler import SourceModule

    src='''
    __global__ void reduce0(float *g_idata, float *g_odata) {
        extern __shared__ float sdata[];
        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();
        // do …
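
The snippet is cut off mid-kernel. It matches reduce0, the interleaved-addressing version from NVIDIA's classic parallel-reduction example; a sketch of the complete kernel for reference (the standard version, not copied from the truncated post):

    __global__ void reduce0(float *g_idata, float *g_odata) {
        extern __shared__ float sdata[];
        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // reduction in shared memory, interleaved addressing
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // thread 0 writes this block's partial sum back to global memory
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Since the kernel uses extern __shared__, PyCUDA must be told the shared-memory size at launch time, e.g. via the shared=block_size*4 keyword argument of the kernel call.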

Memory coalescing and nvprof results on NVIDIA Pascal

北城余情 submitted on 2021-02-08 10:16:31
Question: I am running a memory-coalescing experiment on Pascal and getting unexpected nvprof results. I have one kernel that copies 4 GB of floats from one array to another. nvprof reports confusing numbers for gld_transactions_per_request and gst_transactions_per_request. I ran the experiment on a TITAN Xp and a GeForce GTX 1080 Ti; same results.

    #include <stdio.h>
    #include <cstdint>
    #include <assert.h>

    #define N 1ULL*1024*1024*1024
    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); …
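
The listing breaks off inside the error-checking macro. A sketch of the usual gpuAssert boilerplate plus a fully coalesced copy kernel of the kind the question describes (assumed, since the original code is truncated):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    inline void gpuAssert(cudaError_t code, const char *file, int line) {
        if (code != cudaSuccess) {
            fprintf(stderr, "GPUassert: %s %s %d\n",
                    cudaGetErrorString(code), file, line);
            exit(code);
        }
    }

    // Fully coalesced copy: consecutive threads in a warp touch consecutive
    // floats, so each warp-wide 128-byte access needs the minimum number of
    // memory transactions.
    __global__ void copy_kernel(const float *in, float *out) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

For such a kernel the per-request metrics should be small and constant; surprising values on Pascal often relate to nvprof counting 32-byte sectors, so a perfectly coalesced 128-byte warp access can legitimately report several transactions per request.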

How to get cufftComplex magnitude and phase fast

落花浮王杯 submitted on 2021-02-08 10:07:59
Question: I have a cufftComplex data block which is the result of a CUDA FFT (R2C). I know the data is saved as a structure with a real number followed by an imaginary number. Now I want to get the amplitude = sqrt(R*R + I*I) and the phase = arctan(I/R) of each complex element in a fast way (not a for loop). Is there any good way to do that, or any library that could? Answer 1: Since cufftExecR2C operates on data that is on the GPU, the results are already on the GPU (before you copy them back to the host, if you are doing …
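
Along those lines, the conversion can stay on the GPU as one elementwise kernel over the cufftComplex array (cufftComplex is a float2, so .x is the real part and .y the imaginary part). A minimal sketch; the launch configuration and the use of hypotf/atan2f instead of the literal formulas are my choices:

    #include <cufft.h>
    #include <cuda_runtime.h>

    __global__ void mag_phase(const cufftComplex *in,
                              float *mag, float *phase, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float re = in[i].x, im = in[i].y;
            mag[i]   = hypotf(re, im);   // sqrt(re*re + im*im), overflow-safe
            phase[i] = atan2f(im, re);   // four-quadrant arctan, unlike atan(I/R)
        }
    }

    // Launch example:
    // mag_phase<<<(n + 255) / 256, 256>>>(d_fft, d_mag, d_phase, n);

atan2f handles the R = 0 and quadrant cases that a plain arctan(I/R) gets wrong, which is usually what "phase" should mean here.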

How to link the libraries when executing CUDA program on Google Colab?

故事扮演 submitted on 2021-02-08 09:48:20
Question: I'm trying to run a CUDA program that generates random numbers with the cuRAND library on Google Colab, but I am getting a linker error. I know we can fix this by passing -lcurand when compiling with nvcc, but as far as I know we cannot access a terminal in Colab. I'm using this to generate 2*N random numbers:

    #include <curand_kernel.h>

    int status;
    curandGenerator_t gen;
    status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    status |= curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL …
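
On the Colab side, notebook cells can run shell commands with a leading "!", so compiling with "!nvcc prog.cu -o prog -lcurand" works without a terminal. For context, a sketch of how the truncated host-API snippet typically continues (buffer size and omitted error handling are assumptions):

    #include <curand.h>
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 2 * 4096;             // 2*N values; N assumed here
        float *d_nums;
        cudaMalloc(&d_nums, n * sizeof(float));

        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
        curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL);
        curandGenerateUniform(gen, d_nums, n); // uniform floats in (0, 1]

        curandDestroyGenerator(gen);
        cudaFree(d_nums);
        return 0;
    }

Note that the host API lives in curand.h; curand_kernel.h is for generating numbers inside your own kernels.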

CUDA: Why is Thrust so slow at uploading data to the GPU?

梦想与她 submitted on 2021-02-08 09:33:32
Question: I'm new to the GPU world and just installed CUDA to write some programs. I played with the Thrust library but found that it is very slow when uploading data to the GPU: only about 35 MB/s for the host-to-device part on my decent desktop. Why is that? Environment: Visual Studio 2012, CUDA 5.0, GTX 760, Intel i7, Windows 7 x64. GPU bandwidth test: it is supposed to reach at least 11 GB/s of transfer speed from host to device or vice versa, but it didn't! Here's the test program:

    #include <iostream>
    #include …
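
Numbers this far below PCIe bandwidth usually mean the measurement includes something other than the transfer itself: per-element copies of pageable memory, a Debug build, or allocation and context-creation time inside the timed region. A sketch of timing one bulk host-to-device copy with CUDA events and pinned memory (buffer size is an assumption):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 256ull * 1024 * 1024;   // 256 MB test buffer
        float *h_pinned, *d_buf;
        cudaMallocHost(&h_pinned, bytes);            // page-locked host memory
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaFreeHost(h_pinned);
        cudaFree(d_buf);
        return 0;
    }

A thrust::device_vector constructed from a contiguous host array does essentially one bulk copy under the hood, so it should land close to this figure when measured the same way.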

The behavior of stream 0 (default) and other streams

旧巷老猫 submitted on 2021-02-08 09:15:42
Question: In CUDA, how is stream 0 related to other streams? Does stream 0 (the default stream) execute concurrently with other streams in a context or not? Consider the following example:

    cudaMemcpy(Dst, Src, sizeof(float)*datasize, cudaMemcpyHostToDevice); // stream 0
    cudaStream_t stream1;
    /* ...creating stream1... */
    somekernel<<<blocks, threads, 0, stream1>>>(Dst); // stream 1

In the above code, can the compiler ensure that somekernel always launches AFTER cudaMemcpy finishes, or will somekernel execute …
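
In this specific example the ordering holds regardless of streams: cudaMemcpy is synchronous with respect to the host, so the kernel launch is not even issued until the copy completes. The more general rule is that the legacy default stream implicitly synchronizes with streams created by cudaStreamCreate, but not with non-blocking streams. A sketch of the distinction (kernel names are placeholders):

    #include <cuda_runtime.h>

    __global__ void kernelA() {}
    __global__ void kernelB() {}

    int main() {
        cudaStream_t blocking, non_blocking;
        cudaStreamCreate(&blocking);  // synchronizes with the legacy default stream
        cudaStreamCreateWithFlags(&non_blocking, cudaStreamNonBlocking);

        kernelA<<<1, 1>>>();                   // stream 0 (default stream)
        kernelB<<<1, 1, 0, blocking>>>();      // waits until kernelA is done
        kernelB<<<1, 1, 0, non_blocking>>>();  // may overlap with stream-0 work

        cudaDeviceSynchronize();
        cudaStreamDestroy(blocking);
        cudaStreamDestroy(non_blocking);
        return 0;
    }

Compiling with nvcc's --default-stream per-thread option changes these semantics: the default stream then behaves like an ordinary stream with no implicit synchronization.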

PyTorch: Installation

余生长醉 submitted on 2021-02-08 08:50:13
Since the PyTorch website http://pytorch.org/ recommends Anaconda as the package manager, we first need to set up an Anaconda environment.

1. Set up the Anaconda environment. For details, see: http://blog.csdn.net/zhdgk19871218/article/details/46502637

2. Go to the PyTorch website and pick the build that matches your setup.

Check the Python version:

    fenglei@gpu01:~$ python -V
    Python 3.5.2 :: Anaconda 4.2.0 (64-bit)

Check the CUDA version:

    fenglei@gpu01:~$ cat /usr/local/cuda/version.txt
    CUDA Version 8.0.61

Note: make sure you have up-to-date pip and numpy packages.

    fenglei@gpu01:~$ pip install --upgrade pip
    Collecting pip
      Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB)
        100% |████████████████████████████████| 1.3MB 13kB/s
    Installing collected …

Copying host memory to a CUDA __device__ variable

╄→гoц情女王★ submitted on 2021-02-08 08:47:10
Question: I've tried to find a solution to my problem using Google but failed. There were a lot of snippets that didn't fit my case exactly, although I would think it's a pretty standard situation. I have to transfer several different data arrays to CUDA, all of them simple struct arrays of dynamic size. Since I don't want to put everything into the CUDA kernel call, I thought __device__ variables should be exactly what I need. This is how I tried to copy my host data to the __device__ …
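
The usual tool for this is cudaMemcpyToSymbol. Because the arrays are dynamically sized, the __device__ variable should be a pointer that you point at a cudaMalloc'ed buffer, rather than a fixed-size array. A sketch under that assumption (the struct layout and names are made up):

    #include <cuda_runtime.h>

    struct Item { int id; float value; };    // assumed element type

    __device__ Item  *d_items;                // device-side pointer variable
    __device__ size_t d_num_items;

    void upload_items(const Item *h_items, size_t n) {
        Item *buf;
        cudaMalloc(&buf, n * sizeof(Item));
        cudaMemcpy(buf, h_items, n * sizeof(Item), cudaMemcpyHostToDevice);
        // Publish the buffer address and count through the __device__ symbols.
        // Note: the symbol itself is passed as the first argument (not a host
        // address of it), while &buf is a host pointer to the value copied in.
        cudaMemcpyToSymbol(d_items, &buf, sizeof(buf));
        cudaMemcpyToSymbol(d_num_items, &n, sizeof(n));
    }

Kernels can then read d_items and d_num_items directly, without taking them as launch parameters.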