gpu

Transferring textures across adapters in DirectX 11

大兔子大兔子 posted on 2019-12-10 20:54:13
Question: I'm capturing the desktop with the Desktop Duplication API on one GPU and need to copy the texture (which is in GPU memory) to another GPU. To do this I have a capture thread that acquires the desktop image, then copies it to a staging resource (created on the same device) using ID3D11DeviceContext::CopyResource. I then map that staging resource with Read, map the destination dynamic resource (created on the other device) with WriteDiscard, and copy the data. On the rendering thread…
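A minimal sketch of the staging-copy path described above. All resource names are hypothetical and assumed to be created as in the question: the staging texture with D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ on the capture device, the dynamic texture with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE on the render device.

#include <d3d11.h>
#include <cstring>

// Copy a captured desktop texture from one adapter to another via a
// CPU round trip (GPU 0 -> staging -> memcpy -> dynamic texture on GPU 1).
void CopyAcrossAdapters(ID3D11DeviceContext* pCaptureCtx, // immediate context, GPU 0
                        ID3D11DeviceContext* pRenderCtx,  // immediate context, GPU 1
                        ID3D11Texture2D* pDesktopTex,     // acquired desktop image
                        ID3D11Texture2D* pStaging,        // STAGING + CPU_ACCESS_READ
                        ID3D11Texture2D* pDynamic,        // DYNAMIC + CPU_ACCESS_WRITE
                        UINT height)
{
    // GPU-side copy into the CPU-readable staging texture.
    pCaptureCtx->CopyResource(pStaging, pDesktopTex);

    D3D11_MAPPED_SUBRESOURCE src = {}, dst = {};
    if (FAILED(pCaptureCtx->Map(pStaging, 0, D3D11_MAP_READ, 0, &src)))
        return;
    if (FAILED(pRenderCtx->Map(pDynamic, 0, D3D11_MAP_WRITE_DISCARD, 0, &dst))) {
        pCaptureCtx->Unmap(pStaging, 0);
        return;
    }

    // The row pitches of the two resources can differ, so copy row by row.
    const UINT rowBytes = src.RowPitch < dst.RowPitch ? src.RowPitch : dst.RowPitch;
    for (UINT y = 0; y < height; ++y)
        memcpy(static_cast<unsigned char*>(dst.pData) + y * dst.RowPitch,
               static_cast<const unsigned char*>(src.pData) + y * src.RowPitch,
               rowBytes);

    pRenderCtx->Unmap(pDynamic, 0);
    pCaptureCtx->Unmap(pStaging, 0);
}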

Struggling with intuition regarding how warp-synchronous thread execution works

别说谁变了你拦得住时间么 posted on 2019-12-10 20:48:22
Question: I am new to CUDA. I am working through basic parallel algorithms, like reduction, in order to understand how thread execution works. I have the following code:

__global__ void Reduction2_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid; i < N; i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();
    for ( int activeThreads = blockDim.x>>1; …
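The excerpt cuts off inside the second loop. A minimal completed sketch of this style of two-phase shared-memory reduction (the loop completion is mine, not necessarily the asker's exact code; blockDim.x is assumed to be a power of two):

__global__ void Reduction2_kernel(int *out, const int *in, size_t N)
{
    extern __shared__ int sPartials[];
    const int tid = threadIdx.x;
    int sum = 0;
    // Grid-stride loop: each thread accumulates a private partial sum.
    for (size_t i = blockIdx.x * blockDim.x + tid; i < N; i += blockDim.x * gridDim.x)
        sum += in[i];
    sPartials[tid] = sum;
    __syncthreads();
    // Tree reduction in shared memory. Note the __syncthreads() on every
    // iteration, even once activeThreads <= 32.
    for (int activeThreads = blockDim.x >> 1; activeThreads > 0; activeThreads >>= 1) {
        if (tid < activeThreads)
            sPartials[tid] += sPartials[tid + activeThreads];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sPartials[0];
}

Dropping the barrier for the last 32 threads on the assumption that a warp runs in lockstep is exactly the "warp-synchronous" intuition the question asks about; it is no longer safe on GPUs with independent thread scheduling (Volta and later).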

OpenCL: basic questions about SIMT execution model

微笑、不失礼 posted on 2019-12-10 20:25:17
Question: Some of the concepts and design of the "SIMT" architecture are still unclear to me. From what I've seen and read, diverging code paths and if() statements altogether are a rather bad idea, because many threads might execute in lockstep. Now what does that exactly mean? What about something like:

kernel void foo(..., int flag)
{
    if (flag)
        DO_STUFF
    else
        DO_SOMETHING_ELSE
}

The parameter "flag" is the same for all work units, and the same branch is taken for all work units. Now, is a GPU going to execute…
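In CUDA terms (the same reasoning applies to OpenCL work-groups), a sketch of the two cases with hypothetical kernels: when the condition is uniform across a warp, all threads take the same path and there is no divergence penalty beyond the branch itself; when the condition varies within a warp, the hardware executes both paths one after the other with inactive lanes masked off.

__global__ void uniform_branch(float *data, int flag)
{
    // 'flag' is the same for every thread, so the whole warp takes
    // one path: no divergence, just an ordinary branch.
    if (flag)
        data[threadIdx.x] *= 2.0f;
    else
        data[threadIdx.x] += 1.0f;
}

__global__ void divergent_branch(float *data)
{
    // The condition differs between lanes of the same warp, so the
    // warp runs the 'if' side and then the 'else' side serially,
    // masking off the threads that did not take each path.
    if (threadIdx.x % 2 == 0)
        data[threadIdx.x] *= 2.0f;
    else
        data[threadIdx.x] += 1.0f;
}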

performance - drawing many 2d circles in opengl

故事扮演 posted on 2019-12-10 20:11:52
Question: I am trying to draw large numbers of 2D circles for my 2D games in OpenGL. They are all the same size and have the same texture, and many of the sprites overlap. What would be the fastest way to do this? An example of the kind of effect I'm making: http://img805.imageshack.us/img805/6379/circles.png (It should be noted that the black edges are just due to the expanding explosion of circles; it was filled in a moment after this screenshot was taken.) At the moment I am using a pair of textured…
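A common answer to this kind of question is to batch every circle into a single instanced draw call instead of issuing one draw per sprite. A rough C++ sketch, assuming OpenGL 3.3+, an already-bound shader, and a unit-quad VBO on attribute 0; instanceVbo and positions are hypothetical names:

#include <GL/glew.h>
#include <vector>

// Draw all circles with one instanced call: one shared unit quad,
// plus one vec2 center per circle streamed into 'instanceVbo'.
void drawCircles(GLuint instanceVbo, const std::vector<float>& positions /* x,y pairs */)
{
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glBufferData(GL_ARRAY_BUFFER,
                 positions.size() * sizeof(float),
                 positions.data(), GL_DYNAMIC_DRAW);

    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 2 * sizeof(float), (void*)0);
    glVertexAttribDivisor(1, 1);   // advance this attribute per instance

    // 4 vertices of the textured quad, one instance per circle.
    glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4,
                          (GLsizei)(positions.size() / 2));
}

This needs OpenGL 3.3 or the ARB_instanced_arrays extension; point sprites (GL_POINTS with gl_PointSize) are another option that avoids per-quad vertices entirely.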

CUDA performance of atomic operation on different address in warp

橙三吉。 posted on 2019-12-10 19:55:45
Question: To my knowledge, if atomic operations are performed on the same memory address by the threads of a warp, the warp's performance could be 32 times slower. But what if the atomic operations of the threads in a warp are on 32 different memory locations? Is there any performance penalty at all, or will it be as fast as a normal operation? My use case is that I have 32 different positions, and each thread in a warp needs one of these positions, but which position is data dependent. So each thread could use atomicCAS…
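A sketch of the two access patterns with hypothetical kernels: atomics from one warp to a single location are serialized by the hardware, while atomics to 32 distinct locations have no intra-warp contention and behave much more like ordinary memory operations.

__global__ void atomics_same_address(int *counter)
{
    // All 32 lanes of a warp update one address: the hardware must
    // serialize the updates, so this is the worst case.
    atomicAdd(counter, 1);
}

__global__ void atomics_distinct_addresses(int *counters)
{
    // Each lane updates its own address: no intra-warp contention,
    // so the cost is close to a normal memory access.
    const int lane = threadIdx.x % 32;
    atomicAdd(&counters[lane], 1);
}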

Get temperature from NVidia GPU using NVAPI

余生长醉 posted on 2019-12-10 19:55:10
Question: I have been trying for the last few days to get the temperature of my GPU in C++ using NVAPI. I have the following code:

#include "stdafx.h"
#include "nvapi.h"

int _tmain(int argc, _TCHAR* argv[])
{
    NvAPI_Status ret = NVAPI_OK;
    int i=0;
    NvDisplayHandle hDisplay_a[NVAPI_MAX_PHYSICAL_GPUS*2] = {0};
    ret = NvAPI_Initialize();
    if (!ret == NVAPI_OK){
        NvAPI_ShortString string;
        NvAPI_GetErrorMessage(ret, string);
        printf("NVAPI NvAPI_Initialize: %s\n", string);
    }
    NvAPI_ShortString ver;
    NvAPI…
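For reference, a minimal sketch of the usual temperature-reading path in NVAPI: enumerate the physical GPUs, then query NvAPI_GPU_GetThermalSettings. The constant and field names here are taken from the public NVAPI headers as I recall them, so treat them as assumptions to verify against your SDK version.

#include <stdio.h>
#include "nvapi.h"

int main()
{
    if (NvAPI_Initialize() != NVAPI_OK)
        return 1;

    NvPhysicalGpuHandle gpus[NVAPI_MAX_PHYSICAL_GPUS] = {0};
    NvU32 gpuCount = 0;
    if (NvAPI_EnumPhysicalGPUs(gpus, &gpuCount) != NVAPI_OK)
        return 1;

    for (NvU32 i = 0; i < gpuCount; ++i) {
        NV_GPU_THERMAL_SETTINGS thermal = {0};
        thermal.version = NV_GPU_THERMAL_SETTINGS_VER;  // must be set before the call
        if (NvAPI_GPU_GetThermalSettings(gpus[i], NVAPI_THERMAL_TARGET_ALL,
                                         &thermal) == NVAPI_OK) {
            for (NvU32 s = 0; s < thermal.count; ++s)
                printf("GPU %u sensor %u: %d C\n", i, s,
                       (int)thermal.sensor[s].currentTemp);
        }
    }
    return 0;
}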

Device memory flush cuda

微笑、不失礼 posted on 2019-12-10 18:57:03
Question: I'm running a C program where I call a CUDA host function twice. I want to clean up the device memory between these two calls. Is there a way I can flush GPU device memory? I'm on a Tesla M2050 with compute capability 2.0.

Answer 1: If you only want to zero the memory, then cudaMemset is probably the simplest way to do this. For example:

const int n = 10000000;
const int sz = sizeof(float) * n;
float *devicemem;
cudaMalloc((void **)&devicemem, sz);
kernel<<<...>>>(devicemem,....);
cudaMemset…
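The answer is truncated; a minimal completed sketch of the pattern it describes, zeroing the allocation between the two kernel calls (the kernel body and launch configuration are placeholders of mine):

#include <cuda_runtime.h>

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // placeholder work
}

int main()
{
    const int n = 10000000;
    const int sz = sizeof(float) * n;
    float *devicemem;
    cudaMalloc((void **)&devicemem, sz);

    kernel<<<(n + 255) / 256, 256>>>(devicemem, n);   // first call
    cudaMemset(devicemem, 0, sz);                     // zero the buffer between calls
    kernel<<<(n + 255) / 256, 256>>>(devicemem, n);   // second call

    cudaFree(devicemem);   // or cudaDeviceReset() to tear the context down entirely
    return 0;
}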

CUDA-transfer 2D array from host to device

南楼画角 posted on 2019-12-10 18:33:55
Question: I have a 2D matrix in main. I want to transfer it from host to device. Can you tell me how I can allocate memory for it and transfer it to device memory?

#define N 5
__global__ void kernel(int a[N][N]){
}

int main(void){
    int a[N][N];
    cudaMalloc(?);
    cudaMemcpy(?);
    kernel<<<N,N>>>(?);
}

Answer 1: Perhaps something like this is what you really had in mind:

#define N 5
__global__ void kernel(int *a)
{
    // Thread indexing within Grid - note these are
    // in column major order.
    int tidx =…
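The answer cuts off; a minimal complete sketch of the usual approach (my completion, not necessarily the answerer's exact code): a statically sized int a[N][N] is contiguous in host memory, so it can be copied with a single cudaMemcpy and indexed as a flattened 1D array on the device.

#include <cstdio>
#include <cuda_runtime.h>

#define N 5

__global__ void kernel(int *a)
{
    int row = blockIdx.x;   // one block per row with this launch config
    int col = threadIdx.x;  // one thread per column
    if (row < N && col < N)
        a[row * N + col] += 1;  // index the flattened 2D array
}

int main(void)
{
    int a[N][N] = {0};
    int *d_a;

    // int a[N][N] is one contiguous block, so a single copy suffices.
    cudaMalloc((void **)&d_a, N * N * sizeof(int));
    cudaMemcpy(d_a, a, N * N * sizeof(int), cudaMemcpyHostToDevice);

    kernel<<<N, N>>>(d_a);

    cudaMemcpy(a, d_a, N * N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    printf("a[0][0] = %d\n", a[0][0]);
    return 0;
}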

Jupyter Notebook - GPU

北战南征 posted on 2019-12-10 18:14:18
Question: I'm working on a Jupyter Notebook and would like to make it run faster by using a Google GPU. I've already done some research and found a solution, but it didn't work for me. The solution was: "The easiest way to do this is to connect to a Local Runtime, then select the hardware accelerator as GPU, as shown in the Google Colab Free GPU Tutorial." I did manage to connect Google Colab to Jupyter, but when I then try to switch the hardware accelerator to GPU, I get disconnected from my Jupyter notebook... In the…

Import error when trying to import tensorflow with gpu

那年仲夏 posted on 2019-12-10 18:07:15
Question: ImportError: libcuda.so.1: cannot open shared object file: No such file or directory. Failed to load the native TensorFlow runtime. This error appears when importing tensorflow. I need to know the steps to solve this problem.

Answer 1: If you are using TensorFlow with GPU, you need to install CUDA and cuDNN. Please follow the instructions at https://www.tensorflow.org/install/ If you have already installed CUDA and cuDNN but still get this error, then you probably forgot to export your libraries: for…
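As a quick way to check whether the CUDA driver library is visible to the loader at all, independently of TensorFlow, a minimal CUDA runtime probe (my diagnostic sketch, not part of the original answer): if libcuda.so.1 cannot be found or loaded, this fails too, which isolates the problem to the driver/CUDA install rather than the TensorFlow package.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // A failure here (e.g. a driver/library mismatch) points at the
        // CUDA installation rather than at TensorFlow itself.
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", count);
    return 0;
}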