gpgpu

Does NVidia support OpenCL SPIR?

Submitted by 眉间皱痕 on 2019-12-30 08:27:17
Question: I am wondering whether Nvidia supports a SPIR backend or not. If so, I could not find any documentation or sample examples about it; if not, is there any way to get a SPIR backend working on Nvidia GPUs? Thanks in advance. Answer 1: Since SPIR builds on top of OpenCL 1.2, and so far Nvidia has not made any OpenCL 1.2 drivers available, it is not possible to use SPIR with Nvidia GPUs. As mentioned in the comments, Nvidia has made PTX available as an intermediate language (also based on LLVM IR).
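
As a complement to the answer: whether a particular OpenCL driver can consume SPIR can be checked at run time by looking for the cl_khr_spir extension string. Below is a minimal sketch in plain OpenCL host code (error handling omitted; it only inspects the first GPU of the first platform, which is an assumption for brevity):

```cpp
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char extensions[4096] = {0};

    // Grab the first platform and its first GPU; a real program would enumerate all of them.
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);

    // SPIR consumption is advertised through the cl_khr_spir extension (requires OpenCL 1.2).
    if (strstr(extensions, "cl_khr_spir"))
        printf("cl_khr_spir present: SPIR binaries can be fed to clCreateProgramWithBinary.\n");
    else
        printf("No cl_khr_spir: on Nvidia, PTX is the available intermediate representation instead.\n");
    return 0;
}
```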

Python real time image classification problems with Neural Networks

Submitted by 吃可爱长大的小学妹 on 2019-12-30 01:36:14
Question: I'm attempting to use Caffe and Python to do real-time image classification. I'm using OpenCV to stream from my webcam in one process and, in a separate process, using Caffe to perform image classification on the frames pulled from the webcam. Then I'm passing the classification result back to the main thread to caption the webcam stream. The problem is that even though I have an NVIDIA GPU and am performing the Caffe predictions on the GPU, the main thread gets slowed down. Normally

Matrix-vector multiplication in CUDA: benchmarking & performance

Submitted by 烈酒焚心 on 2019-12-29 04:00:23
Question: I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific and I updated the code)... I implemented a kernel for matrix-vector multiplication in CUDA C, following the CUDA C Programming Guide, using shared memory. Let me first present some benchmark results obtained on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2) and a comparison with cuBLAS: Here I guess cuBLAS does some magic, since it seems that its execution is not affected
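
For readers who want a baseline to benchmark against, here is a hedged sketch of the comparison setup: a naive row-per-thread kernel (deliberately not the questioner's shared-memory version) next to the cuBLAS SGEMV call, both assuming column-major storage as cuBLAS expects. The function names are illustrative.

```cuda
#include <cublas_v2.h>

// One thread per row: y = A * x, with A stored column-major (m x n).
__global__ void matvec_naive(const float *A, const float *x, float *y, int m, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[col * m + row] * x[col];   // column-major indexing
    y[row] = sum;
}

// The cuBLAS equivalent for the comparison: y = 1.0 * A * x + 0.0 * y.
void matvec_cublas(cublasHandle_t handle, const float *dA, const float *dx, float *dy, int m, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
}
```

Timing both with CUDA events over many repetitions, rather than a single launch, is what usually makes the cuBLAS "magic" (tuned memory access and latency hiding) visible.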

How to create or manipulate GPU assembler?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-29 03:10:07
Question: Does anyone have experience in creating/manipulating GPU machine code, possibly at run time? I am interested in modifying GPU assembler code, possibly at run time with minimal overhead. Specifically, I'm interested in assembler-based genetic programming. I understand ATI has released ISAs for some of their cards, and Nvidia recently released a disassembler for CUDA for older cards, but I am not sure if it is possible to modify instructions in memory at run time or even beforehand. Is this
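
On the Nvidia side there is, to my knowledge, no supported way to patch the native machine code (SASS) in memory; the practical route is one level up, at PTX: inline PTX can be embedded in CUDA C, PTX generated at run time can be JIT-compiled through the driver API (cuModuleLoadData), and cuobjdump/nvdisasm show the final machine code. A small sketch of inline PTX (kernel and variable names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds two integers through a hand-written PTX instruction embedded in CUDA C.
__global__ void add_via_ptx(const int *a, const int *b, int *out)
{
    int r;
    asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a[0]), "r"(b[0]));
    out[0] = r;
}

int main(void)
{
    int h_a = 2, h_b = 3, h_out = 0;
    int *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);

    add_via_ptx<<<1, 1>>>(d_a, d_b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 3 = %d\n", h_out);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```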

CUDA Block and Grid size efficiencies

Submitted by 北城余情 on 2019-12-29 03:10:07
Question: What is the advised way of dealing with dynamically sized datasets in CUDA? Is it a case of 'set the block and grid sizes based on the problem set', or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the overspill? I can see how this probably matters a lot for the block dimensions, but how much does this matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks assigned to SMs
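
The common pattern for dynamically sized data is to keep the block size fixed (a multiple of the warp size), round the grid up, and guard against the overspill inside the kernel. A minimal sketch; the kernel and launcher names are illustrative:

```cuda
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // in-kernel guard for the rounded-up launch
        data[i] *= factor;
}

void launch_scale(float *d_data, int n, float factor)
{
    const int block = 256;                      // a multiple of the warp size (32)
    const int grid  = (n + block - 1) / block;  // round up so every element is covered
    scale<<<grid, block>>>(d_data, n, factor);
}
```

The grid dimension mostly just has to be large enough (and below the hardware launch limits); occupancy and coalescing are decided by the block shape, which is why the round-up-plus-guard idiom is usually sufficient.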

cuBLAS argmin - segfault if outputting to device memory?

Submitted by 谁说胖子不能爱 on 2019-12-29 01:40:09
Question: In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result) The cuBLAS programmer guide provides this information about the cublasIsamin() parameters: If I use host (CPU) memory for result, then cublasIsamin works properly. Here's an example: void argmin_experiment_hostOutput(){ float h_A[4] = {1, 2, 3, 4}; int N = 4; float* d_A = 0; CHECK_CUDART
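
If the crash happens because result points to device memory, the likely missing piece is the cuBLAS pointer mode: by default the library writes scalar results through a host pointer, so handing it a device pointer makes it segfault. A hedged sketch of the device-output variant (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    float h_A[4] = {4, 3, 2, 1};
    float *d_A = 0;
    int *d_result = 0;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Tell cuBLAS that scalar outputs (here: the index) live in device memory.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamin(handle, 4, d_A, 1, d_result);

    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("index of min |x| (1-based): %d\n", h_result);

    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_result);
    return 0;
}
```

Note that the returned index follows the 1-based Fortran convention, whatever the pointer mode.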

Doing readback from Direct3D textures and surfaces

Submitted by 六眼飞鱼酱① on 2019-12-28 12:08:28
Question: I need to figure out how to get the data from D3D textures and surfaces back to system memory. What's the fastest way to do such things, and how? Also, if I only need one subrect, how can one read back only that portion without having to read back the entire thing to system memory? In short, I'm looking for concise descriptions of how to copy the following to system memory: a texture, a subset of a texture, a surface, a subset of a surface, a D3DUSAGE_RENDERTARGET texture, a subset of a D3DUSAGE
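
One standard Direct3D 9 recipe, sketched below under the assumption of a render-target surface: copy the surface into a D3DPOOL_SYSTEMMEM offscreen plain surface with GetRenderTargetData, then LockRect the staging surface, passing a RECT if only a subrectangle is needed (the GPU-to-CPU copy itself is always the whole surface). Names are illustrative and error handling is omitted.

```cpp
#include <d3d9.h>

// Reads a sub-rectangle of a render-target surface back into system memory (D3D9).
void ReadBackSubRect(IDirect3DDevice9 *device, IDirect3DSurface9 *renderTarget,
                     UINT width, UINT height, D3DFORMAT format, const RECT &subRect)
{
    IDirect3DSurface9 *sysmem = NULL;
    // The staging surface must match the source surface's size and format.
    device->CreateOffscreenPlainSurface(width, height, format,
                                        D3DPOOL_SYSTEMMEM, &sysmem, NULL);

    // GPU -> system-memory copy of the whole surface (this call has no sub-rect variant).
    device->GetRenderTargetData(renderTarget, sysmem);

    // Lock only the portion of interest; rows are locked.Pitch bytes apart, not width * bpp.
    D3DLOCKED_RECT locked;
    sysmem->LockRect(&locked, &subRect, D3DLOCK_READONLY);
    // ... read the pixels from locked.pBits row by row ...
    sysmem->UnlockRect();
    sysmem->Release();
}
```

For a texture, fetch the level first with IDirect3DTexture9::GetSurfaceLevel and feed that surface to the same routine; lockable textures in D3DPOOL_MANAGED or D3DPOOL_SYSTEMMEM can instead be locked directly.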

Modifying registry to increase GPU timeout, Windows 7

Submitted by 廉价感情. on 2019-12-27 11:45:42
Question: I'm trying to increase the timeout on the GPU from its default setting of 2 seconds to something a little longer. I found the following link, but it appears it's slightly different in Windows 7, as I can't see anything mentioned on the web page. Has anyone done this before? If so, could you fill in the gaps please. Thanks @RoBik, so as follows if I want 6 days (a bit excessive, I know, but just for example)? Thanks again for your help, +1. EDIT This is the error I'm currently getting. An error has occurred
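
For reference, on Windows Vista/7 the timeout is controlled by the TDR (Timeout Detection and Recovery) values under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers; the relevant DWORD is TdrDelay, in seconds, with a default of 2, and a reboot is needed for the driver to pick the change up. A hedged sketch that sets it programmatically (requires administrator rights; editing the value in regedit achieves the same thing):

```cpp
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HKEY key;
    // Key that holds the TDR (Timeout Detection and Recovery) settings.
    LONG rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE,
                              "SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                              0, NULL, 0, KEY_SET_VALUE, NULL, &key, NULL);
    if (rc != ERROR_SUCCESS) { printf("open failed: %ld\n", rc); return 1; }

    // 6 days, as in the example above: 6 * 24 * 3600 seconds (the default is 2 seconds).
    DWORD seconds = 518400;
    rc = RegSetValueExA(key, "TdrDelay", 0, REG_DWORD,
                        (const BYTE *)&seconds, sizeof(seconds));
    if (rc != ERROR_SUCCESS) printf("write failed: %ld\n", rc);

    RegCloseKey(key);
    return 0;   // reboot afterwards for the new timeout to take effect
}
```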

Concurrent Kernel Launch Example - CUDA

Submitted by 扶醉桌前 on 2019-12-25 07:11:08
问题 I'm attempting to implement concurrent kernel launches for a very complex CUDA kernel, so I thought I'd start out with a simple example. It just launches a kernel which does a sum reduction. Simple enough. Here it is: #include <stdlib.h> #include <stdio.h> #include <time.h> #include <cuda.h> extern __shared__ char dsmem[]; __device__ double *scratch_space; __device__ double NDreduceSum(double *a, unsigned short length) { const int tid = threadIdx.x; unsigned short k = length; double *b; b =
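
Independent of the reduction itself, the kernels will only run concurrently if they are launched into different non-default streams, the launches are independent, and the device supports concurrent kernel execution. A minimal sketch of just that launch pattern (the kernel here is illustrative filler, not the question's reduction):

```cuda
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)       // artificial work so overlap shows up in the profiler
            data[i] = data[i] * 0.999f + 0.001f;
}

int main(void)
{
    const int n = 1 << 16;
    const int nStreams = 4;
    float *d_buf[nStreams];
    cudaStream_t streams[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaMalloc(&d_buf[s], n * sizeof(float));
        cudaStreamCreate(&streams[s]);
        // Each launch goes into its own stream; with no dependencies between the launches,
        // the hardware is free to overlap them.
        busy_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_buf[s]);
    }
    return 0;
}
```

Whether overlap actually happens also depends on resource usage: if a single launch already fills the GPU, concurrent kernels serialize anyway.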

Dot Product in CUDA using atomic operations - getting wrong results

Submitted by 人走茶凉 on 2019-12-25 02:56:14
问题 I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following: #include <stdio.h> #define N (2048 * 8) #define THREADS_PER_BLOCK 512 #define num_t float // The kernel - DOT PRODUCT __global__ void dot(num_t *a, num_t *b, num_t *c) { __shared__ num_t temp[THREADS_PER_BLOCK]; int index = threadIdx.x + blockIdx.x * blockDim.x; temp[threadIdx.x] = a[index] * b[index]; __syncthreads(); //Synchronize! *c = 0
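
One visible problem in the excerpt is the *c = 0 after __syncthreads(): every thread of every block resets the accumulator, so one block can wipe out partial sums that other blocks have already added. A common fix is to zero c once on the host (e.g. with cudaMemset) before the launch and let each block contribute its partial sum with a single atomicAdd; a hedged sketch of that kernel (compute capability 2.0 or newer for float atomicAdd):

```cuda
#define THREADS_PER_BLOCK 512

__global__ void dot_atomic(const float *a, const float *b, float *c, int n)
{
    __shared__ float temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = (index < n) ? a[index] * b[index] : 0.0f;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic add per block; *c must be zeroed (e.g. cudaMemset) before the launch.
    if (threadIdx.x == 0)
        atomicAdd(c, temp[0]);
}
```

Even with the race fixed, expect small differences from MATLAB: the atomics accumulate in single precision and in a nondeterministic order, while MATLAB typically computes the dot product in double precision.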