gpgpu

Does NVidia support OpenCL SPIR?

Submitted by 眉间皱痕 on 2019-12-30 08:27:17
Question: I am wondering whether Nvidia supports a SPIR backend or not. If so, I could not find any documentation or sample examples about it; if not, is there any way to get a SPIR backend working on Nvidia GPUs? Thanks in advance. Answer 1: Since SPIR builds on top of OpenCL 1.2, and so far Nvidia has not made any OpenCL 1.2 drivers available, it is not possible to use SPIR with Nvidia GPUs. As mentioned in the comments, Nvidia has made PTX available as an intermediate language (also based on LLVM IR).
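
As a complement to the answer: whether a particular OpenCL driver can consume SPIR can be checked at run time by looking for the cl_khr_spir extension string. Below is a minimal sketch in plain OpenCL host code (error handling omitted; it only inspects the first GPU of the first platform, which is an assumption for brevity):

```cpp
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char extensions[4096] = {0};

    // Grab the first platform and its first GPU; a real program would enumerate all of them.
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);

    // SPIR consumption is advertised through the cl_khr_spir extension (requires OpenCL 1.2).
    if (strstr(extensions, "cl_khr_spir"))
        printf("cl_khr_spir present: SPIR binaries can be fed to clCreateProgramWithBinary.\n");
    else
        printf("No cl_khr_spir: on Nvidia, PTX is the available intermediate representation instead.\n");
    return 0;
}
```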

Python real time image classification problems with Neural Networks

Submitted by 吃可爱长大的小学妹 on 2019-12-30 01:36:14
Question: I'm attempting to use Caffe and Python to do real-time image classification. I'm using OpenCV to stream from my webcam in one process and, in a separate process, using Caffe to perform image classification on the frames pulled from the webcam. Then I'm passing the classification result back to the main thread to caption the webcam stream. The problem is that even though I have an NVIDIA GPU and am performing the Caffe predictions on the GPU, the main thread gets slowed down. Normally

Matrix-vector multiplication in CUDA: benchmarking & performance

Submitted by 烈酒焚心 on 2019-12-29 04:00:23
Question: I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific and I updated the code)... I implemented a kernel for matrix-vector multiplication in CUDA C, following the CUDA C Programming Guide, using shared memory. Let me first present some benchmark results obtained on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2) and a comparison with cuBLAS: Here I guess cuBLAS does some magic, since it seems that its execution is not affected
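
For readers who want a baseline to benchmark against, here is a hedged sketch of the comparison setup: a naive row-per-thread kernel (deliberately not the questioner's shared-memory version) next to the cuBLAS SGEMV call, both assuming column-major storage as cuBLAS expects. The function names are illustrative.

```cuda
#include <cublas_v2.h>

// One thread per row: y = A * x, with A stored column-major (m x n).
__global__ void matvec_naive(const float *A, const float *x, float *y, int m, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[col * m + row] * x[col];   // column-major indexing
    y[row] = sum;
}

// The cuBLAS equivalent for the comparison: y = 1.0 * A * x + 0.0 * y.
void matvec_cublas(cublasHandle_t handle, const float *dA, const float *dx, float *dy, int m, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
}
```

Timing both with CUDA events over many repetitions, rather than a single launch, is what usually makes the cuBLAS "magic" (tuned memory access and latency hiding) visible.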

How to create or manipulate GPU assembler?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-29 03:10:07
Question: Does anyone have experience in creating/manipulating GPU machine code, possibly at run time? I am interested in modifying GPU assembler code, possibly at run time with minimal overhead. Specifically, I'm interested in assembler-based genetic programming. I understand ATI has released ISAs for some of their cards, and Nvidia recently released a disassembler for CUDA for older cards, but I am not sure if it is possible to modify instructions in memory at run time or even beforehand. Is this
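
On the Nvidia side there is, to my knowledge, no supported way to patch the native machine code (SASS) in memory; the practical route is one level up, at PTX: inline PTX can be embedded in CUDA C, PTX generated at run time can be JIT-compiled through the driver API (cuModuleLoadData), and cuobjdump/nvdisasm show the final machine code. A small sketch of inline PTX (kernel and variable names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds two integers through a hand-written PTX instruction embedded in CUDA C.
__global__ void add_via_ptx(const int *a, const int *b, int *out)
{
    int r;
    asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a[0]), "r"(b[0]));
    out[0] = r;
}

int main(void)
{
    int h_a = 2, h_b = 3, h_out = 0;
    int *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);

    add_via_ptx<<<1, 1>>>(d_a, d_b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 3 = %d\n", h_out);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```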

CUDA Block and Grid size efficiencies

Submitted by 北城余情 on 2019-12-29 03:10:07
Question: What is the advised way of dealing with dynamically sized datasets in CUDA? Is it a case of 'set the block and grid sizes based on the problem set', or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the overspill? I can see how this probably matters a lot for the block dimensions, but how much does this matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks assigned to SMs
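
The common pattern for dynamically sized data is to keep the block size fixed (a multiple of the warp size), round the grid up, and guard against the overspill inside the kernel. A minimal sketch; the kernel and launcher names are illustrative:

```cuda
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // in-kernel guard for the rounded-up launch
        data[i] *= factor;
}

void launch_scale(float *d_data, int n, float factor)
{
    const int block = 256;                      // a multiple of the warp size (32)
    const int grid  = (n + block - 1) / block;  // round up so every element is covered
    scale<<<grid, block>>>(d_data, n, factor);
}
```

The grid dimension mostly just has to be large enough (and below the hardware launch limits); occupancy and coalescing are decided by the block shape, which is why the round-up-plus-guard idiom is usually sufficient.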

cuBLAS argmin - segfault if outputting to device memory?

Submitted by 谁说胖子不能爱 on 2019-12-29 01:40:09
Question: In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result) The cuBLAS programmer guide provides this information about the cublasIsamin() parameters: If I use host (CPU) memory for result, then cublasIsamin works properly. Here's an example: void argmin_experiment_hostOutput(){ float h_A[4] = {1, 2, 3, 4}; int N = 4; float* d_A = 0; CHECK_CUDART
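
If the crash happens because result points to device memory, the likely missing piece is the cuBLAS pointer mode: by default the library writes scalar results through a host pointer, so handing it a device pointer makes it segfault. A hedged sketch of the device-output variant (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    float h_A[4] = {4, 3, 2, 1};
    float *d_A = 0;
    int *d_result = 0;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Tell cuBLAS that scalar outputs (here: the index) live in device memory.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamin(handle, 4, d_A, 1, d_result);

    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("index of min |x| (1-based): %d\n", h_result);

    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_result);
    return 0;
}
```

Note that the returned index follows the 1-based Fortran convention, whatever the pointer mode.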

Doing readback from Direct3D textures and surfaces

Submitted by 六眼飞鱼酱① on 2019-12-28 12:08:28
Question: I need to figure out how to get the data from D3D textures and surfaces back to system memory. What's the fastest way to do such things, and how? Also, if I only need one subrect, how can one read back only that portion without having to read back the entire thing to system memory? In short, I'm looking for concise descriptions of how to copy the following to system memory: a texture, a subset of a texture, a surface, a subset of a surface, a D3DUSAGE_RENDERTARGET texture, a subset of a D3DUSAGE
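
One standard Direct3D 9 recipe, sketched below under the assumption of a render-target surface: copy the surface into a D3DPOOL_SYSTEMMEM offscreen plain surface with GetRenderTargetData, then LockRect the staging surface, passing a RECT if only a subrectangle is needed (the GPU-to-CPU copy itself is always the whole surface). Names are illustrative and error handling is omitted.

```cpp
#include <d3d9.h>

// Reads a sub-rectangle of a render-target surface back into system memory (D3D9).
void ReadBackSubRect(IDirect3DDevice9 *device, IDirect3DSurface9 *renderTarget,
                     UINT width, UINT height, D3DFORMAT format, const RECT &subRect)
{
    IDirect3DSurface9 *sysmem = NULL;
    // The staging surface must match the source surface's size and format.
    device->CreateOffscreenPlainSurface(width, height, format,
                                        D3DPOOL_SYSTEMMEM, &sysmem, NULL);

    // GPU -> system-memory copy of the whole surface (this call has no sub-rect variant).
    device->GetRenderTargetData(renderTarget, sysmem);

    // Lock only the portion of interest; rows are locked.Pitch bytes apart, not width * bpp.
    D3DLOCKED_RECT locked;
    sysmem->LockRect(&locked, &subRect, D3DLOCK_READONLY);
    // ... read the pixels from locked.pBits row by row ...
    sysmem->UnlockRect();
    sysmem->Release();
}
```

For a texture, fetch the level first with IDirect3DTexture9::GetSurfaceLevel and feed that surface to the same routine; lockable textures in D3DPOOL_MANAGED or D3DPOOL_SYSTEMMEM can instead be locked directly.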

Modifying registry to increase GPU timeout, Windows 7

Submitted by 廉价感情. on 2019-12-27 11:45:42
Question: I'm trying to increase the timeout on the GPU from its default setting of 2 seconds to something a little longer. I found the following link, but it appears it's slightly different in Windows 7, as I can't see anything mentioned on the web page. Has anyone done this before? If so, could you fill in the gaps please. Thanks @RoBik, so as follows if I want 6 days (a bit excessive, I know, but just for example)? Thanks again for your help, +1. EDIT This is the error I'm currently getting. An error has occurred
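
For reference, on Windows Vista/7 the timeout is controlled by the TDR (Timeout Detection and Recovery) values under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers; the relevant DWORD is TdrDelay, in seconds, with a default of 2, and a reboot is needed for the driver to pick the change up. A hedged sketch that sets it programmatically (requires administrator rights; editing the value in regedit achieves the same thing):

```cpp
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HKEY key;
    // Key that holds the TDR (Timeout Detection and Recovery) settings.
    LONG rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE,
                              "SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                              0, NULL, 0, KEY_SET_VALUE, NULL, &key, NULL);
    if (rc != ERROR_SUCCESS) { printf("open failed: %ld\n", rc); return 1; }

    // 6 days, as in the example above: 6 * 24 * 3600 seconds (the default is 2 seconds).
    DWORD seconds = 518400;
    rc = RegSetValueExA(key, "TdrDelay", 0, REG_DWORD,
                        (const BYTE *)&seconds, sizeof(seconds));
    if (rc != ERROR_SUCCESS) printf("write failed: %ld\n", rc);

    RegCloseKey(key);
    return 0;   // reboot afterwards for the new timeout to take effect
}
```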

Concurrent Kernel Launch Example - CUDA

Submitted by 扶醉桌前 on 2019-12-25 07:11:08
问题 I'm attempting to implement concurrent kernel launches for a very complex CUDA kernel, so I thought I'd start out with a simple example. It just launches a kernel which does a sum reduction. Simple enough. Here it is: #include <stdlib.h> #include <stdio.h> #include <time.h> #include <cuda.h> extern __shared__ char dsmem[]; __device__ double *scratch_space; __device__ double NDreduceSum(double *a, unsigned short length) { const int tid = threadIdx.x; unsigned short k = length; double *b; b =
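
Independent of the reduction itself, the kernels will only run concurrently if they are launched into different non-default streams, the launches are independent, and the device supports concurrent kernel execution. A minimal sketch of just that launch pattern (the kernel here is illustrative filler, not the question's reduction):

```cuda
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)       // artificial work so overlap shows up in the profiler
            data[i] = data[i] * 0.999f + 0.001f;
}

int main(void)
{
    const int n = 1 << 16;
    const int nStreams = 4;
    float *d_buf[nStreams];
    cudaStream_t streams[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaMalloc(&d_buf[s], n * sizeof(float));
        cudaStreamCreate(&streams[s]);
        // Each launch goes into its own stream; with no dependencies between the launches,
        // the hardware is free to overlap them.
        busy_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_buf[s]);
    }
    return 0;
}
```

Whether overlap actually happens also depends on resource usage: if a single launch already fills the GPU, concurrent kernels serialize anyway.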

Dot Product in CUDA using atomic operations - getting wrong results

Submitted by 人走茶凉 on 2019-12-25 02:56:14
问题 I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following: #include <stdio.h> #define N (2048 * 8) #define THREADS_PER_BLOCK 512 #define num_t float // The kernel - DOT PRODUCT __global__ void dot(num_t *a, num_t *b, num_t *c) { __shared__ num_t temp[THREADS_PER_BLOCK]; int index = threadIdx.x + blockIdx.x * blockDim.x; temp[threadIdx.x] = a[index] * b[index]; __syncthreads(); //Synchronize! *c = 0
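
One visible problem in the excerpt is the *c = 0 after __syncthreads(): every thread of every block resets the accumulator, so one block can wipe out partial sums that other blocks have already added. A common fix is to zero c once on the host (e.g. with cudaMemset) before the launch and let each block contribute its partial sum with a single atomicAdd; a hedged sketch of that kernel (compute capability 2.0 or newer for float atomicAdd):

```cuda
#define THREADS_PER_BLOCK 512

__global__ void dot_atomic(const float *a, const float *b, float *c, int n)
{
    __shared__ float temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = (index < n) ? a[index] * b[index] : 0.0f;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic add per block; *c must be zeroed (e.g. cudaMemset) before the launch.
    if (threadIdx.x == 0)
        atomicAdd(c, temp[0]);
}
```

Even with the race fixed, expect small differences from MATLAB: the atomics accumulate in single precision and in a nondeterministic order, while MATLAB typically computes the dot product in double precision.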