gpu-programming

Compile and build .cl file using NVIDIA's nvcc Compiler?

亡梦爱人 submitted on 2019-12-08 02:49:44
Question: Is it possible to compile a .cl file using NVIDIA's nvcc compiler? I am trying to set up Visual Studio 2010 to write OpenCL code under the CUDA platform, but when I select the CUDA C/C++ Compiler to compile and build the .cl file, it gives me errors such as "nvcc does not exist". What is the issue? Answer 1: You should be able to use nvcc to compile OpenCL code. Normally, I would suggest using a filename extension of .c for C-compliant code and .cpp for C++-compliant code(*); however, nvcc has filename extension …

Handling Ctrl+C exception with GPU

断了今生、忘了曾经 submitted on 2019-12-08 00:48:31
I am working with some GPU programs (using CUDA 4.1 and C), and sometimes (rarely) I have to kill the program midway with Ctrl+C to handle an exception. Earlier I tried using the cudaDeviceReset() function, but this reply by talonmies shook my trust in cudaDeviceReset(), so I started handling such exceptions the old-fashioned way: restarting the computer. As the project grows, this method is becoming a headache. I would appreciate it if anyone has come up with a better solution. harrism: I think this question is more fundamental -- it is really an app design issue and not a CUDA …

Strange error while using cudaMemcpy: cudaErrorLaunchFailure

大憨熊 submitted on 2019-12-07 05:13:25
Question: I have a CUDA code which works like below: cpyDataGPU --> CPU while(nsteps){ cudaKernel1<<<,>>> function1(); cudaKernel2<<<,>>> } cpyDataGPU --> CPU And function1 is like this: function1{ cudaKernel3<<<,>>> cudaKernel4<<<,>>> cpyNewNeedDataCPU --> GPU // Error line cudaKernel5<<<,>>> } According to the cudaMemcpy documentation, this function can produce four different error codes: "cudaSuccess", "cudaErrorInvalidValue", "cudaErrorInvalidDevicePointer" and "cudaErrorInvalidMemcpyDirection". However …

OpenCL image histogram

自闭症网瘾萝莉.ら submitted on 2019-12-07 02:44:35
Question: I'm trying to write a histogram kernel in OpenCL to compute 256-bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this: const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST; __kernel void computeHistogram(read_only image2d_t input, __global int* rOutput, __global int* gOutput, __global int* bOutput) { int2 coords = {get_global_id(0), get_global_id(1)}; float4 sample = read_imagef(input, mSampler, coords); uchar rbin = floor …

Poor performance when calling cudaMalloc with 2 GPUs simultaneously

做~自己de王妃 submitted on 2019-12-07 02:27:22
Question: I have an application where I split the processing load among the GPUs on a user's system. Basically, there is a CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread. Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval -- here the application is using a single GPU. As you can see, a big portion of the GPU processing time is consumed by the two sorting operations …

Compile and build .cl file using NVIDIA's nvcc Compiler?

僤鯓⒐⒋嵵緔 submitted on 2019-12-06 13:05:34
Is it possible to compile a .cl file using NVIDIA's nvcc compiler? I am trying to set up Visual Studio 2010 to write OpenCL code under the CUDA platform, but when I select the CUDA C/C++ Compiler to compile and build the .cl file, it gives me errors such as "nvcc does not exist". What is the issue? You should be able to use nvcc to compile OpenCL code. Normally, I would suggest using a filename extension of .c for C-compliant code and .cpp for C++-compliant code(*); however, nvcc has filename-extension override options ( -x ... ) so that we can modify the behavior. Here is a worked example using CUDA 8.0.61, …
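The shape of such a build might look like the following (a sketch, not a verified recipe: the filenames are hypothetical, and it assumes a CUDA toolkit with the OpenCL headers and library on the search paths). Note that nvcc's -x flag only controls how the *host* source is treated; the .cl kernel source is normally compiled at run time by clBuildProgram, not by nvcc:

```shell
# Host code in a file with a nonstandard extension can be forced through
# nvcc's C++ path with -x; link against the OpenCL ICD loader.
nvcc -x c++ ocl_host.cpp -o ocl_host -lOpenCL

# Or treat the file as CUDA C++ if it mixes in CUDA runtime calls:
nvcc -x cu ocl_host.cpp -o ocl_host -lOpenCL
```

If Visual Studio reports "nvcc does not exist", the CUDA toolkit's bin directory is simply not on the build's PATH, which is a separate problem from the file extension.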

A question about the details of how blocks are distributed to SMs in CUDA

泪湿孤枕 submitted on 2019-12-06 11:58:17
Let me take hardware with compute capability 1.3 as an example: 30 SMs are available, so at most 240 blocks can be resident at the same time (considering the limits on registers and shared memory, the actual limit on the number of blocks may be much lower). Blocks beyond those 240 have to wait for hardware resources to become available. My question is when the blocks beyond 240 will be assigned to SMs: as soon as some of the first 240 blocks complete, or only once all of the first 240 blocks are finished? I wrote such a piece of code. #include<stdio.h> #include<string.h> #include<cuda_runtime.h> …

F#/“Accelerator v2” DFT algorithm implementation probably incorrect

强颜欢笑 submitted on 2019-12-06 04:41:36
I'm trying to experiment with software-defined radio concepts. Following this article, I've tried to implement a GPU-parallel Discrete Fourier Transform. I'm pretty sure I could pre-calculate 90 degrees of sin(i) and cos(i) and then just flip and repeat rather than what I'm doing in this code, and that this would speed it up. But so far, I don't even think I'm getting correct answers. An all-zeros input gives a 0 result, as I'd expect, but all-0.5 inputs give 78.9985886f (I'd expect a 0 result in this case too). Basically, I'm just generally confused. I don't have any good input data and I don …

Tensorflow new Op CUDA kernel memory management

风格不统一 submitted on 2019-12-06 02:57:34
Question: I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel. This Op requires a lot of dynamic allocation of variables which are not tensors and are deallocated after the op finishes; more specifically, it involves using a hash table. Right now I am using cudaMalloc() and cudaFree(), but I have noticed that TensorFlow has its own type called Eigen::GPUDevice, which has the ability to allocate and deallocate memory on the GPU. My questions: Is it best practice to use Eigen …

CUDA Visual Studio 2010 Express build error

南笙酒味 submitted on 2019-12-05 15:55:36
I am trying to get started with CUDA programming on Windows, using Visual Studio 2010 Express on 64-bit Windows 7. It took me a while to set up the environment, and I just wrote my first program, helloWorld.cu :) Currently I am working with the following program: #include <stdio.h> __global__ void add(int a, int b, int *c){ *c = a + b; } int main(void){ int c; int *dev_c; HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) ); add<<<1,1>>>(2, 7, dev_c); HANDLE_ERROR( cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost ) ); printf("2 + 7 = %d\n", c); cudaFree( dev_c ); return 0; }