gpu-programming

Nsight skips (ignores) breakpoints in VS10; CUDA works fine, but Nsight consistently skips over several breakpoints

我怕爱的太早我们不能终老 submitted on 2019-12-05 10:48:39
I'm using Nsight 2.2, Toolkit 4.2, the latest NVIDIA driver, and a couple of GPUs in my computer, with Build Customization 4.2. I have set "generate GPU output" in the CUDA project properties, and the Nsight monitor is on (everything looks right). I set several breakpoints in my global kernel function. Nsight stops at the declaration of the function but skips over several breakpoints, as if it decides which breakpoints to hit and which to skip. The funny thing is that Nsight stops at for loops, but doesn't stop on simple assignment operations. One more problem is that I can't set …
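A common culprit (an assumption here, not confirmed in the excerpt) is that the device code is built without debug information, so the compiler folds simple assignments into neighboring instructions and the debugger has no instruction to stop on. Building the kernel with device debugging enabled and optimizations off usually makes line breakpoints land; a minimal sketch:

    // nvcc flags for a debuggable device build (file name is hypothetical):
    //   nvcc -G -g -O0 -o app kernel.cu
    __global__ void kernel(float* data) {
        int i = threadIdx.x;
        float t = data[i] * 2.0f;  // with -G this assignment is a real instruction,
        data[i] = t;               // so a breakpoint on it can actually be hit
    }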

Strange error while using cudaMemcpy: cudaErrorLaunchFailure

不羁的心 submitted on 2019-12-05 09:02:59
I have CUDA code that works like this:

    cpyData GPU --> CPU
    while (nsteps) {
        cudaKernel1<<<,>>>
        function1();
        cudaKernel2<<<,>>>
    }
    cpyData GPU --> CPU

And function1 looks like this:

    function1 {
        cudaKernel3<<<,>>>
        cudaKernel4<<<,>>>
        cpyNewNeedData CPU --> GPU    // Error line
        cudaKernel5<<<,>>>
    }

According to the cudaMemcpy documentation, this function can produce four different error codes: "cudaSuccess", "cudaErrorInvalidValue", "cudaErrorInvalidDevicePointer" and "cudaErrorInvalidMemcpyDirection". However, I get the following error, "cudaErrorLaunchFailure": "An exception occurred on the device while …
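Worth noting: cudaErrorLaunchFailure is not among cudaMemcpy's own four codes because it is a "sticky" error inherited from an earlier asynchronous kernel launch; the memcpy is merely the first synchronizing call that reports it. A minimal checking pattern (a sketch; the kernel here is a placeholder for cudaKernel4 above):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                          \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess)                                   \
                fprintf(stderr, "%s:%d %s\n", __FILE__, __LINE__,     \
                        cudaGetErrorString(err));                     \
        } while (0)

    __global__ void cudaKernel4() {}  // placeholder for the real kernel

    int main() {
        cudaKernel4<<<1, 1>>>();
        CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
        CUDA_CHECK(cudaDeviceSynchronize());  // surfaces asynchronous execution errors
        return 0;
    }

Checking after every launch this way usually pins the failure on the kernel that actually caused it rather than on the copy.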

Poor performance when calling cudaMalloc with 2 GPUs simultaneously

风格不统一 submitted on 2019-12-05 07:57:42
I have an application where I split the processing load among the GPUs on a user's system. Basically, there is one CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread. Consider the following image (generated using NVIDIA's CUDA profiler tool) as an example of a GPU processing interval; here the application is using a single GPU. As you can see, a big portion of the GPU processing time is consumed by the two sorting operations, for which I am using the Thrust library (thrust::sort_by_key). Also, it looks like thrust::sort_by_key …
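One mitigation for the repeated-allocation cost (a sketch under the assumption that the temporaries come from thrust::sort_by_key's internal cudaMalloc calls) is to hand Thrust a caching allocator through its execution policy, so the temporary buffer is allocated once and reused across sorts; cached_allocator below is illustrative, not a library type:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/execution_policy.h>
    #include <cstddef>

    // Deliberately tiny caching allocator: keeps one buffer alive and grows it
    // on demand, so repeated sorts stop calling cudaMalloc. It assumes Thrust
    // requests at most one temporary at a time; real code should keep a free list.
    struct cached_allocator {
        typedef char value_type;
        char* buf = nullptr;
        std::size_t capacity = 0;

        char* allocate(std::ptrdiff_t n) {
            if (static_cast<std::size_t>(n) > capacity) {
                if (buf) cudaFree(buf);
                cudaMalloc(&buf, n);
                capacity = static_cast<std::size_t>(n);
            }
            return buf;
        }
        void deallocate(char*, std::size_t) { /* keep the buffer for reuse */ }
        ~cached_allocator() { if (buf) cudaFree(buf); }
    };

    int main() {
        thrust::device_vector<int> keys(1 << 20), vals(1 << 20);
        cached_allocator alloc;
        for (int iter = 0; iter < 10; ++iter)  // temp storage allocated only once
            thrust::sort_by_key(thrust::cuda::par(alloc),
                                keys.begin(), keys.end(), vals.begin());
        return 0;
    }

This also sidesteps the cross-GPU contention, since each worker thread can own its allocator and stop hitting the driver's allocation path inside the processing interval.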

What are the programming languages for GPUs?

痴心易碎 submitted on 2019-12-05 07:00:36
I read an article stating that GPUs are the future of supercomputing. I would like to know which programming languages are used for programming GPUs. Kyle Lutz: OpenCL is the open, cross-platform solution and runs on both GPUs and CPUs. Another is CUDA, which is built by NVIDIA for their GPUs. HLSL and Cg are a few others. CUDA also has quite a few language ports: http://en.wikipedia.org/wiki/CUDA Source: https://stackoverflow.com/questions/4057548/what-are-the-programming-languages-for-gpu
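As a taste of the CUDA option named above, here is a minimal, purely illustrative CUDA C kernel; OpenCL targets the same hardware but exposes it through a C API rather than through language extensions:

    // Each thread adds one element pair; launched from host code as, e.g.:
    //   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    __global__ void add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }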

How do I test OpenCL on GPU when logged in remotely on Mac?

怎甘沉沦 submitted on 2019-12-04 18:36:24
Question: My OpenCL program can find the GPU device when I am logged in at the console, but not when I am logged in remotely with ssh. Further, if I run the program as root in the ssh session, the program can find the GPU. The computer is a Snow Leopard Mac with a GeForce 9400 GPU. If I run the program (see below) from the console or as root, the output is as follows (notice the "GeForce 9400" line):

    2 devices found
    Device #0 name = GeForce 9400
    Device #1 name = Intel(R) Core(TM)2 Duo CPU P8700 @ 2…
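For reference, output like the listing above can be produced by a device-enumeration loop along these lines (a hedged sketch, not the asker's exact program); over ssh the same loop would presumably report only the CPU device:

    #include <stdio.h>
    #include <OpenCL/opencl.h>   // Apple's header location; <CL/cl.h> elsewhere

    int main(void) {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint n = 0;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &n);
        printf("%u devices found\n", n);
        for (cl_uint i = 0; i < n; ++i) {
            char name[256];
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("Device #%u name = %s\n", i, name);
        }
        return 0;
    }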

How to configure OpenCL in Visual Studio 2010 for NVIDIA's GPU on Windows?

六月ゝ 毕业季﹏ submitted on 2019-12-04 14:23:23
I am using NVIDIA's GeForce GTX 480 GPU on the Windows 7 operating system on my ASUS laptop. I have already configured Visual Studio 2010 for CUDA 4.2. How do I configure OpenCL for NVIDIA's GPU in Visual Studio 2010? I have tried every possible way. Is it possible to use the CUDA toolkit (CUDA 4.2) and NVIDIA's GPU Computing SDK to program OpenCL? If yes, then how? If no, then what is the other way? KLee1: Yes, you should be able to use Visual Studio 2010 to program for OpenCL. It should simply be a case of making sure that you have the right include directories and libraries set up. Take a look …
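Once the include path points at the OpenCL headers shipped with the CUDA toolkit or the GPU Computing SDK and the linker is given OpenCL.lib, a smoke test like the following should build and run (the paths in the comments are typical, not authoritative):

    // Project settings (adjust to your install):
    //   C/C++  -> Additional Include Directories: <GPU Computing SDK>\OpenCL\common\inc
    //   Linker -> Additional Dependencies:        OpenCL.lib
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        char name[256];
        if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
            printf("no OpenCL platform found\n");
            return 1;
        }
        clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("platform: %s\n", name);   // expect the NVIDIA platform here
        return 0;
    }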

How can we generate random numbers in CUDA C with a different seed on each run?

落爺英雄遲暮 submitted on 2019-12-04 13:52:54
I am working on a stochastic process and I want to generate a different series of random numbers in the CUDA kernel each time I run the program. This is similar to what we do in C++ by declaring seed = time(NULL), followed by srand(seed) and rand(). I can pass seeds from host to device via the kernel, but the problem in doing this is that I would have to pass an entire array of seeds into the kernel for each thread to have a different random seed each time. Is there a way I could generate a random seed from the process id / machine time or something like that within the kernel and pass it as a seed? JackOLantern …
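The usual pattern (a sketch of the approach, not the answer verbatim) is to pass a single host-side seed such as time(NULL) and let each thread derive its own stream by using its thread index as the cuRAND subsequence, so no per-thread seed array is needed:

    #include <ctime>
    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    __global__ void randKernel(unsigned long long seed, float* out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        curandState state;
        // same seed for all threads, but a distinct subsequence per thread
        curand_init(seed, tid, 0, &state);
        out[tid] = curand_uniform(&state);
    }

    int main() {
        const int n = 256;
        float* d_out;
        cudaMalloc(&d_out, n * sizeof(float));
        unsigned long long seed = (unsigned long long)time(NULL); // differs per run
        randKernel<<<(n + 127) / 128, 128>>>(seed, d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }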

Are GPU Kepler CC3.0 processors not only a pipelined architecture, but also superscalar? [closed]

ぃ、小莉子 submitted on 2019-12-04 09:56:23
The documentation for CUDA 6.5 (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz3PIXMTktb, 5.2.3. Multiprocessor Level) says: "... 8L for devices of compute capability 3.x since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as mentioned in Compute Capability 3.x." Does this mean that GPU Kepler CC3.0 processors are not only a pipelined architecture, but also superscalar? Pipelining: these two sequences execute in parallel (different operations at one time):

    LOAD [addr1] -> ADD -> STORE [addr1] -> NOP
    NOP -> …
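To make the superscalar question concrete, a kernel with two independent arithmetic chains gives a CC 3.x dual-issue scheduler the opportunity to co-issue instructions from both chains; this is only an illustration of instruction-level parallelism, not a claim about how the hardware schedules this exact code:

    __global__ void ilpKernel(float* a, const float* b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = a[i];
        float y = b[i];
        // The x-chain and y-chain have no data dependence on each other,
        // so a superscalar (dual-issue) scheduler may issue them in pairs.
        x = x * 2.0f + 1.0f;
        y = y * 3.0f - 1.0f;
        x = x * x;
        y = y * y;
        a[i] = x + y;
    }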

TensorFlow new Op CUDA kernel memory management

限于喜欢 submitted on 2019-12-04 06:39:20
I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel. This Op requires a lot of dynamic memory allocation of variables which are not tensors and are deallocated after the Op is done; more specifically, it involves using a hash table. Right now I am using cudaMalloc() and cudaFree(), but I have noticed that TensorFlow has its own type called Eigen::GPUDevice, which has the ability to allocate and deallocate memory on the GPU. My questions: Is it best practice to use Eigen::GPUDevice to manage GPU memory? By using Eigen::GPUDevice instead of the CUDA API, am I "automatically …
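One commonly suggested alternative (a hedged sketch; the helper function and the sizes are hypothetical) is to let TensorFlow's own allocator hand out the scratch memory via OpKernelContext::allocate_temp, so the hash-table storage participates in TF's memory pooling and is released automatically when the temporary tensor goes out of scope:

    #include "tensorflow/core/framework/op_kernel.h"

    using namespace tensorflow;

    // Hypothetical helper called from an OpKernel's Compute() method.
    void RunWithScratch(OpKernelContext* context, int64 num_bytes) {
        Tensor scratch;
        // Memory comes from TF's GPU allocator rather than a raw cudaMalloc,
        // so it is pooled and freed when `scratch` is destroyed.
        OP_REQUIRES_OK(context, context->allocate_temp(
            DT_UINT8, TensorShape({num_bytes}), &scratch));
        void* dev_ptr = scratch.flat<uint8>().data();
        // ... launch the CUDA kernel with dev_ptr as the hash-table storage ...
        (void)dev_ptr;
    }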

How to use the Hadoop MapReduce framework for an OpenCL application?

我的梦境 submitted on 2019-12-04 06:07:28
Question: I am developing an application in OpenCL whose basic objective is to implement a data-mining algorithm on a GPU platform. I want to use the Hadoop Distributed File System and to execute the application on multiple nodes. I am using the MapReduce framework, and I have divided my basic algorithm into two parts, i.e. 'Map' and 'Reduce'. I have never worked in Hadoop before, so I have some questions: Do I have to write my application in Java only to use Hadoop and the MapReduce framework? I have written kernel …
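One route that avoids writing the whole application in Java (assuming Hadoop Streaming is available on the cluster) is to keep the OpenCL code in a native binary used as the Streaming mapper: such a mapper only has to read records on stdin and emit tab-separated key/value pairs on stdout. A minimal sketch of the mapper shell, with the OpenCL work elided:

    #include <iostream>
    #include <string>

    // Hypothetical Streaming mapper: each input line is one record.
    // The OpenCL kernel invocation would replace the body of the loop.
    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {
            // ... hand `line` (or a batch of lines) to the OpenCL kernel ...
            std::cout << line.substr(0, 8) << '\t' << 1 << '\n';  // key \t value
        }
        return 0;
    }

Hadoop Streaming would then be launched with its -mapper option pointing at this binary, so the Java side is limited to the framework itself.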