gpgpu

Is it possible to span an OpenCL kernel to run concurrently on CPU and GPU?

拥有回忆 submitted on 2019-12-05 15:21:05
Question: Let's assume that I have a computer which has a multicore processor and a GPU. I would like to write an OpenCL program which runs on all cores of the platform. Is this possible, or do I need to choose a single device on which to run the kernel?

Answer: In theory yes, you can; the CL API allows it. But the platform/implementation must support it, and I don't think most CL implementations do. To do it, get the cl_device_id of the CPU device and the GPU device, and create a context with those two devices using clCreateContext. No, you can't automagically span a kernel across both CPU and GPU; it's either one or the other.
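Since a single enqueue cannot span two devices, the usual follow-up is to create one command queue per device in the shared context and enqueue a sub-range on each. A minimal sketch of the host-side split, runnable without OpenCL (the helper name and the 75/25 ratio are illustrative, not from the original answer):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Split a 1-D global work size between two devices (e.g. GPU and CPU).
// gpuShare is the fraction of work given to the GPU; in real code you
// would tune this ratio empirically per workload.
struct SubRange { std::size_t offset, count; };

inline std::pair<SubRange, SubRange> splitWork(std::size_t n, double gpuShare) {
    std::size_t gpuCount = static_cast<std::size_t>(n * gpuShare);
    SubRange gpu{0, gpuCount};
    SubRange cpu{gpuCount, n - gpuCount};
    // Each sub-range would be passed as global_work_offset / global_work_size
    // to clEnqueueNDRangeKernel on that device's own command queue.
    return {gpu, cpu};
}
```

The two enqueues can then run concurrently, with a clFinish (or event wait) on both queues before reading results back.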

Error compiling Cuda - expected primary-expression

你说的曾经没有我的故事 submitted on 2019-12-05 15:09:21
Question: This program seems fine, but I still get an error. Any suggestions? Program:

    #include "dot.h"
    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        int *a, *b, *c;
        int *dev_a, *dev_b, *dev_c;
        int size = N * sizeof(int);

        cudaMalloc((void**)&dev_a, size);
        cudaMalloc((void**)&dev_b, size);
        cudaMalloc((void**)&dev_c, sizeof(int));

        a = (int *)malloc(size);
        b = (int *)malloc(size);
        c = (int *)malloc(sizeof(int));

        random_ints(a, N);
        random_ints(b, N
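Independent of the specific compiler error, the snippet's logic can be mirrored on the host to sanity-check it. In this sketch, N and random_ints are local stand-ins, since dot.h is not shown in the question:

```cpp
#include <cassert>
#include <cstdlib>

const int N = 512;  // stand-in for whatever dot.h defines

// Stand-in for the random_ints() helper the question relies on.
void random_ints(int* p, int n) {
    for (int i = 0; i < n; ++i) p[i] = std::rand() % 10;
}

// Host-side dot product mirroring what the CUDA kernel would compute,
// useful as a reference result when debugging the device version.
int dot(const int* a, const int* b, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```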

Is there a CUDA smart pointer?

你说的曾经没有我的故事 submitted on 2019-12-05 14:53:47
If not, what is the standard way to free cudaMalloc'ed memory when an exception is thrown? (Note that I am unable to use Thrust.) You can use the RAII idiom and put your cudaMalloc() and cudaFree() calls in the constructor and destructor of your object, respectively. Once the exception is thrown, your destructor will be called, which will free the allocated memory. If you wrap this object in a smart pointer (or make it behave like a pointer), you will get your CUDA smart pointer. Source: https://stackoverflow.com/questions/16509414/is-there-a-cuda-smart-pointer
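A sketch of that RAII wrapper. Here malloc/free stand in for cudaMalloc/cudaFree so the pattern runs without the CUDA runtime; the class name and helper are illustrative:

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <stdexcept>

// RAII wrapper for a device allocation. malloc/free stand in for
// cudaMalloc/cudaFree so this sketch runs on the host.
class DeviceBuffer {
    void* ptr_ = nullptr;
public:
    explicit DeviceBuffer(std::size_t bytes) {
        ptr_ = std::malloc(bytes);          // real code: cudaMalloc(&ptr_, bytes)
        if (!ptr_) throw std::bad_alloc();
    }
    ~DeviceBuffer() { std::free(ptr_); }    // real code: cudaFree(ptr_)
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
    void* get() const { return ptr_; }
};

// Even if an exception fires after the allocation, ~DeviceBuffer still
// runs during stack unwinding and frees the memory.
bool surviveException() {
    try {
        DeviceBuffer buf(256);
        throw std::runtime_error("simulated kernel failure");
    } catch (const std::runtime_error&) {
        return true;  // buffer was already freed by its destructor
    }
    return false;  // not reached
}
```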

Which memory access pattern is more efficient for a cached GPU?

耗尽温柔 submitted on 2019-12-05 10:58:23
So let's say I have a global array in memory:

    |a|b|c| |e|f|g| |i|j|k| |

There are four 'threads' (local work items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):

         0 -> 1 -> 2 -> 3
    t1   a -> b -> c -> .
    t2   e -> f -> g -> .
    t3   i -> j -> k -> .
    t4   . -> . -> . -> .

The above pattern splits the array into blocks, with each thread iterating to and accessing the next element in a block per time slice. I believe this sort of access would work well for CPUs because it maximizes cache locality per thread. Also, loops utilizing this pattern
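The two candidate patterns can be written as index functions. The blocked one gives each thread its own contiguous chunk (good per-thread cache locality on CPUs), while the interleaved one has adjacent threads touch adjacent elements at each step, which is what GPU memory coalescing rewards. Function names are illustrative:

```cpp
#include <cassert>
#include <cstddef>

// Blocked: thread t walks its own contiguous chunk of size n/threads.
std::size_t blockedIndex(std::size_t t, std::size_t step,
                         std::size_t n, std::size_t threads) {
    return t * (n / threads) + step;
}

// Interleaved (coalesced): at each step, adjacent threads read adjacent
// elements, so one memory transaction can serve a whole warp/wavefront.
std::size_t interleavedIndex(std::size_t t, std::size_t step,
                             std::size_t n, std::size_t threads) {
    (void)n;  // unused here; kept for a matching signature
    return step * threads + t;
}
```

Both mappings visit every element exactly once; only the order, and therefore the cache/coalescing behavior, differs.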

Nsight skips (ignores) breakpoints in VS10; CUDA works fine, but Nsight consistently skips over several breakpoints

我怕爱的太早我们不能终老 submitted on 2019-12-05 10:48:39
I'm using Nsight 2.2, Toolkit 4.2, the latest NVIDIA driver, and a couple of GPUs in my computer, with Build Customization 4.2. I have set "generate GPU output" in the CUDA project's properties, and the Nsight Monitor is on (everything looks great). I set several breakpoints in my __global__ kernel function. Nsight stops at the declaration of the function but skips over several breakpoints; it's as if Nsight decides whether to hit a breakpoint or skip over it. The funny thing is that Nsight stops at for loops but doesn't stop on simple assignment operations. One more problem is that I can't set

MPI Receive/Gather Dynamic Vector Length

白昼怎懂夜的黑 submitted on 2019-12-05 10:07:55
I have an application that stores a vector of structs. These structs hold information about each GPU on a system, such as memory and gigaflop/s. There is a different number of GPUs on each system. I have a program that runs on multiple machines at once, and I need to collect this data. I am very new to MPI but am able to use MPI_Gather() for the most part; however, I would like to know how to gather/receive these dynamically sized vectors.

    class MachineData {
        unsigned long hostMemory;
        long cpuCores;
        int cudaDevices;
    public:
        std::vector<NviInfo> nviVec;
        std::vector<AmdInfo> amdVec;
        ...
    };
    struct
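The standard recipe for dynamically sized data is two collectives: MPI_Gather one int per rank (each vector's length), then MPI_Gatherv with a displacement array computed from those counts. The displacement step is plain prefix-sum arithmetic and can be sketched without MPI (function name illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given the per-rank element counts that MPI_Gather collected on the root,
// compute the displacement array MPI_Gatherv expects: displs[i] is the
// offset in the receive buffer where rank i's data begins.
std::vector<int> displacementsFromCounts(const std::vector<int>& counts) {
    std::vector<int> displs(counts.size(), 0);
    for (std::size_t i = 1; i < counts.size(); ++i)
        displs[i] = displs[i - 1] + counts[i - 1];
    return displs;
}
```

On the root, the receive buffer is then sized to the sum of the counts, and counts/displs are passed straight to MPI_Gatherv. Note that structs like NviInfo also need a registered MPI datatype (e.g. via MPI_Type_create_struct) rather than raw bytes.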

How to manage same CUDA kernel call from multiple CPU threads?

梦想与她 submitted on 2019-12-05 09:45:55
Question: I have a CUDA kernel which works fine when called from a single CPU thread. However, when the same kernel is called from multiple CPU threads (~100), most of the kernel launches seem not to be executed at all, as the results come out all zeros. Can someone please guide me on how to resolve this problem? In the current version of the kernel I am using a cudaDeviceSynchronize() at the end of the kernel call. Will adding a sync command before cudaMalloc() and the kernel call be of any help in this case? There is another
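One way to make many host threads share one device safely is to serialize the launch-and-synchronize sequence behind a mutex (per-thread streams are the higher-throughput alternative). A runnable sketch where launchKernel is a placeholder for "kernel launch + cudaDeviceSynchronize()", not the asker's actual code:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex gpuMutex;
int completedLaunches = 0;  // stands in for results the kernel would write

// Placeholder for: kernel<<<grid, block>>>(...); cudaDeviceSynchronize();
// The lock ensures only one host thread talks to the device at a time.
void launchKernel() {
    std::lock_guard<std::mutex> lock(gpuMutex);
    ++completedLaunches;
}

int runFromManyThreads(int nThreads) {
    completedLaunches = 0;
    std::vector<std::thread> ts;
    for (int i = 0; i < nThreads; ++i) ts.emplace_back(launchKernel);
    for (auto& t : ts) t.join();
    return completedLaunches;
}
```

Without the lock, the increment (like unsynchronized per-thread device allocations and launches) would race; with it, all 100 "launches" complete.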

Misaligned address in CUDA

给你一囗甜甜゛ submitted on 2019-12-05 08:10:46
Question: Can anyone tell me what's wrong with the following code inside a CUDA kernel?

    __constant__ unsigned char MT[256] = {
        0xde, 0x6f, 0x6f, 0xb1, 0xde, 0x6f, 0x6f, 0xb1,
        0x91, 0xc5, 0xc5, 0x54, 0x91, 0xc5, 0xc5, 0x54, ....
    };
    typedef unsigned int U32;

    __global__ void Kernel(unsigned int *PT, unsigned int *CT, unsigned int *rk)
    {
        long int i;
        __shared__ unsigned char sh_MT[256];
        for (i = 0; i < 64; i += 4)
            ((U32*)sh_MT)[threadIdx.x + i] = ((U32*)MT)[threadIdx.x + i];
        __shared__ unsigned int sh_rkey[4]
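A common cause of "misaligned address" in code like this is reinterpreting a byte array through a wider pointer type: casting an unsigned char* to U32* is only valid when the address satisfies the 4-byte alignment of U32. A memcpy-based load has no alignment requirement and typically compiles to the same instruction when alignment allows. A host-side illustration (function name illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Read a 32-bit word starting at an arbitrary byte offset. Dereferencing
// (uint32_t*)(bytes + byteOffset) is undefined behavior unless the address
// is a multiple of alignof(uint32_t); memcpy is safe at any offset.
std::uint32_t loadWord(const unsigned char* bytes, std::size_t byteOffset) {
    std::uint32_t w;
    std::memcpy(&w, bytes + byteOffset, sizeof(w));
    return w;
}
```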

Calculating achieved bandwidth and flops/Gflops, and evaluate CUDA kernel performance

ⅰ亾dé卋堺 submitted on 2019-12-05 06:56:07
Question: Most of the papers report the flops/Gflops and achieved bandwidth of their CUDA kernels. I have also read answers on Stack Overflow to the following questions:

- How to evaluate CUDA performance?
- How Do You Profile & Optimize CUDA Kernels?
- How to calculate Gflops of a kernel
- Counting FLOPS/GFLOPS in program - CUDA
- How to calculate the achieved bandwidth of a CUDA kernel

Most of this seems OK, but it still does not make me feel comfortable calculating these things. Can anyone write a simple
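The usual definitions are: effective bandwidth = (bytes read + bytes written) / elapsed time, and GFLOP/s = floating-point operations / elapsed time. A runnable sketch applying them to a vector add c = a + b over n floats, which does 1 flop, 2 reads, and 1 write per element (the timing value below is an assumed example, not a measurement):

```cpp
#include <cassert>
#include <cmath>

// Effective bandwidth in GB/s: total bytes moved divided by seconds.
double effectiveBandwidthGB(double bytesRead, double bytesWritten,
                            double seconds) {
    return (bytesRead + bytesWritten) / seconds / 1e9;
}

// Throughput in GFLOP/s: floating-point operations divided by seconds.
double gflops(double flopCount, double seconds) {
    return flopCount / seconds / 1e9;
}
```

For n = 1e6 floats finishing in a hypothetical 1 ms: reads = 2 * n * 4 bytes, writes = n * 4 bytes, giving 12 GB/s effective bandwidth and 1 GFLOP/s, far more bytes than flops, which is why vector add is bandwidth-bound.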

How to Step-by-Step Debug OpenCL GPU Applications under Windows with a NVidia GPU

£可爱£侵袭症+ submitted on 2019-12-05 06:40:50
I would like to know whether you know of any way to step-by-step debug an OpenCL kernel under Windows (my IDE is Visual Studio) while running OpenCL kernels on an NVIDIA GPU. What I have found so far:

- with NVIDIA's Nsight you can only profile OpenCL applications, but not debug them
- the current version of gDEBugger from AMD only supports ATI/AMD GPUs
- the old version of gDEBugger supports NVIDIA GPUs, but work on it was discontinued in Dec '10
- the GDB debugger seems to support it, but is only available under Linux
- the Intel OpenCL SDK brings a debugger, but it only works while running the code on the CPU, not