gpgpu

MPI Receive/Gather Dynamic Vector Length

Submitted by 大城市里の小女人 on 2019-12-07 04:32:54

Question: I have an application that stores a vector of structs. These structs hold information about each GPU on a system, such as memory and gigaflop/s. There is a different number of GPUs on each system. I have a program that runs on multiple machines at once, and I need to collect this data. I am very new to MPI, but I am able to use MPI_Gather() for the most part; however, I would like to know how to gather/receive these dynamically sized vectors. class MachineData { unsigned long hostMemory; long
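A minimal sketch of one common pattern for this (my illustration, not the asker's code): flatten the per-GPU records into a fixed-size POD struct, gather each rank's byte count with MPI_Gather, then collect the variable-length payloads with MPI_Gatherv. GpuInfo is a hypothetical stand-in for the MachineData contents; sending it as MPI_BYTE assumes all ranks are binary-compatible (a derived MPI datatype would be the portable alternative).

// Gather a different number of fixed-size records from every rank.
#include <mpi.h>
#include <vector>
#include <cstdio>

struct GpuInfo {              // hypothetical flat record standing in for MachineData
    unsigned long memoryMiB;
    double gflops;
};

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank has a different number of GPUs; fake some data here.
    std::vector<GpuInfo> local(rank + 1, GpuInfo{4096, 1000.0});
    int localBytes = static_cast<int>(local.size() * sizeof(GpuInfo));

    // Step 1: the root learns how many bytes each rank will contribute.
    std::vector<int> counts(size);
    MPI_Gather(&localBytes, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Step 2: the root builds displacements and receives with MPI_Gatherv.
    std::vector<int> displs(size, 0);
    int total = 0;
    if (rank == 0) {
        for (int i = 0; i < size; ++i) { displs[i] = total; total += counts[i]; }
    }
    std::vector<GpuInfo> all(rank == 0 ? total / sizeof(GpuInfo) : 0);
    MPI_Gatherv(local.data(), localBytes, MPI_BYTE,
                all.data(), counts.data(), displs.data(), MPI_BYTE,
                0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("gathered %zu GpuInfo records\n", all.size());

    MPI_Finalize();
    return 0;
}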

OpenCL image histogram

Submitted by 自闭症网瘾萝莉.ら on 2019-12-07 02:44:35

Question: I'm trying to write a histogram kernel in OpenCL to compute 256-bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this: const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST; __kernel void computeHistogram(read_only image2d_t input, __global int* rOutput, __global int* gOutput, __global int* bOutput) { int2 coords = {get_global_id(0), get_global_id(1)}; float4 sample = read_imagef(input, mSampler, coords); uchar rbin = floor
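The usual pitfall with a kernel of this shape is that many work items land in the same bin at once, so plain increments race. Below is a minimal sketch of how such a kernel might be completed with atomic increments (my guess at the intent, not the asker's actual code; assumes OpenCL 1.1+ global int32 atomics).

const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

__kernel void computeHistogram(read_only image2d_t input,
                               __global int* rOutput,
                               __global int* gOutput,
                               __global int* bOutput)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    float4 sample = read_imagef(input, mSampler, coords);

    // Map [0,1] to a bin index in [0,255]; clamp guards against out-of-range values.
    int rbin = clamp((int)(sample.x * 255.0f), 0, 255);
    int gbin = clamp((int)(sample.y * 255.0f), 0, 255);
    int bbin = clamp((int)(sample.z * 255.0f), 0, 255);

    // Plain rOutput[rbin]++ would race across work items; atomic_inc serializes the updates.
    atomic_inc(&rOutput[rbin]);
    atomic_inc(&gOutput[gbin]);
    atomic_inc(&bOutput[bbin]);
}

Note that the host still has to zero the three bin buffers before each launch, since the kernel only ever increments them.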

How to Step-by-Step Debug OpenCL GPU Applications under Windows with an NVIDIA GPU

Submitted by 淺唱寂寞╮ on 2019-12-07 01:45:17

Question: I would like to know whether you know of any way to step-by-step debug OpenCL kernels under Windows (my IDE is Visual Studio) while running them on an NVIDIA GPU. What I have found so far: with NVIDIA's Nsight you can only profile OpenCL applications, not debug them; the current version of gDEBugger from AMD only supports ATI/AMD GPUs; the old version of gDEBugger supports NVIDIA GPUs, but work on it was discontinued in Dec '10; the GDB debugger seems to support it, but is only available under

OpenCL FFT on both Nvidia and AMD hardware?

Submitted by 南笙酒味 on 2019-12-06 23:30:51

Question: I'm working on a project that needs to use FFTs on both NVIDIA and AMD graphics cards. I initially looked for a library that would work on both (thinking this would be the OpenCL way), but I wasn't having any luck. Someone suggested that I would have to use each vendor's FFT implementation and write a wrapper that chooses what to do based on the platform. I found AMD's implementation pretty easily, but I'm actually working with an NVIDIA card in the meantime (and this is the more
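A minimal sketch of the suggested wrapper idea (my own illustration, not an existing library): inspect the OpenCL platform vendor string at runtime and dispatch to a vendor-specific FFT backend. fftWithClFFT and fftWithCuFFT are hypothetical placeholders for whatever implementations get wired in, and a real wrapper would scan every platform rather than just the first.

#include <CL/cl.h>
#include <cstring>
#include <cstdio>

enum class FftBackend { Amd, Nvidia, Unknown };

FftBackend detectBackend()
{
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    if (numPlatforms == 0) return FftBackend::Unknown;

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);   // only checks the first platform

    char vendor[256] = {0};
    clGetPlatformInfo(platform, CL_PLATFORM_VENDOR, sizeof(vendor), vendor, nullptr);

    if (std::strstr(vendor, "Advanced Micro Devices")) return FftBackend::Amd;
    if (std::strstr(vendor, "NVIDIA"))                 return FftBackend::Nvidia;
    return FftBackend::Unknown;
}

void runFft(const float* in, float* out, size_t n)
{
    (void)in; (void)out; (void)n;              // unused in this skeleton
    switch (detectBackend()) {
        case FftBackend::Amd:    /* fftWithClFFT(in, out, n); */ break;   // placeholder
        case FftBackend::Nvidia: /* fftWithCuFFT(in, out, n); */ break;   // placeholder
        default: std::fprintf(stderr, "no usable FFT backend found\n"); break;
    }
}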

Manually set a 1D Texture in Metal

Submitted by 元气小坏坏 on 2019-12-06 13:47:54

Question: I'm trying to fill a 1D texture with values manually and pass that texture to a compute shader (these are 2 pixels that I want to set via code; they don't represent any image). Because of the currently small number of Metal examples, all the examples I could find deal with 2D textures and load the texture by converting a loaded UIImage to raw bytes data, but creating a dummy UIImage felt like a hack to me. This is the "naive" way I started with - ... var manualTextureData: [Float] = [ 1.0, 0.0, 0.0,

Compile and build .cl file using NVIDIA's nvcc Compiler?

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-06 13:05:34

Is it possible to compile a .cl file using NVIDIA's nvcc compiler? I am trying to set up Visual Studio 2010 to write OpenCL code on the CUDA platform, but when I select the CUDA C/C++ compiler to compile and build a .cl file, it gives me errors such as nvcc does not exist. What is the issue? You should be able to use nvcc to compile OpenCL code. Normally, I would suggest using a filename extension of .c for C-compliant code and .cpp for C++-compliant code(*); however, nvcc has filename-extension override options ( -x ... ) so that we can modify this behavior. Here is a worked example using CUDA 8.0.61,
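For illustration, here is a hypothetical minimal host file saved as vecadd.cl; it is ordinary C++ that calls the OpenCL runtime API, not kernel code, and the -x override is what lets nvcc accept the unfamiliar extension. Exact flags and include/library paths may vary with the toolkit version.

// Hypothetical host file "vecadd.cl". Build with something like:
//     nvcc -x cu vecadd.cl -o vecadd -lOpenCL
// (-x c++ works too; nvcc otherwise rejects the .cl extension.)
#include <CL/cl.h>
#include <cstdio>

int main()
{
    cl_uint numPlatforms = 0;
    cl_int err = clGetPlatformIDs(0, nullptr, &numPlatforms);
    if (err != CL_SUCCESS) {
        std::printf("clGetPlatformIDs failed: %d\n", err);
        return 1;
    }
    std::printf("found %u OpenCL platform(s)\n", numPlatforms);
    return 0;
}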

A question about the details of how blocks are distributed to SMs in CUDA

Submitted by 泪湿孤枕 on 2019-12-06 11:58:17

Let me take hardware with compute capability 1.3 as an example. 30 SMs are available, so at most 240 blocks can run at the same time (considering the limits on registers and shared memory, the restriction on the number of blocks may be much lower). Blocks beyond those 240 have to wait for hardware resources to become available. My question is when the blocks beyond 240 will be assigned to SMs: as soon as some of the first 240 blocks complete, or only when all of the first 240 blocks have finished? I wrote the following piece of code. #include<stdio.h> #include<string.h> #include<cuda_runtime.h>
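One way to observe the answer experimentally is sketched below (my illustration, not the asker's test code): each block records which SM it ran on and its start clock, then spins briefly so early blocks stay resident while later ones queue up. Blocks whose start times form a later wave were only assigned once earlier blocks finished. Note that clock64() needs hardware newer than the compute 1.3 parts discussed here, and its counters are per-SM, so cross-SM comparisons are only approximate.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void traceBlocks(unsigned int* smid, long long* start)
{
    if (threadIdx.x == 0) {
        unsigned int s;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(s));   // which SM this block runs on
        smid[blockIdx.x]  = s;
        start[blockIdx.x] = clock64();                   // when it started
    }
    // Busy-wait so each block occupies its SM for a visible amount of time.
    long long t0 = clock64();
    while (clock64() - t0 < 1000000) { }
}

int main()
{
    const int blocks = 1024;                             // far more blocks than SMs
    unsigned int* dSmid;  long long* dStart;
    cudaMalloc(&dSmid,  blocks * sizeof(unsigned int));
    cudaMalloc(&dStart, blocks * sizeof(long long));

    traceBlocks<<<blocks, 32>>>(dSmid, dStart);
    cudaDeviceSynchronize();

    unsigned int hSmid[blocks];  long long hStart[blocks];
    cudaMemcpy(hSmid,  dSmid,  sizeof(hSmid),  cudaMemcpyDeviceToHost);
    cudaMemcpy(hStart, dStart, sizeof(hStart), cudaMemcpyDeviceToHost);

    // Blocks whose start times form a later "wave" were assigned only after
    // earlier resident blocks on that SM finished.
    for (int i = 0; i < 8; ++i)
        std::printf("block %d: SM %u, start %lld\n", i, hSmid[i], hStart[i]);

    cudaFree(dSmid);  cudaFree(dStart);
    return 0;
}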

SIMD-8, SIMD-16 or SIMD-32 in OpenCL on GPGPU

Submitted by 笑着哭i on 2019-12-06 11:53:35

I have read a couple of questions on SO on this topic (SIMD mode), but I still need some clarification/confirmation of how things work: Why use SIMD if we have GPGPU? SIMD intrinsics - are they usable on gpus? CPU SIMD vs GPU SIMD? Are the following points correct if I compile the code in SIMD-8 mode? 1) It means 8 instructions from different work items are executed in parallel. 2) Does it mean all work items are executing only the same instruction? 3) If each work item's code contains only a vload16 load, then float16 operations, and then vstore16 operations, will SIMD-8 mode still work? I mean
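As a minimal sketch of the scenario in point 3 (my illustration): a kernel in which each work item does a vload16, a float16 operation, and a vstore16. The SIMD-8/16/32 choice describes how many work items the compiler packs side by side, which is orthogonal to this source; the per-work-item 16-wide vector math still has to be implemented whatever width is chosen.

__kernel void scale16(__global const float* in,
                      __global float* out,
                      const float factor)
{
    size_t gid = get_global_id(0);
    float16 v = vload16(gid, in);   // 16 consecutive floats for this work item
    v = v * factor;                 // per-work-item vector arithmetic
    vstore16(v, gid, out);
}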

How to profile PyCuda code in Linux?

Submitted by 痞子三分冷 on 2019-12-06 09:46:21

I have a simple (tested) PyCuda app and am trying to profile it. I've tried NVIDIA's Compute Visual Profiler, which runs the program 11 times and then emits this error: NV_Warning: Ignoring the invalid profiler config option: fb0_subp0_read_sectors Error : Profiler data file '/home/jguy/proj/gpu/tdbp/pyArch/temp_compute_profiler_0_0.csv' does not contain profiler output. This can happen when: a) Profiling is disabled during the entire run of the application. b) The application does not invoke any kernel launches or memory transfers. c) The application does not release resources (contexts, events,

OpenGL 4.0 GPU Draw Feature?

Submitted by 安稳与你 on 2019-12-06 07:22:26

In Wikipedia's and other sources' descriptions of OpenGL 4.0, I read about this feature: "Drawing of data generated by OpenGL or external APIs such as OpenCL, without CPU intervention." What is this referring to? Edit: It seems this must be referring to draw indirect (ARB_draw_indirect), which I believe somehow extends the draw phase to take its parameters from shader programs or from interop programs (OpenCL/CUDA, basically). It looks as if there are a few caveats and tricks to keeping the calls on the GPU for any extended amount of time past the second run, but it should be possible. If anyone can provide
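A minimal sketch of the draw-indirect mechanism the edit points at (GL 4.0 / ARB_draw_indirect; my illustration): the draw parameters live in a buffer object that a shader, transform feedback, or an interop API (OpenCL/CUDA) can write, and glDrawArraysIndirect issues the draw without the CPU ever reading those values back.

// Assumes an extension loader (GLEW here) and a current GL 4.0+ context.
#include <GL/glew.h>

// Layout mandated by the spec for glDrawArraysIndirect.
struct DrawArraysIndirectCommand {
    GLuint count;
    GLuint instanceCount;
    GLuint first;
    GLuint baseInstance;   // must be zero before GL 4.2
};

void setupAndDrawIndirect(GLuint vao)
{
    // The indirect buffer is seeded from the CPU once here, but a compute shader or
    // an OpenCL/CUDA kernel could just as well rewrite it on the GPU every frame.
    DrawArraysIndirectCommand cmd = { 3, 1, 0, 0 };   // e.g. one triangle
    GLuint indirectBuf;
    glGenBuffers(1, &indirectBuf);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
    glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(cmd), &cmd, GL_DYNAMIC_DRAW);

    // The draw reads count/instanceCount/first/baseInstance from the bound buffer.
    glBindVertexArray(vao);
    glDrawArraysIndirect(GL_TRIANGLES, (const void*)0);
}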