gpgpu

Coding a CUDA Kernel that has many threads writing to the same index?

大憨熊 submitted on 2019-12-11 05:17:06
Question: I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron. So here is the kernel code, and I'll try to explain it a bit more clearly with the variables. __global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength) { int nx = threadIdx.x + TILE_WIDTH*threadIdx.y; int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
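
When many threads accumulate into the same output slot, plain additions race and updates get lost; the usual fix is an atomic accumulation or a per-block reduction. Below is a minimal sketch of the atomic approach, not the poster's full kernel: only the array names come from the question, and the assumption that sourceTargetArray[idx].y holds the target neuron index is mine.

    // Minimal sketch: accumulate each connection's weight into its target
    // neuron's slot with atomicAdd so concurrent writers do not clobber
    // each other. Indexing scheme is an assumption, not the poster's code.
    __global__ void kernelSumWeightsAtomic(float* sumArray, const float* weightArray,
                                           const int2* sourceTargetArray, int cLength)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < cLength) {
            int targetNeuron = sourceTargetArray[idx].y;  // .y assumed to be the target index
            atomicAdd(&sumArray[targetNeuron], weightArray[idx]);
        }
    }

Float atomicAdd on global memory requires compute capability 2.0 or newer; if one neuron receives very many inputs, a shared-memory reduction per block followed by one atomic per block is usually faster.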

terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)9>'

ε祈祈猫儿з submitted on 2019-12-11 05:06:19
Question: I am a newbie in SYCL/OpenCL/GPGPU. I am trying to build and run the sample code of a constant-addition program: #include <iostream> #include <array> #include <algorithm> #include <CL/sycl.hpp> namespace sycl = cl::sycl; //<<Define ConstantAdder>> template<typename T, typename Acc, size_t N> class ConstantAdder { public: ConstantAdder(Acc accessor, T val) : accessor(accessor) , val(val) {} void operator() () { for (size_t i = 0; i < N; i++) { accessor[i] += val; } } private: Acc accessor; const T
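
For reference, here is a minimal, self-contained variant of the same constant-addition pattern in SYCL 1.2.1 (cl::sycl) naming, as used in the question. Selecting the device explicitly is my addition; it often turns a cryptic runtime exception into a clearer failure when no OpenCL device or runtime is available.

    #include <CL/sycl.hpp>
    #include <array>
    #include <iostream>
    namespace sycl = cl::sycl;

    int main() {
        std::array<int, 4> data{1, 2, 3, 4};
        {
            sycl::default_selector selector;   // assumption: swap in a different selector if no device is found
            sycl::queue q(selector);
            sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(data.size()));
            q.submit([&](sycl::handler& cgh) {
                auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
                cgh.single_task<class constant_adder>([=]() {
                    for (size_t i = 0; i < 4; i++) acc[i] += 5;   // add the constant 5
                });
            });
        }   // buffer goes out of scope here and copies results back to 'data'
        for (int v : data) std::cout << v << " ";
        std::cout << std::endl;
    }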

Dynamic programming in CUDA: global memory allocations to exchange data with child kernels

萝らか妹 submitted on 2019-12-11 04:43:39
Question: I have the following code: __global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M) { int i = threadIdx.x + blockDim.x * blockIdx.x; [...] double phi_cap1, phi_cap2; if(i<M) { for(int m=0; m<(2*K+1); m++) { [calculate phi_cap1]; for(int n=0; n<(2*K+1); n++) { [calculate phi_cap2]; [calculate phi_cap=phi_cap1*phi_cap2]; [use phi_cap]; } } } } I would like to use
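
For exchanging data between a parent kernel and child kernels under CUDA dynamic parallelism, the common pattern is a global-memory allocation made on the device and passed by pointer to the child launch. The sketch below is a hedged illustration of that pattern, not the poster's interpolation code; it needs sm_35 or newer and compilation with -rdc=true.

    // Parent allocates scratch space from the device heap, hands the pointer
    // to a child kernel launched from the GPU, and frees it afterwards.
    __global__ void child(double* scratch, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) scratch[i] *= 2.0;            // stand-in for the real phi_cap work
    }

    __global__ void parent(int n)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            double* scratch = (double*)malloc(n * sizeof(double));  // global-memory heap allocation
            if (scratch != NULL) {
                for (int i = 0; i < n; ++i) scratch[i] = (double)i; // fill with example data
                child<<<(n + 255) / 256, 256>>>(scratch, n);
                cudaDeviceSynchronize();          // wait for the child before reusing the buffer
                free(scratch);
            }
        }
    }

The device heap used by malloc() is small by default and can be enlarged from the host with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).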

Reduce multiple blocks of equal length that are arranged in a big vector using CUDA

点点圈 submitted on 2019-12-11 04:25:34
Question: I am looking for a fast way to reduce multiple blocks of equal length that are arranged as one big vector. I have N subarrays (contiguous elements) arranged in one big array. Each subarray has a fixed size k, so the size of the whole array is N*k. What I'm doing is to call the kernel N times; each time it computes the reduction of one subarray, as follows. I iterate over all the subarrays contained in the big vector: for(i=0;i<N;i++){ thrust::device_vector< float > Vec
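
The usual alternative to N separate reductions is a single launch with one block per subarray, each block reducing its k elements in shared memory. The sketch below illustrates that pattern under my own naming; it assumes blockDim.x is a power of two.

    __global__ void reducePerSegment(const float* in, float* out, int K)
    {
        extern __shared__ float sdata[];
        const float* segment = in + blockIdx.x * K;     // this block's subarray
        float sum = 0.0f;
        for (int i = threadIdx.x; i < K; i += blockDim.x)
            sum += segment[i];                          // strided pass over the segment
        sdata[threadIdx.x] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
            if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = sdata[0];
    }

    // Launch example: reducePerSegment<<<N, 256, 256 * sizeof(float)>>>(d_in, d_out, K);

Staying inside Thrust, thrust::reduce_by_key with a key per segment achieves the same result in one call.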

GPU YUV to RGB. Worth the effort?

江枫思渺然 submitted on 2019-12-11 03:47:47
Question: I have to convert several full PAL videos (720x576@25) from YUV 4:2:2 to RGB, in real time, and probably apply a custom resize to each. I have thought of using the GPU, as I have seen an example that does just this (except that it's 4:4:4, so the bpp is the same in source and destination) -- http://www.fourcc.org/source/YUV420P-OpenGL-GLSLang.c However, I don't have any experience with using GPUs and I'm not sure of what can be done. The example, as I understand it, just converts the video frame to
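
The per-pixel arithmetic itself is tiny, which is why it maps well to a shader or a GPU kernel. The sketch below shows the conversion for packed YUYV 4:2:2 data using approximate full-range BT.601 coefficients, written as a CUDA kernel for concreteness; the same formula applies verbatim in a GLSL fragment shader. Names and the packed layout are assumptions, not taken from the linked sample.

    __device__ uchar3 yuv2rgb(float y, float u, float v)
    {
        float r = y + 1.402f * (v - 128.0f);
        float g = y - 0.344f * (u - 128.0f) - 0.714f * (v - 128.0f);
        float b = y + 1.772f * (u - 128.0f);
        return make_uchar3((unsigned char)fminf(fmaxf(r, 0.f), 255.f),
                           (unsigned char)fminf(fmaxf(g, 0.f), 255.f),
                           (unsigned char)fminf(fmaxf(b, 0.f), 255.f));
    }

    __global__ void yuyvToRgb(const uchar4* yuyv, uchar3* rgb, int numPixelPairs)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numPixelPairs) {
            uchar4 p = yuyv[i];                     // p = {Y0, U, Y1, V}: two pixels share U and V
            rgb[2 * i]     = yuv2rgb(p.x, p.y, p.w);
            rgb[2 * i + 1] = yuv2rgb(p.z, p.y, p.w);
        }
    }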

GPU gives no performance improvement in Julia set computation

北战南征 submitted on 2019-12-11 03:35:35
Question: I am trying to compare performance on the CPU and the GPU. I have CPU: Intel® Core™ i5 CPU M 480 @ 2.67GHz × 4 GPU: NVidia GeForce GT 420M I can confirm that the GPU is configured and works correctly with CUDA. I am implementing the Julia set computation: http://en.wikipedia.org/wiki/Julia_set Basically, for every pixel, if the coordinate is in the set it is painted red, otherwise white. Although I get identical answers with both CPU and GPU, instead of getting a performance improvement I get a
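
For reference, the Julia test maps naturally to one thread per pixel; the sketch below uses illustrative constants (the Julia parameter c, bounds, and iteration count are mine, not the poster's). Whether such a kernel beats the CPU depends heavily on whether the timing includes host-to-device transfers and on the image size.

    __global__ void juliaKernel(unsigned char* image, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float zr = 1.5f * (x - width  / 2) / (0.5f * width);   // map pixel to the complex plane
        float zi = 1.5f * (y - height / 2) / (0.5f * height);
        const float cr = -0.8f, ci = 0.156f;                   // example Julia constant

        int inSet = 1;
        for (int k = 0; k < 200; ++k) {
            float nzr = zr * zr - zi * zi + cr;
            float nzi = 2.0f * zr * zi + ci;
            zr = nzr; zi = nzi;
            if (zr * zr + zi * zi > 4.0f) { inSet = 0; break; } // escaped: not in the set
        }
        image[y * width + x] = inSet ? 255 : 0;   // red vs. white colouring is done on the host
    }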

OpenCL multiple command queues for concurrent NDRange kernel launch

筅森魡賤 submitted on 2019-12-11 03:13:12
Question: I'm trying to run a vector-addition application where I need to launch multiple kernels concurrently. For concurrent kernel launch, someone in my last question advised me to use multiple command queues, which I'm defining with an array: context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err); for(i=0;i<num_ker;++i) { queue[i] = clCreateCommandQueue(context, device_id, 0, &err); } I'm getting an error "command terminated by signal 11" somewhere around the above code. I'm using
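
A signal 11 at this point is frequently a host-side memory problem rather than an OpenCL one, for example writing into a queue array that was never allocated to num_ker elements. A hedged host-side sketch, with error checks and an explicit allocation (variable names follow the question; num_ker is assumed known):

    cl_int err;
    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err);
    if (err != CL_SUCCESS) { /* handle error */ }

    /* the array must be sized for num_ker queues before the loop writes into it */
    cl_command_queue *queue = (cl_command_queue*)malloc(num_ker * sizeof(cl_command_queue));
    if (queue == NULL) { /* handle allocation failure */ }

    for (int i = 0; i < num_ker; ++i) {
        queue[i] = clCreateCommandQueue(context, device_id, 0, &err);
        if (err != CL_SUCCESS) { /* handle error: invalid device or too many queues */ }
    }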

OpenGL ES 2.0 Vertex Shader Texture Reads not possible from FBO?

為{幸葍}努か submitted on 2019-12-11 02:59:51
Question: I'm currently working on a GPGPU project that uses OpenGL ES 2.0. I have a rendering pipeline that uses framebuffer objects (FBOs) as targets, i.e. the result of each rendering pass is saved in a texture which is attached to an FBO. So far, this works when using fragment shaders. For example, I have the following rendering pipeline: Preprocessing (downscaling, grayscale conversion) -> Adaptive Thresholding Pass 1 -> Adapt. Thresh. Pass 2 -> Copy back to CPU However, I wanted to extend this
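
One diagnostic worth running in this situation (offered as a general check, not the poster's code): OpenGL ES 2.0 makes vertex texture fetch optional, and a driver that reports zero vertex texture image units cannot sample any texture in a vertex shader, FBO-attached or not. The snippet assumes a current ES 2.0 context.

    GLint maxVertexTextureUnits = 0;
    glGetIntegerv(GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS, &maxVertexTextureUnits);
    printf("Vertex texture image units: %d\n", maxVertexTextureUnits);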

Limitations of work-item load in GPU? CUDA/OpenCL

只谈情不闲聊 submitted on 2019-12-11 02:57:08
Question: I have a compute-intensive image algorithm that, for each pixel, needs to read many distant pixels. The distance depends on a constant defined at compile time. My OpenCL algorithm performs well, but at a certain maximum distance - resulting in heavier for loops - the driver seems to bail out. The screen goes black for a couple of seconds and then the command queue never finishes. A balloon message reveals that the driver is unhappy: "Display driver AMD driver stopped responding and
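
This symptom usually comes from the operating system's display-driver watchdog killing any kernel that occupies the GPU for too long. The common workaround is to split one long-running NDRange into several shorter enqueues so each launch finishes well inside the watchdog window. A hedged host-side sketch; kernel, queue, width and height are placeholders, and the global work offset requires OpenCL 1.1.

    size_t rowsPerChunk = 64;                       /* tune so one chunk stays well under the watchdog limit */
    for (size_t firstRow = 0; firstRow < height; firstRow += rowsPerChunk) {
        size_t offset[2] = {0, firstRow};
        size_t chunk[2]  = {width, (firstRow + rowsPerChunk <= height) ? rowsPerChunk
                                                                       : height - firstRow};
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, offset, chunk,
                                            NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS) { /* handle error */ }
        clFinish(queue);                            /* let the display driver breathe between chunks */
    }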

OpenCL producing incorrect calculations

て烟熏妆下的殇ゞ submitted on 2019-12-11 02:54:30
Question: I've been trying to use OpenCL to do some calculations, but the results are incorrect. I input three float3's that look like this: [300000,0,0] [300000,300000,0] [300000,300000,300000] into this kernel: __kernel void gravitate(__global const float3 *position,__global const float3 *momentum,__global const float3 *mass,__global float3 *newPosition,__global float3 *newMomentum,unsigned int numBodies,unsigned int seconds) { int gid=get_global_id(0); newPosition[gid]=position[gid]*2; newMomentum
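
A frequent cause of wrong numbers with float3 kernels is a host/device layout mismatch: in OpenCL C, float3 occupies 16 bytes (the same as float4), so the host buffer must use cl_float3/cl_float4 elements rather than three tightly packed floats. A hedged host-side illustration with placeholder names:

    cl_float3 position[3];                           /* each cl_float3 element is 16 bytes */
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 4; ++j)
            position[i].s[j] = 0.0f;                 /* zero all lanes, including padding */
    position[0].s[0] = 300000.f;
    position[1].s[0] = 300000.f; position[1].s[1] = 300000.f;
    position[2].s[0] = 300000.f; position[2].s[1] = 300000.f; position[2].s[2] = 300000.f;

    cl_mem posBuf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(position), position, &err);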