gpgpu

Coding a CUDA Kernel that has many threads writing to the same index?

大憨熊 submitted on 2019-12-11 05:17:06
Question: I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron. So here is the kernel code, and I'll try to explain it a bit more clearly with the variables. __global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength) { int nx = threadIdx.x + TILE_WIDTH*threadIdx.y; int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
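
When many threads accumulate into the same output slot, plain additions race and updates get lost; the usual fix is an atomic accumulation or a per-block reduction. Below is a minimal sketch of the atomic approach, not the poster's full kernel: only the array names come from the question, and the assumption that sourceTargetArray[idx].y holds the target neuron index is mine.

    // Minimal sketch: accumulate each connection's weight into its target
    // neuron's slot with atomicAdd so concurrent writers do not clobber
    // each other. Indexing scheme is an assumption, not the poster's code.
    __global__ void kernelSumWeightsAtomic(float* sumArray, const float* weightArray,
                                           const int2* sourceTargetArray, int cLength)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < cLength) {
            int targetNeuron = sourceTargetArray[idx].y;  // .y assumed to be the target index
            atomicAdd(&sumArray[targetNeuron], weightArray[idx]);
        }
    }

Float atomicAdd on global memory requires compute capability 2.0 or newer; if one neuron receives very many inputs, a shared-memory reduction per block followed by one atomic per block is usually faster.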

terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)9>'

ε祈祈猫儿з submitted on 2019-12-11 05:06:19
Question: I am a newbie in SYCL/OpenCL/GPGPU. I am trying to build and run the sample code of a constant-addition program: #include <iostream> #include <array> #include <algorithm> #include <CL/sycl.hpp> namespace sycl = cl::sycl; //<<Define ConstantAdder>> template<typename T, typename Acc, size_t N> class ConstantAdder { public: ConstantAdder(Acc accessor, T val) : accessor(accessor) , val(val) {} void operator() () { for (size_t i = 0; i < N; i++) { accessor[i] += val; } } private: Acc accessor; const T
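
For reference, here is a minimal, self-contained variant of the same constant-addition pattern in SYCL 1.2.1 (cl::sycl) naming, as used in the question. Selecting the device explicitly is my addition; it often turns a cryptic runtime exception into a clearer failure when no OpenCL device or runtime is available.

    #include <CL/sycl.hpp>
    #include <array>
    #include <iostream>
    namespace sycl = cl::sycl;

    int main() {
        std::array<int, 4> data{1, 2, 3, 4};
        {
            sycl::default_selector selector;   // assumption: swap in a different selector if no device is found
            sycl::queue q(selector);
            sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(data.size()));
            q.submit([&](sycl::handler& cgh) {
                auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
                cgh.single_task<class constant_adder>([=]() {
                    for (size_t i = 0; i < 4; i++) acc[i] += 5;   // add the constant 5
                });
            });
        }   // buffer goes out of scope here and copies results back to 'data'
        for (int v : data) std::cout << v << " ";
        std::cout << std::endl;
    }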

Dynamic programming in CUDA: global memory allocations to exchange data with child kernels

萝らか妹 submitted on 2019-12-11 04:43:39
Question: I have the following code: __global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M) { int i = threadIdx.x + blockDim.x * blockIdx.x; [...] double phi_cap1, phi_cap2; if(i<M) { for(int m=0; m<(2*K+1); m++) { [calculate phi_cap1]; for(int n=0; n<(2*K+1); n++) { [calculate phi_cap2]; [calculate phi_cap=phi_cap1*phi_cap2]; [use phi_cap]; } } } } I would like to use
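
For exchanging data between a parent kernel and child kernels under CUDA dynamic parallelism, the common pattern is a global-memory allocation made on the device and passed by pointer to the child launch. The sketch below is a hedged illustration of that pattern, not the poster's interpolation code; it needs sm_35 or newer and compilation with -rdc=true.

    // Parent allocates scratch space from the device heap, hands the pointer
    // to a child kernel launched from the GPU, and frees it afterwards.
    __global__ void child(double* scratch, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) scratch[i] *= 2.0;            // stand-in for the real phi_cap work
    }

    __global__ void parent(int n)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            double* scratch = (double*)malloc(n * sizeof(double));  // global-memory heap allocation
            if (scratch != NULL) {
                for (int i = 0; i < n; ++i) scratch[i] = (double)i; // fill with example data
                child<<<(n + 255) / 256, 256>>>(scratch, n);
                cudaDeviceSynchronize();          // wait for the child before reusing the buffer
                free(scratch);
            }
        }
    }

The device heap used by malloc() is small by default and can be enlarged from the host with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).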

Reduce multiple blocks of equal length that are arranged in a big vector using CUDA

点点圈 submitted on 2019-12-11 04:25:34
Question: I am looking for a fast way to reduce multiple blocks of equal length that are arranged as one big vector. I have N subarrays (contiguous elements) arranged in one big array. Each subarray has a fixed size k, so the size of the whole array is N*k. What I'm doing is to call the kernel N times; each time it computes the reduction of one subarray, as follows. I iterate over all the subarrays contained in the big vector: for(i=0;i<N;i++){ thrust::device_vector< float > Vec
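
The usual alternative to N separate reductions is a single launch with one block per subarray, each block reducing its k elements in shared memory. The sketch below illustrates that pattern under my own naming; it assumes blockDim.x is a power of two.

    __global__ void reducePerSegment(const float* in, float* out, int K)
    {
        extern __shared__ float sdata[];
        const float* segment = in + blockIdx.x * K;     // this block's subarray
        float sum = 0.0f;
        for (int i = threadIdx.x; i < K; i += blockDim.x)
            sum += segment[i];                          // strided pass over the segment
        sdata[threadIdx.x] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
            if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = sdata[0];
    }

    // Launch example: reducePerSegment<<<N, 256, 256 * sizeof(float)>>>(d_in, d_out, K);

Staying inside Thrust, thrust::reduce_by_key with a key per segment achieves the same result in one call.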

GPU YUV to RGB. Worth the effort?

江枫思渺然 submitted on 2019-12-11 03:47:47
Question: I have to convert several full PAL videos (720x576@25) from YUV 4:2:2 to RGB, in real time, and probably apply a custom resize to each. I have thought of using the GPU, as I have seen an example that does just this (except that it's 4:4:4, so the bpp is the same in source and destination) -- http://www.fourcc.org/source/YUV420P-OpenGL-GLSLang.c However, I don't have any experience with using GPUs and I'm not sure of what can be done. The example, as I understand it, just converts the video frame to
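
The per-pixel arithmetic itself is tiny, which is why it maps well to a shader or a GPU kernel. The sketch below shows the conversion for packed YUYV 4:2:2 data using approximate full-range BT.601 coefficients, written as a CUDA kernel for concreteness; the same formula applies verbatim in a GLSL fragment shader. Names and the packed layout are assumptions, not taken from the linked sample.

    __device__ uchar3 yuv2rgb(float y, float u, float v)
    {
        float r = y + 1.402f * (v - 128.0f);
        float g = y - 0.344f * (u - 128.0f) - 0.714f * (v - 128.0f);
        float b = y + 1.772f * (u - 128.0f);
        return make_uchar3((unsigned char)fminf(fmaxf(r, 0.f), 255.f),
                           (unsigned char)fminf(fmaxf(g, 0.f), 255.f),
                           (unsigned char)fminf(fmaxf(b, 0.f), 255.f));
    }

    __global__ void yuyvToRgb(const uchar4* yuyv, uchar3* rgb, int numPixelPairs)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numPixelPairs) {
            uchar4 p = yuyv[i];                     // p = {Y0, U, Y1, V}: two pixels share U and V
            rgb[2 * i]     = yuv2rgb(p.x, p.y, p.w);
            rgb[2 * i + 1] = yuv2rgb(p.z, p.y, p.w);
        }
    }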

GPU gives no performance improvement in Julia set computation

北战南征 submitted on 2019-12-11 03:35:35
Question: I am trying to compare performance on the CPU and the GPU. I have CPU: Intel® Core™ i5 CPU M 480 @ 2.67GHz × 4 GPU: NVidia GeForce GT 420M I can confirm that the GPU is configured and works correctly with CUDA. I am implementing the Julia set computation: http://en.wikipedia.org/wiki/Julia_set Basically, for every pixel, if the coordinate is in the set it is painted red, otherwise white. Although I get identical answers with both CPU and GPU, instead of getting a performance improvement I get a
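
For reference, the Julia test maps naturally to one thread per pixel; the sketch below uses illustrative constants (the Julia parameter c, bounds, and iteration count are mine, not the poster's). Whether such a kernel beats the CPU depends heavily on whether the timing includes host-to-device transfers and on the image size.

    __global__ void juliaKernel(unsigned char* image, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float zr = 1.5f * (x - width  / 2) / (0.5f * width);   // map pixel to the complex plane
        float zi = 1.5f * (y - height / 2) / (0.5f * height);
        const float cr = -0.8f, ci = 0.156f;                   // example Julia constant

        int inSet = 1;
        for (int k = 0; k < 200; ++k) {
            float nzr = zr * zr - zi * zi + cr;
            float nzi = 2.0f * zr * zi + ci;
            zr = nzr; zi = nzi;
            if (zr * zr + zi * zi > 4.0f) { inSet = 0; break; } // escaped: not in the set
        }
        image[y * width + x] = inSet ? 255 : 0;   // red vs. white colouring is done on the host
    }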

OpenCL multiple command queues for concurrent NDRange kernel launch

筅森魡賤 submitted on 2019-12-11 03:13:12
Question: I'm trying to run a vector-addition application where I need to launch multiple kernels concurrently. For concurrent kernel launch, someone in my last question advised me to use multiple command queues, which I'm defining with an array: context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err); for(i=0;i<num_ker;++i) { queue[i] = clCreateCommandQueue(context, device_id, 0, &err); } I'm getting an error "command terminated by signal 11" somewhere around the above code. I'm using
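
A signal 11 at this point is frequently a host-side memory problem rather than an OpenCL one, for example writing into a queue array that was never allocated to num_ker elements. A hedged host-side sketch, with error checks and an explicit allocation (variable names follow the question; num_ker is assumed known):

    cl_int err;
    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err);
    if (err != CL_SUCCESS) { /* handle error */ }

    /* the array must be sized for num_ker queues before the loop writes into it */
    cl_command_queue *queue = (cl_command_queue*)malloc(num_ker * sizeof(cl_command_queue));
    if (queue == NULL) { /* handle allocation failure */ }

    for (int i = 0; i < num_ker; ++i) {
        queue[i] = clCreateCommandQueue(context, device_id, 0, &err);
        if (err != CL_SUCCESS) { /* handle error: invalid device or too many queues */ }
    }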

OpenGL ES 2.0 Vertex Shader Texture Reads not possible from FBO?

為{幸葍}努か submitted on 2019-12-11 02:59:51
Question: I'm currently working on a GPGPU project that uses OpenGL ES 2.0. I have a rendering pipeline that uses framebuffer objects (FBOs) as targets, i.e. the result of each rendering pass is saved in a texture which is attached to an FBO. So far, this works when using fragment shaders. For example, I have the following rendering pipeline: Preprocessing (downscaling, grayscale conversion) -> Adaptive Thresholding Pass 1 -> Adapt. Thresh. Pass 2 -> Copy back to CPU However, I wanted to extend this
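
One diagnostic worth running in this situation (offered as a general check, not the poster's code): OpenGL ES 2.0 makes vertex texture fetch optional, and a driver that reports zero vertex texture image units cannot sample any texture in a vertex shader, FBO-attached or not. The snippet assumes a current ES 2.0 context.

    GLint maxVertexTextureUnits = 0;
    glGetIntegerv(GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS, &maxVertexTextureUnits);
    printf("Vertex texture image units: %d\n", maxVertexTextureUnits);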

Limitations of work-item load in GPU? CUDA/OpenCL

只谈情不闲聊 submitted on 2019-12-11 02:57:08
Question: I have a compute-intensive image algorithm that, for each pixel, needs to read many distant pixels. The distance depends on a constant defined at compile time. My OpenCL algorithm performs well, but at a certain maximum distance - resulting in heavier for loops - the driver seems to bail out. The screen goes black for a couple of seconds and then the command queue never finishes. A balloon message reveals that the driver is unhappy: "Display driver AMD driver stopped responding and
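
This symptom usually comes from the operating system's display-driver watchdog killing any kernel that occupies the GPU for too long. The common workaround is to split one long-running NDRange into several shorter enqueues so each launch finishes well inside the watchdog window. A hedged host-side sketch; kernel, queue, width and height are placeholders, and the global work offset requires OpenCL 1.1.

    size_t rowsPerChunk = 64;                       /* tune so one chunk stays well under the watchdog limit */
    for (size_t firstRow = 0; firstRow < height; firstRow += rowsPerChunk) {
        size_t offset[2] = {0, firstRow};
        size_t chunk[2]  = {width, (firstRow + rowsPerChunk <= height) ? rowsPerChunk
                                                                       : height - firstRow};
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, offset, chunk,
                                            NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS) { /* handle error */ }
        clFinish(queue);                            /* let the display driver breathe between chunks */
    }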

OpenCL producing incorrect calculations

て烟熏妆下的殇ゞ submitted on 2019-12-11 02:54:30
Question: I've been trying to use OpenCL to do some calculations, but the results are incorrect. I input three float3's that look like this: [300000,0,0] [300000,300000,0] [300000,300000,300000] into this kernel: __kernel void gravitate(__global const float3 *position,__global const float3 *momentum,__global const float3 *mass,__global float3 *newPosition,__global float3 *newMomentum,unsigned int numBodies,unsigned int seconds) { int gid=get_global_id(0); newPosition[gid]=position[gid]*2; newMomentum
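
A frequent cause of wrong numbers with float3 kernels is a host/device layout mismatch: in OpenCL C, float3 occupies 16 bytes (the same as float4), so the host buffer must use cl_float3/cl_float4 elements rather than three tightly packed floats. A hedged host-side illustration with placeholder names:

    cl_float3 position[3];                           /* each cl_float3 element is 16 bytes */
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 4; ++j)
            position[i].s[j] = 0.0f;                 /* zero all lanes, including padding */
    position[0].s[0] = 300000.f;
    position[1].s[0] = 300000.f; position[1].s[1] = 300000.f;
    position[2].s[0] = 300000.f; position[2].s[1] = 300000.f; position[2].s[2] = 300000.f;

    cl_mem posBuf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(position), position, &err);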