gpgpu

OpenCL: Store pointer to global memory in local memory?

Submitted by 狂风中的少年 on 2019-12-18 17:58:43
Question: Any solutions? Is that even possible?

```c
__global float *abc; // pointer to global memory, stored in private memory
```

I want abc to be stored in local memory instead of private memory.

Answer 1: I think this is clarified here, List 5.2:

```c
__global int global_data[128];  // 128 integers allocated in global memory
__local float *lf;              // pointer stored in private memory, pointing to a single-precision float in local memory
__global char * __local lgc[8]; // 8 pointers stored in local memory, each pointing to a char in global memory
```

Inter-block barrier on CUDA

Submitted by 故事扮演 on 2019-12-18 16:59:35
Question: I want to implement an inter-block barrier on CUDA, but I am encountering a serious problem. I cannot figure out why it does not work.

```cpp
#include <iostream>
#include <cstdlib>
#include <ctime>

#define SIZE 10000000
#define BLOCKS 100

using namespace std;

struct Barrier {
    int *count;

    __device__ void wait() {
        atomicSub(count, 1);
        while (*count)
            ;
    }

    Barrier() {
        int blocks = BLOCKS;
        cudaMalloc((void**) &count, sizeof(int));
        cudaMemcpy(count, &blocks, sizeof(int), cudaMemcpyHostToDevice);
    }

    ~Barrier() {
```

CUDA streams destruction and CudaDeviceReset

Submitted by 独自空忆成欢 on 2019-12-18 14:02:15
Question: I have implemented the following class using CUDA streams:

```cpp
class CudaStreams {
private:
    int            nStreams_;
    cudaStream_t*  streams_;
    cudaStream_t   active_stream_;

public:
    // default constructor
    CudaStreams() { }

    // streams initialization
    void InitStreams(const int nStreams = 1) {
        nStreams_ = nStreams;
        // allocate and initialize an array of stream handles
        streams_ = (cudaStream_t*) malloc(nStreams_ * sizeof(cudaStream_t));
        for (int i = 0; i < nStreams_; i++)
            CudaSafeCall(cudaStreamCreate(&(streams_[i])));
```

CUDA: What is the threads per multiprocessor and threads per block distinction? [duplicate]

Submitted by 安稳与你 on 2019-12-18 12:05:27
Question: This question already has answers here: CUDA: How many concurrent threads in total? (3 answers). Closed 4 years ago.

We have a workstation with two NVIDIA Quadro FX 5800 cards installed. Running the deviceQuery CUDA sample reveals that the maximum threads per multiprocessor (SM) is 1024, while the maximum threads per block is 512. Given that only one block can be executed on each SM at a time, why is the maximum threads per multiprocessor double the maximum threads per block? How do we utilise the other 512

CUDA atomic operation performance in different scenarios

Submitted by ▼魔方 西西 on 2019-12-18 11:12:28
Question: When I came across this question on SO, I was curious to know the answer, so I wrote the piece of code below to test atomic operation performance in different scenarios. The OS is Ubuntu 12.04 with CUDA 5.5 and the device is a GeForce GTX 780 (Kepler architecture). I compiled the code with the -O3 flag and for CC=3.5.

```cpp
#include <stdio.h>

static void HandleError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line
```

Multithreaded backpropagation

Submitted by 拥有回忆 on 2019-12-18 09:35:18
Question: I have written a backpropagation class in VB.NET (it works well) and I'm using it in a C# artificial intelligence project. But I have an AMD Phenom X3 at home and an Intel i5 at school, and my neural network is not multi-threaded. How do I convert that backpropagation class to a multithreaded algorithm? Or how do I use GPGPU programming in it? Or should I use a third-party library that has a multithreaded backpropagation neural network?

Answer 1: Jeff Heaton has recommended that you use resilient

CUDA: Thread ID assignment in 2D grid

Submitted by 柔情痞子 on 2019-12-18 09:27:43
Question: Let's suppose I have a kernel call with a 2D grid, like so:

```cpp
dim3 dimGrid(x, y); // not important what the actual values are
dim3 dimBlock(blockSize, blockSize);
myKernel<<<dimGrid, dimBlock>>>();
```

Now I've read that multidimensional grids are merely meant to ease programming: the underlying hardware will only ever use 1D linearly cached memory (unless you use texture memory, but that's not relevant here). My question is: in what order will the threads be assigned to the grid indices during

Processor Affinity in OpenCL

Submitted by 大兔子大兔子 on 2019-12-18 07:00:37
Question: Can we impose processor affinity in OpenCL? For example, thread #1 executes on processor #5, thread #2 executes on processor #6, thread #3 executes on processor #7, and so on? Thanks

Answer 1: You can't specify affinity at that low a level with OpenCL, as far as I know. But starting with OpenCL 1.2 you have some control over affinity by partitioning a device into subdevices using clCreateSubDevices (possibly with one processor in each subdevice, by using CL_DEVICE_PARTITION_BY_COUNTS, 1) and running separate kernel

Integer calculations on GPU

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-18 04:33:28
Question: For my work it's particularly interesting to do integer calculations, which obviously are not what GPUs were made for. My question is: do modern GPUs support efficient integer operations? I realize this should be easy to figure out for myself, but I find conflicting answers (for example, yes vs. no), so I thought it best to ask. Also, are there any libraries/techniques for arbitrary-precision integers on GPUs?

Answer 1: First, you need to consider the hardware you're using: GPU devices performance

OpenCL / AMD: Deep Learning [closed]

Submitted by こ雲淡風輕ζ on 2019-12-17 21:38:59
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Closed 11 months ago.

While googling and doing some research, I was not able to find any serious/popular framework/SDK for scientific GPGPU computing and OpenCL on AMD hardware. Is there any literature and/or software I missed? In particular, I am interested in deep learning. For all I know