gpgpu

Insight into the first argument mask in __shfl_sync()

Submitted by 这一生的挚爱 on 2019-12-24 23:19:15
Question: Here is the test code for broadcasting a variable:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void broadcast() {
        int lane_id = threadIdx.x & 0x1f;
        int value = 31 - lane_id;
        // let all lanes within the warp receive the value
        // whose lane ID is 2 less than that of the current lane
        int broadcasted_value = __shfl_up_sync(0xffffffff, value, 2);
        value = broadcasted_value;
        printf("thread %d final value = %d\n", threadIdx.x, value);
    }

    int main() {
        broadcast<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }
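As a side note on the first argument, here is a minimal sketch (an assumed example, not part of the question) of the usual reading of mask: it names the lanes that must participate in the __shfl_up_sync call, so when only part of the warp reaches the intrinsic, the mask has to match those lanes (or be obtained from __activemask()) rather than being 0xffffffff.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical example: only the lower half of the warp executes the
    // shuffle, so the mask names exactly lanes 0..15. Passing 0xffffffff here
    // would be wrong, because lanes 16..31 never reach the call.
    __global__ void half_warp_shuffle() {
        int lane_id = threadIdx.x & 0x1f;
        if (lane_id < 16) {
            unsigned mask = 0x0000ffffu;              // participating lanes 0..15
            int value = 31 - lane_id;
            int up = __shfl_up_sync(mask, value, 2);  // lanes 0 and 1 keep their own value
            printf("lane %d got %d\n", lane_id, up);
        }
    }

    int main() {
        half_warp_shuffle<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }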

Matrix multiplication on GPU. Memory bank conflicts and latency hiding

Submitted by 大憨熊 on 2019-12-24 14:25:16
Question: Edit: achievements over time are listed at the end of this question (~1 Tflops/s so far). I'm writing a math library for C# that uses OpenCL (GPU) from a C++ DLL, and I have already done some optimizations on single-precision square matrix-matrix multiplication (for learning purposes and for possible reuse in a neural-network program later). The kernel code below takes v1, a 1D array holding the rows of matrix1 (1024x1024), and v2, a 1D array holding the columns of matrix2 (a 1024x1024 transpose optimization), and puts the result
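The asker's kernel is cut off above. For reference, here is a sketch of the shared-memory tiling such a kernel is usually built around, written in CUDA rather than OpenCL and with the tile size, names, and row-major layout assumed rather than taken from the question:

    #include <cuda_runtime.h>

    #define TILE 16   // assumes N is a multiple of TILE (true for 1024)

    // C = A * B for square N x N matrices, launched with an (N/TILE, N/TILE)
    // grid of (TILE, TILE) blocks. Each thread accumulates one output element.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE + 1];   // +1 padding avoids bank conflicts
                                               // if the tile is ever read column-wise
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }

Each thread performs TILE multiply-adds per pair of global loads, and latency hiding then comes from keeping several such blocks resident per compute unit.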

CUDA compilation and Linking

Submitted by 强颜欢笑 on 2019-12-24 12:36:16
Question: I have host files (say h_A.cpp, etc.) that are compiled by the host compiler (g++), device files (say d_A.cu, etc.) to be compiled by the device compiler (nvcc), and host-device files, i.e. host functions, kernel calls, etc. (say h_d_A.cu), also to be compiled by the device compiler (nvcc).

Device-side compilation:

    nvcc -arch=sm_20 -dc d_A.cu -o d_A.o $(INCLUDES)   /* -dc since the file may call / have relocatable device functions */

Host-side compilation:

    g++ -c h_A.cpp -o h_A.o $(INCLUDES, FLAGS)

Device
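For context, here is a minimal sketch of the source arrangement that makes -dc (and a subsequent device-link step) necessary: a __device__ function defined in one .cu file and called from a kernel in another. The file names follow the question; the function contents are assumptions.

    // d_A.cu -- relocatable device code, compiled with "nvcc -dc"
    __device__ float scale(float x) {
        return 2.0f * x;
    }

    // h_d_A.cu -- kernel calling a device function defined in another
    // translation unit; this cross-file device call is what requires -dc
    // and a device-link pass before the final host link.
    __device__ float scale(float x);   // declaration only; definition is in d_A.cu

    __global__ void apply_scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = scale(data[i]);
    }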

Use GPU profiler (for example CodeXL) together with PyOpenCL

Submitted by ⅰ亾dé卋堺 on 2019-12-24 11:57:13
Question: I have a complex PyOpenCL app with a lot of buffer creation, kernel templating, and so on. I want to profile my app on the GPU to see what the bottleneck is in my case. Is it possible to use a GPU profiler, for example CodeXL, with a PyOpenCL app? P.S. I know about event profiling, but it isn't enough.

Answer 1: Yes, it is possible. Look here: http://devgurus.amd.com/message/1282742

Source: https://stackoverflow.com/questions/17573338/use-gpu-profiler-for-example-codexl-together-with-pyopencl

Metal - Namespace variable that is local to a thread?

Submitted by 与世无争的帅哥 on 2019-12-24 10:16:30
Question: I'm trying to create a pseudo-random number generator (PRNG) in Metal, akin to Thrust's RNG, where every time you call the RNG within a thread it produces a different random number given a particular seed, which in this case will be the thread_position_in_grid. I have it set up perfectly, and I get a nice uniformly random picture right now using the code I have. However, my code only works once per thread. I want to implement a next_rng() function that returns a new rng using the last
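The question is cut off above. The usual pattern it is reaching for is a per-thread generator whose state advances on every call; the sketch below shows that pattern in CUDA (the language used for new code in this digest) rather than Metal, with xorshift32 as an assumed example generator, not the asker's code.

    #include <cstdint>
    #include <cuda_runtime.h>

    // Per-thread stateful PRNG: each call to next() advances the state, so
    // successive calls in the same thread yield different numbers.
    struct Xorshift32 {
        uint32_t state;

        __device__ explicit Xorshift32(uint32_t seed) : state(seed | 1u) {}

        __device__ uint32_t next() {
            uint32_t x = state;
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            state = x;
            return x;
        }

        __device__ float next_float() {               // uniform in [0, 1)
            return next() * (1.0f / 4294967296.0f);
        }
    };

    __global__ void fill_random(float* out, int n) {  // out holds 2 * n floats
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        Xorshift32 rng(0x9E3779B9u ^ (uint32_t)tid);  // seed from the thread index
        out[2 * tid]     = rng.next_float();          // first draw
        out[2 * tid + 1] = rng.next_float();          // second draw differs
    }

In Metal the same struct would simply live in the thread address space and be seeded from thread_position_in_grid.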

What is the optimum OpenCL 2 kernel to sum floats?

Submitted by 空扰寡人 on 2019-12-24 03:32:36
Question: C++17 introduced a number of new algorithms to support parallel execution; in particular, std::reduce is a parallel version of std::accumulate that permits non-deterministic behaviour for operations that are not associative, such as floating-point addition. I want to implement a reduce algorithm using OpenCL 2. Intel have an example here which uses OpenCL 2 work-group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan
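The kernel itself is cut off above. As a point of comparison only (an assumed sketch, neither the asker's code nor Intel's), the closest CUDA analogue of OpenCL 2's work_group_reduce_add is a shuffle-based warp reduction combined through shared memory:

    #include <cuda_runtime.h>

    // Sum within a warp using shuffles; lane 0 ends up with the warp's total.
    __inline__ __device__ float warp_sum(float v) {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;
    }

    // One partial sum per block: warp-level sums, then the first warp reduces
    // the per-warp partials. Assumes blockDim.x is a multiple of 32.
    __global__ void block_sum(const float* input, float* output, int n) {
        __shared__ float partial[32];      // one slot per warp (blockDim.x <= 1024)
        int idx  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;
        int warp = threadIdx.x >> 5;

        float v = (idx < n) ? input[idx] : 0.0f;
        v = warp_sum(v);
        if (lane == 0) partial[warp] = v;
        __syncthreads();

        if (warp == 0) {
            int nwarps = (blockDim.x + 31) / 32;
            v = (lane < nwarps) ? partial[lane] : 0.0f;
            v = warp_sum(v);
            if (lane == 0) output[blockIdx.x] = v;   // one sum per block
        }
    }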

Optimal workgroup size for sum reduction in OpenCL

Submitted by ╄→гoц情女王★ on 2019-12-24 00:38:25
Question: I am using the following kernel for sum reduction.

    __kernel void reduce(__global float* input, __global float* output, __local float* sdata)
    {
        // load shared mem
        unsigned int tid = get_local_id(0);
        unsigned int bid = get_group_id(0);
        unsigned int gid = get_global_id(0);
        unsigned int localSize = get_local_size(0);
        unsigned int stride = gid * 2;
        sdata[tid] = input[stride] + input[stride + 1];
        barrier(CLK_LOCAL_MEM_FENCE);

        // do reduction in shared mem
        for(unsigned int s = localSize >> 2; s > 0;
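The loop is cut off above. For reference, here is a sketch of the shared-memory tree reduction this kind of kernel follows, written in CUDA rather than OpenCL and with names assumed (launch it with blockDim.x * sizeof(float) bytes of dynamic shared memory and a power-of-two block size):

    #include <cuda_runtime.h>

    // Each block loads two elements per thread, reduces them in shared memory,
    // and writes one partial sum; input must hold 2 * gridDim.x * blockDim.x floats.
    __global__ void reduce_sum(const float* input, float* output) {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int stride = (blockIdx.x * blockDim.x + threadIdx.x) * 2;

        sdata[tid] = input[stride] + input[stride + 1];
        __syncthreads();

        // Tree reduction: halve the number of active threads each step.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            output[blockIdx.x] = sdata[0];   // one partial sum per block
    }

The work-group (block) size mostly trades occupancy against per-group overhead here, which is why powers of two between 128 and 256 are a common starting point before measuring.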

Lasagne dropoutlayer does not utilize GPU efficiently

Submitted by 寵の児 on 2019-12-24 00:37:13
Question: I am using Theano and Lasagne for a DNN speech-enhancement project. I use a feed-forward network very similar to the MNIST example in the Lasagne documentation (/github.com/Lasagne/Lasagne/blob/master/examples/mnist.py). This network uses several dropout layers. I train my network on an Nvidia Titan X GPU. However, when I do not use dropout my GPU utilization is approximately 60% and one epoch takes around 60 s, but when I use dropout my GPU utilization drops to 8% and each epoch takes

Simple CUDA program execution without GPU hardware using NVIDIA GPU Computing SDK 4.0 and Microsoft VC++ 2010 Express

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-23 19:18:24
Question: I am new to GPU computing, but I've read somewhere that it is possible to execute a CUDA program without a GPU card by using a simulator/emulator. I have installed NVIDIA's GPU Computing SDK 4.0 and Visual C++ 2010 Express on Windows Vista. I would like to know: whether it is feasible to run CUDA code without a GPU, using NVIDIA's Computing SDK 4.0 and Visual C++ 2010 Express, and why I get the following error when I try to execute a sample program: ------ Build started: Project:

Kernel for processing a 4D tensor in CUDA

Submitted by  ̄綄美尐妖づ on 2019-12-23 04:46:01
Question: I want to write a kernel to perform computations that depend on all the unique quartets of indices (ij|kl). The code that generates all the unique quartets on the host is as follows:

    #include <cstdio>
    #include <iostream>
    using namespace std;

    int main(int argc, char** argv)
    {
        unsigned int i, j, k, l, ltop;
        unsigned int nao = 7;
        for (i = 0; i < nao; i++) {
            for (j = 0; j <= i; j++) {
                for (k = 0; k <= i; k++) {
                    ltop = k;
                    if (i == k) {
                        ltop = j;
                    }
                    for (l = 0; l <= ltop; l++) {
                        printf("computing the ERI (%d,%d|%d,%d) \n", i, j, k, l);
                    }
                }
            }
        }
        int m = nao*(nao
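The host code is cut off above. One straightforward, unoptimized way to map it onto the GPU (an assumed sketch, not the asker's kernel) is to launch one thread per point of the full nao^4 index space and let threads whose (i,j,k,l) violate the loop constraints exit early, which reproduces exactly the quartets the host loops print:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void eri_quartets(unsigned int nao) {
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int total = nao * nao * nao * nao;
        if (idx >= total) return;

        // Unflatten idx into (i,j,k,l).
        unsigned int l = idx % nao;
        unsigned int k = (idx / nao) % nao;
        unsigned int j = (idx / (nao * nao)) % nao;
        unsigned int i = idx / (nao * nao * nao);

        unsigned int ltop = (i == k) ? j : k;      // same rule as the host loops
        if (j > i || k > i || l > ltop) return;    // not a unique quartet

        printf("computing the ERI (%u,%u|%u,%u)\n", i, j, k, l);
    }

    int main() {
        unsigned int nao = 7;
        unsigned int total = nao * nao * nao * nao;
        eri_quartets<<<(total + 255) / 256, 256>>>(nao);
        cudaDeviceSynchronize();
        return 0;
    }

For large nao most threads exit immediately, so a compacted index list or a closed-form mapping from a linear index to a unique quartet would be the natural next step.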