gpgpu

Insight into the first argument mask in __shfl_sync()

Submitted by 这一生的挚爱 on 2019-12-24 23:19:15
Question: Here is the test code for broadcasting a variable:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void broadcast() {
        int lane_id = threadIdx.x & 0x1f;
        int value = 31 - lane_id;
        // let all lanes within the warp receive the value
        // whose lane ID is 2 less than that of the current lane
        int broadcasted_value = __shfl_up_sync(0xffffffff, value, 2);
        value = broadcasted_value;
        printf("thread %d final value = %d\n", threadIdx.x, value);
    }

    int main() {
        broadcast<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }
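As a side note on the first argument, here is a minimal sketch (an assumed example, not part of the question) of the usual reading of mask: it names the lanes that must participate in the __shfl_up_sync call, so when only part of the warp reaches the intrinsic, the mask has to match those lanes (or be obtained from __activemask()) rather than being 0xffffffff.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical example: only the lower half of the warp executes the
    // shuffle, so the mask names exactly lanes 0..15. Passing 0xffffffff here
    // would be wrong, because lanes 16..31 never reach the call.
    __global__ void half_warp_shuffle() {
        int lane_id = threadIdx.x & 0x1f;
        if (lane_id < 16) {
            unsigned mask = 0x0000ffffu;              // participating lanes 0..15
            int value = 31 - lane_id;
            int up = __shfl_up_sync(mask, value, 2);  // lanes 0 and 1 keep their own value
            printf("lane %d got %d\n", lane_id, up);
        }
    }

    int main() {
        half_warp_shuffle<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }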

Matrix multiplication on GPU. Memory bank conflicts and latency hiding

Submitted by 大憨熊 on 2019-12-24 14:25:16
Question: Edit: achievements over time are listed at the end of this question (~1 Tflops/s so far). I'm writing a math library for C# that uses OpenCL (GPU) from a C++ DLL, and I have already done some optimizations on single-precision square matrix-matrix multiplication (for learning purposes and for possible reuse in a neural-network program later). The kernel code below takes v1, a 1D array holding the rows of matrix1 (1024x1024), and v2, a 1D array holding the columns of matrix2 (a 1024x1024 transpose optimization), and puts the result
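The asker's kernel is cut off above. For reference, here is a sketch of the shared-memory tiling such a kernel is usually built around, written in CUDA rather than OpenCL and with the tile size, names, and row-major layout assumed rather than taken from the question:

    #include <cuda_runtime.h>

    #define TILE 16   // assumes N is a multiple of TILE (true for 1024)

    // C = A * B for square N x N matrices, launched with an (N/TILE, N/TILE)
    // grid of (TILE, TILE) blocks. Each thread accumulates one output element.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE + 1];   // +1 padding avoids bank conflicts
                                               // if the tile is ever read column-wise
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }

Each thread performs TILE multiply-adds per pair of global loads, and latency hiding then comes from keeping several such blocks resident per compute unit.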

CUDA compilation and Linking

Submitted by 强颜欢笑 on 2019-12-24 12:36:16
Question: I have host files (say h_A.cpp, etc.) that are compiled by the host compiler (g++), device files (say d_A.cu, etc.) to be compiled by the device compiler (nvcc), and host-device files, i.e. host functions, kernel calls, etc. (say h_d_A.cu), also to be compiled by the device compiler (nvcc).

Device-side compilation:

    nvcc -arch=sm_20 -dc d_A.cu -o d_A.o $(INCLUDES)   /* -dc since the file may call / have relocatable device functions */

Host-side compilation:

    g++ -c h_A.cpp -o h_A.o $(INCLUDES, FLAGS)

Device
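For context, here is a minimal sketch of the source arrangement that makes -dc (and a subsequent device-link step) necessary: a __device__ function defined in one .cu file and called from a kernel in another. The file names follow the question; the function contents are assumptions.

    // d_A.cu -- relocatable device code, compiled with "nvcc -dc"
    __device__ float scale(float x) {
        return 2.0f * x;
    }

    // h_d_A.cu -- kernel calling a device function defined in another
    // translation unit; this cross-file device call is what requires -dc
    // and a device-link pass before the final host link.
    __device__ float scale(float x);   // declaration only; definition is in d_A.cu

    __global__ void apply_scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = scale(data[i]);
    }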

Use GPU profiler (for example CodeXL) together with PyOpenCL

Submitted by ⅰ亾dé卋堺 on 2019-12-24 11:57:13
Question: I have a complex PyOpenCL app with a lot of buffer creation, kernel templating, and so on. I want to profile my app on the GPU to see what the bottleneck is in my case. Is it possible to use a GPU profiler, for example CodeXL, with a PyOpenCL app? P.S. I know about event profiling, but it isn't enough.

Answer 1: Yes, it is possible. Look here: http://devgurus.amd.com/message/1282742

Source: https://stackoverflow.com/questions/17573338/use-gpu-profiler-for-example-codexl-together-with-pyopencl

Metal - Namespace variable that is local to a thread?

Submitted by 与世无争的帅哥 on 2019-12-24 10:16:30
Question: I'm trying to create a pseudo-random number generator (PRNG) in Metal, akin to Thrust's RNG, where every time you call the RNG within a thread it produces a different random number given a particular seed, which in this case will be the thread_position_in_grid. I have it set up perfectly, and I get a nice uniformly random picture right now using the code I have. However, my code only works once per thread. I want to implement a next_rng() function that returns a new rng using the last
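The question is cut off above. The usual pattern it is reaching for is a per-thread generator whose state advances on every call; the sketch below shows that pattern in CUDA (the language used for new code in this digest) rather than Metal, with xorshift32 as an assumed example generator, not the asker's code.

    #include <cstdint>
    #include <cuda_runtime.h>

    // Per-thread stateful PRNG: each call to next() advances the state, so
    // successive calls in the same thread yield different numbers.
    struct Xorshift32 {
        uint32_t state;

        __device__ explicit Xorshift32(uint32_t seed) : state(seed | 1u) {}

        __device__ uint32_t next() {
            uint32_t x = state;
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            state = x;
            return x;
        }

        __device__ float next_float() {               // uniform in [0, 1)
            return next() * (1.0f / 4294967296.0f);
        }
    };

    __global__ void fill_random(float* out, int n) {  // out holds 2 * n floats
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        Xorshift32 rng(0x9E3779B9u ^ (uint32_t)tid);  // seed from the thread index
        out[2 * tid]     = rng.next_float();          // first draw
        out[2 * tid + 1] = rng.next_float();          // second draw differs
    }

In Metal the same struct would simply live in the thread address space and be seeded from thread_position_in_grid.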

What is the optimum OpenCL 2 kernel to sum floats?

Submitted by 空扰寡人 on 2019-12-24 03:32:36
Question: C++17 introduced a number of new algorithms to support parallel execution; in particular, std::reduce is a parallel version of std::accumulate that permits non-deterministic behaviour for operations that are not associative, such as floating-point addition. I want to implement a reduce algorithm using OpenCL 2. Intel have an example here which uses OpenCL 2 work-group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan
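The kernel itself is cut off above. As a point of comparison only (an assumed sketch, neither the asker's code nor Intel's), the closest CUDA analogue of OpenCL 2's work_group_reduce_add is a shuffle-based warp reduction combined through shared memory:

    #include <cuda_runtime.h>

    // Sum within a warp using shuffles; lane 0 ends up with the warp's total.
    __inline__ __device__ float warp_sum(float v) {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;
    }

    // One partial sum per block: warp-level sums, then the first warp reduces
    // the per-warp partials. Assumes blockDim.x is a multiple of 32.
    __global__ void block_sum(const float* input, float* output, int n) {
        __shared__ float partial[32];      // one slot per warp (blockDim.x <= 1024)
        int idx  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;
        int warp = threadIdx.x >> 5;

        float v = (idx < n) ? input[idx] : 0.0f;
        v = warp_sum(v);
        if (lane == 0) partial[warp] = v;
        __syncthreads();

        if (warp == 0) {
            int nwarps = (blockDim.x + 31) / 32;
            v = (lane < nwarps) ? partial[lane] : 0.0f;
            v = warp_sum(v);
            if (lane == 0) output[blockIdx.x] = v;   // one sum per block
        }
    }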

Optimal workgroup size for sum reduction in OpenCL

Submitted by ╄→гoц情女王★ on 2019-12-24 00:38:25
Question: I am using the following kernel for sum reduction.

    __kernel void reduce(__global float* input, __global float* output, __local float* sdata)
    {
        // load shared mem
        unsigned int tid = get_local_id(0);
        unsigned int bid = get_group_id(0);
        unsigned int gid = get_global_id(0);
        unsigned int localSize = get_local_size(0);
        unsigned int stride = gid * 2;
        sdata[tid] = input[stride] + input[stride + 1];
        barrier(CLK_LOCAL_MEM_FENCE);

        // do reduction in shared mem
        for(unsigned int s = localSize >> 2; s > 0;
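The loop is cut off above. For reference, here is a sketch of the shared-memory tree reduction this kind of kernel follows, written in CUDA rather than OpenCL and with names assumed (launch it with blockDim.x * sizeof(float) bytes of dynamic shared memory and a power-of-two block size):

    #include <cuda_runtime.h>

    // Each block loads two elements per thread, reduces them in shared memory,
    // and writes one partial sum; input must hold 2 * gridDim.x * blockDim.x floats.
    __global__ void reduce_sum(const float* input, float* output) {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int stride = (blockIdx.x * blockDim.x + threadIdx.x) * 2;

        sdata[tid] = input[stride] + input[stride + 1];
        __syncthreads();

        // Tree reduction: halve the number of active threads each step.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            output[blockIdx.x] = sdata[0];   // one partial sum per block
    }

The work-group (block) size mostly trades occupancy against per-group overhead here, which is why powers of two between 128 and 256 are a common starting point before measuring.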

Lasagne dropoutlayer does not utilize GPU efficiently

Submitted by 寵の児 on 2019-12-24 00:37:13
Question: I am using Theano and Lasagne for a DNN speech-enhancement project. I use a feed-forward network very similar to the MNIST example in the Lasagne documentation (/github.com/Lasagne/Lasagne/blob/master/examples/mnist.py). This network uses several dropout layers. I train my network on an Nvidia Titan X GPU. However, when I do not use dropout my GPU utilization is approximately 60% and one epoch takes around 60 s, but when I use dropout my GPU utilization drops to 8% and each epoch takes

Simple CUDA program execution without GPU hardware using NVIDIA GPU Computing SDK 4.0 and Microsoft VC++ 2010 Express

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-23 19:18:24
Question: I am new to GPU computing, but I've read somewhere that it is possible to execute a CUDA program without a GPU card by using a simulator/emulator. I have installed NVIDIA's GPU Computing SDK 4.0 and Visual C++ 2010 Express on Windows Vista. I would like to know: whether it is feasible to run CUDA code without a GPU, using NVIDIA's Computing SDK 4.0 and Visual C++ 2010 Express, and why I get the following error when I try to execute a sample program: ------ Build started: Project:

Kernel for processing a 4D tensor in CUDA

Submitted by  ̄綄美尐妖づ on 2019-12-23 04:46:01
Question: I want to write a kernel to perform computations that depend on all the unique quartets of indices (ij|kl). The code that generates all the unique quartets on the host is as follows:

    #include <cstdio>
    #include <iostream>
    using namespace std;

    int main(int argc, char** argv)
    {
        unsigned int i, j, k, l, ltop;
        unsigned int nao = 7;
        for (i = 0; i < nao; i++) {
            for (j = 0; j <= i; j++) {
                for (k = 0; k <= i; k++) {
                    ltop = k;
                    if (i == k) {
                        ltop = j;
                    }
                    for (l = 0; l <= ltop; l++) {
                        printf("computing the ERI (%d,%d|%d,%d) \n", i, j, k, l);
                    }
                }
            }
        }
        int m = nao*(nao
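The host code is cut off above. One straightforward, unoptimized way to map it onto the GPU (an assumed sketch, not the asker's kernel) is to launch one thread per point of the full nao^4 index space and let threads whose (i,j,k,l) violate the loop constraints exit early, which reproduces exactly the quartets the host loops print:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void eri_quartets(unsigned int nao) {
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int total = nao * nao * nao * nao;
        if (idx >= total) return;

        // Unflatten idx into (i,j,k,l).
        unsigned int l = idx % nao;
        unsigned int k = (idx / nao) % nao;
        unsigned int j = (idx / (nao * nao)) % nao;
        unsigned int i = idx / (nao * nao * nao);

        unsigned int ltop = (i == k) ? j : k;      // same rule as the host loops
        if (j > i || k > i || l > ltop) return;    // not a unique quartet

        printf("computing the ERI (%u,%u|%u,%u)\n", i, j, k, l);
    }

    int main() {
        unsigned int nao = 7;
        unsigned int total = nao * nao * nao * nao;
        eri_quartets<<<(total + 255) / 256, 256>>>(nao);
        cudaDeviceSynchronize();
        return 0;
    }

For large nao most threads exit immediately, so a compacted index list or a closed-form mapping from a linear index to a unique quartet would be the natural next step.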