gpu | 易学教程

High global memory instruction overhead - no idea where it comes from

Question: I wrote a kernel that computes Euclidean distances between a given D-dimensional vector q (stored in constant memory) and an array pts of N vectors (also D-dimensional). The array is laid out in memory so that the first N elements are the first coordinates of all N vectors, followed by the N second coordinates, and so on. Here is the kernel:

    __constant__ float q[20];

    __global__ void compute_dists(float *pt, float *dst, int n, int d) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i <
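
The coordinate-major layout described in this excerpt can be illustrated on the host side. The following is only a minimal Python sketch of the indexing, not the asker's kernel (which is cut off above); the function name dist_coordinate_major and the sample data are hypothetical.

```python
import math

def dist_coordinate_major(pts, q, n, d, i):
    """Euclidean distance between q and vector i.

    Assumes the layout from the excerpt: coordinate j of vector i
    is stored at pts[j * n + i].
    """
    total = 0.0
    for j in range(d):
        diff = pts[j * n + i] - q[j]
        total += diff * diff
    return math.sqrt(total)

# Two 2-D vectors, (0, 0) and (3, 4): both first coordinates come first,
# then both second coordinates.
pts = [0.0, 3.0, 0.0, 4.0]
q = [0.0, 0.0]
d0 = dist_coordinate_major(pts, q, n=2, d=2, i=0)
d1 = dist_coordinate_major(pts, q, n=2, d=2, i=1)
```

With this layout, neighbouring values of i read neighbouring addresses for each coordinate j, which is the access pattern that usually coalesces well when i is the thread index.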

Dependencies on cutil when using CUDA 5.0

Question: When I run the make command to compile a CUDA program under 64-bit Linux, I receive the following error message: error: cutil.h: No such file or directory. I found some answers, but none of them were useful. The makefile has a CUDA_SDK_PATH variable, but I cannot find anything useful about the SDK in the CUDA Getting Started Guide: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html How should I set CUDA_SDK_PATH? Answer 1: If you are planning on using CUDA 5 or later,

How do I know which GPU would be sufficient for a problem? [closed]

Question: I'm using https://github.com/tensorflow/models/blob/master/research/object_detection for object detection, and I'm finding that inference with the vanilla Faster R-CNN algorithm on my computer is far too slow (~15 s to process one image). I don't have much experience with building real-world

Many OpenCL SDKs. Which of them should I choose?

Question: On my Windows 7 computer I have three OpenCL SDKs, from these vendors: Intel, NVIDIA, and AMD. I build my application with each of them, which gives me three different binaries, for example: my_app_intel_x86, my_app_amd_x86, my_app_nvidia_x86. These binaries differ in two ways: they use different SDKs in the linking process, and they look for different OpenCL platform names at runtime. Can I use only one SDK and check the platform at runtime? Answer 1: SDKs provide debugging tools, a

How does a GPU group threads into warps/wavefronts?

Question: My understanding is that a warp is a group of threads defined at runtime by the task scheduler. One performance-critical aspect of CUDA is the divergence of threads within a warp. Is there a way to make a good guess at how the hardware will construct warps within a thread block? For instance, if I launch a kernel with 1024 threads in a thread block, how are the warps arranged, and can I tell that (or at least make a good guess) from the thread index? Since by doing this, one can minimize
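
As background for this question: the CUDA programming guide describes threads in a block as being linearized with x varying fastest, and warps as being formed from consecutive linear thread IDs, starting with thread 0. The sketch below computes that mapping on the host in Python. The function names are hypothetical, and the warp size of 32 is an assumption that holds for NVIDIA GPUs to date; device code should read warpSize instead.

```python
WARP_SIZE = 32  # assumption: true on NVIDIA GPUs to date; use warpSize in device code

def linear_thread_id(tx, ty, tz, dim_x, dim_y):
    """Linear ID of thread (tx, ty, tz) in a block whose x and y extents are dim_x, dim_y."""
    return tx + ty * dim_x + tz * dim_x * dim_y

def warp_and_lane(tx, ty, tz, dim_x, dim_y):
    """Return (warp index within the block, lane index within the warp)."""
    tid = linear_thread_id(tx, ty, tz, dim_x, dim_y)
    return tid // WARP_SIZE, tid % WARP_SIZE

# A 1-D block of 1024 threads: threads 0..31 form warp 0, 32..63 warp 1, and so on.
first = warp_and_lane(0, 0, 0, 1024, 1)
last = warp_and_lane(1023, 0, 0, 1024, 1)
```

Under this mapping, a branch whose condition depends only on the warp index (for example tid // 32) does not diverge within a warp, whereas one that depends on tid % 2 does.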

Find a GPU with enough memory

Question: I want to programmatically find the available GPUs and their current memory usage, and then use one of the GPUs based on its memory availability. I want to do this in PyTorch. I have seen the following solution in this post:

    import torch.cuda as cutorch

    for i in range(cutorch.device_count()):
        if cutorch.getMemoryUsage(i) > MEM:
            opts.gpuID = i
            break

but it does not work in PyTorch 0.3.1 (there is no function called getMemoryUsage). I am interested in a PyTorch based (using the library
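
A minimal sketch of the selection logic, under stated assumptions: recent PyTorch releases expose torch.cuda.mem_get_info(device), which returns (free_bytes, total_bytes), but it is not available in 0.3.1, so this does not answer the question for that version. The helper names pick_device and query_free_bytes are hypothetical, and the sketch selects on free memory rather than on usage.

```python
def pick_device(free_bytes, required_bytes):
    """Return the index of the first device with at least required_bytes free, or None."""
    for idx, free in enumerate(free_bytes):
        if free >= required_bytes:
            return idx
    return None

def query_free_bytes():
    """Free memory in bytes for each visible CUDA device.

    Returns an empty list when PyTorch or CUDA is unavailable.
    Assumes a PyTorch version that provides torch.cuda.mem_get_info.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]

# Example with made-up free-memory figures (bytes) for three devices.
chosen = pick_device([100, 500, 900], required_bytes=400)
```

Keeping the selection separate from the query makes the policy easy to change, for example to pick the device with the most free memory instead of the first one that fits.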

OpenGL, measuring rendering time on gpu

Question: I have some big performance issues here, so I would like to take some measurements on the GPU side. After reading this thread, I wrote this code around my draw functions, including the GL error check and swapBuffers() (automatic swapping is disabled):

    gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
    {
        draw(gl4);
        checkGlError(gl4);
        glad.swapBuffers();
    }
    gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
    gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);

And since OpenGL rendering

How to efficiently gather data from threads in CUDA?

Question: I have an application that solves a system of equations in CUDA. I know for sure that each thread can find up to 4 solutions, but how can I copy them back to the host? I'm passing a huge array with enough space for every thread to store 4 solutions (4 doubles for each solution), and another one with the number of solutions per thread. However, that's a naive solution, and it is the current bottleneck of my kernel. I would really like to optimize this. The main problem is concatenating a variable number of
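
The excerpt is cut off before any answer, so the following is not taken from the source. One common technique for concatenating a variable number of items per thread is an exclusive prefix sum over the per-thread counts, which gives each thread its write offset into a dense output. Below is a host-side Python sketch of that idea, simplified to one value per solution; the names compact, slots, and counts are hypothetical. On the GPU, the scan itself could be done with a library routine such as thrust::exclusive_scan.

```python
from itertools import accumulate

def compact(slots, counts, max_per_thread=4):
    """Pack each thread's valid entries into a dense list.

    slots  : flat list with max_per_thread entries reserved per thread
    counts : number of valid entries each thread actually produced
    """
    # Exclusive prefix sum: offsets[t] is where thread t starts writing.
    offsets = list(accumulate(counts, initial=0))[:-1]
    out = [None] * sum(counts)
    for t, (off, c) in enumerate(zip(offsets, counts)):
        base = t * max_per_thread
        for k in range(c):
            out[off + k] = slots[base + k]
    return offsets, out

# Four threads that found 1, 2, 0 and 3 solutions respectively.
slots = [1.0, 0, 0, 0,  2.0, 3.0, 0, 0,  0, 0, 0, 0,  4.0, 5.0, 6.0, 0]
counts = [1, 2, 0, 3]
offsets, dense = compact(slots, counts)
```

Only sum(counts) values then need to be copied back to the host, rather than the full reserved array.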

cuBLAS cublasSgemv "Segmentation fault"

Question: I get a segmentation fault when running cublasSgemv. My GPU is a K20Xm. Here is my code:

    float *a, *x, *y;
    int NUM_VEC = 8;
    y = (float*)malloc(sizeof(float) * rows * NUM_VEC);
    a = (float*)malloc(sizeof(float) * rows * cols);
    x = (float*)malloc(sizeof(float) * cols * NUM_VEC);
    get_mat_random(a, rows, cols);
    get_vec_random(x, cols * NUM_VEC);
    float *d_a = 0;
    float *d_x = 0;
    float *d_y = 0;
    cudaMalloc((void **)&d_a, rows * cols * sizeof(float));
    cudaMalloc((void **)&d_x, cols * NUM_VEC *