gpu

High global memory instruction overhead - no idea where it comes from

試著忘記壹切, submitted on 2020-01-05 07:56:11
问题 Question: I wrote a kernel that computes the Euclidean distances between a given D-dimensional vector q (stored in constant memory) and an array pts of N vectors (also D-dimensional). The array layout in memory is such that the first N elements are the first coordinates of all N vectors, then the N second coordinates, and so on. Here is the kernel (the excerpt is cut off): __constant__ float q[20]; __global__ void compute_dists(float *pt, float *dst, int n, int d) { for (int i = blockIdx.x * blockDim.x + threadIdx.x; i <
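The kernel body is cut off in the excerpt above. As a point of reference only, here is a minimal sketch of how such a coordinate-major (structure-of-arrays) distance kernel is typically written, assuming the grid-stride loop the visible prefix begins and a simple squared-distance accumulator (both assumptions, not the asker's actual code):

```cuda
#include <cuda_runtime.h>

__constant__ float q[20];   // query vector, up to 20 dimensions

// Squared Euclidean distance from q to each of the n points.
// pt is coordinate-major: pt[j * n + i] is coordinate j of point i,
// so consecutive threads read consecutive addresses (coalesced loads).
__global__ void compute_dists(float *pt, float *dst, int n, int d)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        float acc = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = pt[j * n + i] - q[j];
            acc += diff * diff;
        }
        dst[i] = sqrtf(acc);
    }
}
```

With this layout the global loads coalesce; diagnosing the instruction overhead the title mentions would still require the asker's real kernel and profiler output.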

Dependencies on cutil when using CUDA 5.0

流过昼夜, submitted on 2020-01-05 05:51:13
问题 Question: When I run the make command to compile a CUDA program under 64-bit Linux, I receive the following error message: error: cutil.h: No such file or directory. I found some answers, but none of them were useful. In the makefile there is a CUDA_SDK_PATH variable, but I cannot find anything useful about the SDK in the CUDA Getting Started Guide: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html How should I set CUDA_SDK_PATH? 回答1 Answer 1: If you are planning on using CUDA 5 or later,
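For context, cutil.h was a convenience header shipped with the pre-5.0 GPU Computing SDK samples, never part of the CUDA toolkit, and the CUDA 5.0 samples dropped it. One common fix (a sketch, not necessarily the answerer's exact recommendation) is to remove the dependency entirely and define your own error-checking macro in place of cutilSafeCall:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Drop-in replacement for the old cutilSafeCall() from cutil.h.
#define CUDA_SAFE_CALL(call)                                          \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main()
{
    float *d_buf = 0;
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_buf, 1024 * sizeof(float)));
    CUDA_SAFE_CALL(cudaFree(d_buf));
    return 0;
}
```

With the dependency gone, the CUDA_SDK_PATH variable in the makefile is no longer needed for this header.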

How do I know which GPU would be sufficient for a problem? [closed]

馋奶兔, submitted on 2020-01-05 04:07:10
问题 Question (closed as off-topic on Stack Overflow): I'm using https://github.com/tensorflow/models/blob/master/research/object_detection for object detection, and am finding that running the vanilla Faster R-CNN algorithm on my computer for inference is entirely too slow (~15 s to process one image). I don't have much experience with building real-world

Many OpenCL SDKs. Which of them should I choose?

青春壹個敷衍的年華, submitted on 2020-01-05 02:31:33
问题 Question: On my Windows 7 machine I have OpenCL SDKs from three vendors: Intel, NVIDIA, and AMD. I build my application with each of them, and as output I get three different binaries, for example: my_app_intel_x86, my_app_amd_x86, my_app_nvidia_x86. These binaries differ in two ways: they use different SDKs in the linking process, and they try to find different OpenCL platform names at runtime. Can I use only one SDK and check the platform at run time? 回答1 Answer 1: SDKs give debugging tools, a
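Platform selection can indeed be done at run time: the application links against the vendor-neutral OpenCL ICD loader and then enumerates whatever platforms are installed on the machine. A minimal sketch of that enumeration, using only core OpenCL 1.x host calls (no vendor-specific headers assumed):

```c
#include <stdio.h>
#include <CL/cl.h>

/* List every OpenCL platform visible at run time, regardless of which
 * vendor SDK the binary was built against. */
int main(void)
{
    cl_uint count = 0;
    clGetPlatformIDs(0, NULL, &count);

    cl_platform_id platforms[16];
    if (count > 16) count = 16;
    clGetPlatformIDs(count, platforms, NULL);

    for (cl_uint i = 0; i < count; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}
```

The same binary can then pick the Intel, NVIDIA, or AMD platform by matching the returned name, which is what the three separate builds were doing at link time.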

How does a GPU group threads into warps/wavefronts?

巧了我就是萌, submitted on 2020-01-04 09:38:27
问题 Question: My understanding is that a warp is a group of threads defined at runtime through the task scheduler. One performance-critical aspect of CUDA is divergence of threads within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block? For instance, if I start a kernel with 1024 threads in a thread block, how are the warps arranged, and can I tell that (or at least make a good guess) from the thread index? By doing this, one can minimize
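For reference, the CUDA programming model does document this mapping: thread indices within a block are linearized with x varying fastest, then y, then z, and consecutive groups of warpSize (32) linear IDs form a warp. A 1024-thread 1-D block therefore yields warps covering threads 0-31, 32-63, and so on. A small sketch of the usual helper functions (the names are mine, not from the question):

```cuda
__device__ int linear_thread_id()
{
    // x varies fastest, then y, then z.
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ int warp_id() { return linear_thread_id() / warpSize; }
__device__ int lane_id() { return linear_thread_id() % warpSize; }
```

Grouping threads that take the same branch into the same 32-thread window of linear IDs is the usual way to reduce divergence.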

Find a GPU with enough memory

≯℡__Kan透↙, submitted on 2020-01-04 05:12:38
问题 Question: I want to programmatically find out the available GPUs and their current memory usage, and use one of the GPUs based on its memory availability. I want to do this in PyTorch. I have seen the following solution in this post: import torch.cuda as cutorch for i in range(cutorch.device_count()): if cutorch.getMemoryUsage(i) > MEM: opts.gpuID = i break but it does not work in PyTorch 0.3.1 (there is no function called getMemoryUsage). I am interested in a PyTorch-based (using the library
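The question asks for a PyTorch-level answer, which the excerpt cuts off before reaching. For reference only, the underlying CUDA runtime query that such framework utilities generally wrap is cudaMemGetInfo; a C++ sketch of picking a device by free memory (the function name and selection policy here are mine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return the index of the first GPU with more than `required` bytes free,
// or -1 if none qualifies.
int pick_gpu(size_t required)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);                      // query is per current device
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        printf("GPU %d: %zu MiB free of %zu MiB\n",
               dev, free_bytes >> 20, total_bytes >> 20);
        if (free_bytes > required)
            return dev;
    }
    return -1;
}
```

Note that this reports memory as the driver sees it; a framework's caching allocator may hold memory it would reuse, so framework-level counters can differ from this number.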

OpenGL, measuring rendering time on gpu

旧巷老猫, submitted on 2020-01-03 16:51:50
问题 Question: I have some big performance issues here, so I would like to take some measurements on the GPU side. After reading this thread I wrote this code around my draw functions, including the GL error check and the swapBuffers() call (auto-swapping is indeed disabled): gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]); { draw(gl4); checkGlError(gl4); glad.swapBuffers(); } gl4.glEndQuery(GL4.GL_TIME_ELAPSED); gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0); And since OpenGL rendering
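One thing worth noting about the snippet above: reading GL_QUERY_RESULT right after glEndQuery forces the CPU to wait for the GPU to finish the frame, which itself distorts the measurement. A common pattern is to check GL_QUERY_RESULT_AVAILABLE, or to read the result a frame or two later. A C-style sketch using the raw GL calls that the JOGL bindings above mirror (it assumes a current GL 3.3+ context, an extension loader such as GLEW providing the query entry points, and drawScene as a placeholder for the application's own draw call):

```c
#include <GL/glew.h>   /* or any loader exposing the timer-query functions */

GLuint64 measure_gpu_time_ns(void (*drawScene)(void))
{
    GLuint query;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    drawScene();                          /* the work being timed */
    glEndQuery(GL_TIME_ELAPSED);

    /* Wait for the result; in real code, check availability next frame
     * instead of spinning, so the measurement does not stall the pipeline. */
    GLint available = 0;
    while (!available)
        glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);

    GLuint64 elapsed_ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns);
    glDeleteQueries(1, &query);
    return elapsed_ns;
}
```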

How to efficiently gather data from threads in CUDA?

半世苍凉, submitted on 2020-01-03 09:24:13
问题 Question: I have an application that solves a system of equations in CUDA, and I know for sure that each thread can find up to 4 solutions, but how can I copy them back to the host? I'm passing a huge array with enough space for all threads to store 4 solutions (4 doubles for each solution), and another array with the number of solutions per thread. However, that's a naive solution, and it is the current bottleneck of my kernel. I would really like to optimize this. The main problem is concatenating a variable number of
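One common pattern for gathering a variable number of per-thread results (offered as a sketch, not as the accepted answer's method) is to reserve contiguous slots in a single output array with a global atomic counter, so the host only has to copy back the populated prefix:

```cuda
#include <cuda_runtime.h>

// out must hold up to n_problems * 4 solutions of 4 doubles each;
// *out_count must be zero-initialized by the host before launch.
__global__ void solve_and_gather(double *out, unsigned int *out_count,
                                 int n_problems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_problems) return;

    // Placeholder "solver": pretend problem i yields (i % 5) solutions
    // of 4 doubles each; the real solver belongs to the asker's kernel.
    double sols[4][4];
    int nsols = i % 5;
    for (int k = 0; k < nsols; ++k)
        for (int c = 0; c < 4; ++c)
            sols[k][c] = (double)i;

    if (nsols > 0) {
        // Reserve nsols contiguous solution slots in the output array.
        unsigned int base = atomicAdd(out_count, (unsigned int)nsols);
        for (int k = 0; k < nsols; ++k)
            for (int c = 0; c < 4; ++c)
                out[(base + k) * 4 + c] = sols[k][c];
    }
}
```

On recent hardware global atomics are usually fast enough for this; if contention becomes an issue, aggregating the atomicAdd per warp or per block, or doing a separate prefix-sum/stream-compaction pass, is the usual refinement.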

cuBLAS cublasSgemv "Segmentation fault"

若如初见., submitted on 2020-01-03 04:57:12
问题 Question: I get a segmentation fault when running cublasSgemv. My GPU is a K20Xm. Here is my code: float *a, *x, *y; int NUM_VEC = 8; y = (float*)malloc(sizeof(float) * rows * NUM_VEC); a = (float*)malloc(sizeof(float) * rows * cols); x = (float*)malloc(sizeof(float) * cols * NUM_VEC); get_mat_random(a, rows, cols); get_vec_random(x, cols * NUM_VEC); float *d_a = 0; float *d_x = 0; float *d_y = 0; cudaMalloc((void **)&d_a, rows * cols * sizeof(float)); cudaMalloc((void **)&d_x, cols * NUM_VEC *
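The excerpt cuts off before the cuBLAS call itself. For reference, a minimal, self-contained cublasSgemv sequence (single vector, column-major storage, host alpha/beta under the default pointer mode) looks like the sketch below; the matrix sizes and initialization are placeholders of mine, not the asker's, and the NUM_VEC batching is omitted:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int rows = 1024, cols = 512;
    float *a = (float *)malloc(sizeof(float) * rows * cols);
    float *x = (float *)malloc(sizeof(float) * cols);
    float *y = (float *)malloc(sizeof(float) * rows);
    for (int i = 0; i < rows * cols; ++i) a[i] = 1.0f;
    for (int i = 0; i < cols; ++i) x[i] = 1.0f;

    float *d_a = 0, *d_x = 0, *d_y = 0;
    cudaMalloc((void **)&d_a, rows * cols * sizeof(float));
    cudaMalloc((void **)&d_x, cols * sizeof(float));
    cudaMalloc((void **)&d_y, rows * sizeof(float));
    cudaMemcpy(d_a, a, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x, cols * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // y = A * x, where A is rows x cols in column-major order, so lda = rows.
    cublasSgemv(handle, CUBLAS_OP_N, rows, cols,
                &alpha, d_a, rows, d_x, 1, &beta, d_y, 1);

    cudaMemcpy(y, d_y, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(d_a); cudaFree(d_x); cudaFree(d_y);
    free(a); free(x); free(y);
    return 0;
}
```

Segmentation faults with cublasSgemv typically come from the host side (passing host pointers where device pointers are expected, a missing cublasCreate, or dimension/lda mismatches), so comparing the failing code against a minimal sequence like this is a reasonable first step.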