gpu-programming

Sharing roots and weights for many Gauss-Legendre Quadrature in GPUs

Submitted by 我的未来我决定 on 2019-12-12 03:13:12
Question: I am intending to compute, in a parallel fashion, a lot of numerical quadratures that at the end of the day use a common set of data for all the computations (fairly large arrays of roots and weights occupying about 25 KB of memory). The Gauss-Legendre quadrature method is simple enough to start with. I want to make the roots and weights available to all the threads in the device through the declaration __device__ double *d_droot, *d_dweight. But I am missing something, because I have to pass
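As a reference point for the data being shared, the roots and weights themselves can be generated host-side before being copied to the device. A minimal sketch in Python/NumPy (not the poster's CUDA code; the helper name gauss_legendre is ours):

```python
import numpy as np

# Gauss-Legendre roots and weights of order n
# (the kind of data d_droot / d_dweight would hold on the device)
n = 16
roots, weights = np.polynomial.legendre.leggauss(n)

def gauss_legendre(f, a, b):
    # Map the nodes from [-1, 1] onto [a, b] and apply the weighted sum.
    x = 0.5 * (b - a) * roots + 0.5 * (b + a)
    return 0.5 * (b - a) * np.sum(weights * f(x))

approx = gauss_legendre(np.sin, 0.0, np.pi)  # exact value is 2
```

Note that one order-n rule (two arrays of n doubles) is tiny; the ~25 KB figure in the question corresponds to a much larger set of nodes, but the sharing pattern is the same.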

Using CUDA Profiler nvprof for memory accesses

Submitted by 强颜欢笑 on 2019-12-12 01:59:21
Question: I'm using nvprof to get the number of global memory accesses for the following CUDA code. The number of loads in the kernel is 36 (accessing the d_In array) and the number of stores in the kernel is 36+36 (accessing the d_Out array and the d_rows array). So the total number of global memory loads is 36 and the number of global memory stores is 72. However, when I profile the code with the nvprof CUDA profiler, it reports the following: (Basically I want to compute the Compute to Global Memory Access
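The hand count above can be written out explicitly. A tiny Python sketch of that bookkeeping (the CGMA ratio itself would additionally need the kernel's FLOP count, which the excerpt does not give):

```python
# Hand-counted global memory traffic from the question
loads = 36            # reads of d_In
stores = 36 + 36      # writes to d_Out plus writes to d_rows
total_accesses = loads + stores

print(loads, stores, total_accesses)  # 36 72 108
```

nvprof typically reports these as memory *transactions*, which are counted per coalesced warp-level access rather than per thread, so its numbers can legitimately differ from a per-element hand count.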

running NVENC sdk sample get error because there is not libnvidia-encode

Submitted by ぐ巨炮叔叔 on 2019-12-11 21:26:56
Question: When I try to make the nvEncodeApp NVENC SDK sample on CentOS 6.4, I get this error: /usr/bin/ld: cannot find -lnvidia-encode. When I checked the Makefile, the path to this library was there: -L/usr/lib64 -lnvidia-encode -ldl. I checked /usr/lib64, but there is no libnvidia-encode there. How does this library get added to that path, and what is this library? Answer 1: Using nvidia-smi should tell you that: nvidia-smi Tue Jul 16 20:19:20 2013 +------------------------------------------------------+ | NVIDIA-SMI 4.304

NVIDIA CUDA SDK Examples Compilation Unsupported Architecture 'compute_20'

Submitted by 旧城冷巷雨未停 on 2019-12-11 12:32:28
Question: On compilation of the CUDA SDK, I'm getting nvcc fatal : Unsupported gpu architecture 'compute_20'. My toolkit is 2.3 on a shared system (i.e. I can't really upgrade) and the driver version is also 2.3, running on 4 Tesla C1060s. If it helps, the problem arises in radixsort. It appears that a few people online have had this problem, but I haven't found anywhere that actually gives a solution. Answer 1: I believe compute_20 is targeting Fermi hardware, which you do not have. Also, CUDA 2.3

nvEncodeApp successfully make but in running it : NVENC error at CNVEncoder.cpp:1282 code=15 ( invalid struct version was used ) “nvStatus”

Submitted by [亡魂溺海] on 2019-12-11 09:28:00
Question: I built nvEncodeApp successfully, but when I run it my output is like this: ./nvEncoder -infile=HeavyHandIdiot.3sec.yuv -outfile=outh.264 -width=1080 -height=1080 > NVEncode configuration parameters for Encoder[0] > GPU Device ID = 0 > Input File = HeavyHandIdiot.3sec.yuv > Output File = outh.264 > Frames [000--01] = 0 frames > Multi-View Codec = No > Width,Height = [1080,1080] > Video Output Codec = 4 - H.264 Codec > Average Bitrate = 0 (bps/sec) > Peak Bitrate = 0 (bps/sec) > BufferSize = 0 >

Elementwise operations in OpenCL (Cuda)

Submitted by 独自空忆成欢 on 2019-12-11 08:57:24
Question: I built a kernel for elementwise multiplication of two matrices, but at least with my configuration my OpenCL kernel is only faster when each matrix is larger than 2 GB. So I was wondering whether this is because of my naive kernel (see below) or because of the nature of elementwise operations, meaning that elementwise operations don't gain from using GPUs. Thanks for your input! kernel: KERNEL_CODE = """ // elementwise multiplication: C = A .* B. __kernel void matrixMul( __global float* C, _
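Whether such a kernel can win is mostly a bandwidth question: elementwise multiply performs one FLOP per three float accesses, so it is strongly memory-bound on any device. A NumPy sketch of that arithmetic-intensity estimate (variable names are ours, not from the poster's kernel):

```python
import numpy as np

# Elementwise C = A * B: each output element reads 2 floats and writes 1
# for a single multiply, so arithmetic intensity is ~1 FLOP per 12 bytes
# at float32. That is deep in memory-bound territory on any GPU.
a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)
c = a * b

flops = a.size                          # one multiply per element
bytes_moved = 3 * a.size * a.itemsize   # read A, read B, write C
intensity = flops / bytes_moved         # 1/12 FLOP per byte
```

At that intensity, the GPU's advantage comes only from its higher memory bandwidth, and the PCIe transfer of A, B, and C can easily dominate, which is consistent with the poster only seeing a speedup on very large inputs.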

compiling opencv with gpu cuda support

Submitted by 风流意气都作罢 on 2019-12-11 08:22:35
Question: I am using OpenCV 2.3.1 with CUDA 4.0. I installed OpenCV 2.3.1 via CMake with the WITH_CUDA flag on, and then compiled the OpenCV solution in release and debug mode, but when I use the getCudaEnabledDevice function of cv::gpu it still returns 0. This means it is not detecting the CUDA-enabled device. It seems that I have done everything right, so what is happening? Can anybody suggest where the problem might be? Thanks in advance. Answer 1: I had the same problem. I fixed it

Different Image Block Sizes Using the GPU

Submitted by 拈花ヽ惹草 on 2019-12-11 07:53:28
Question: I wish to apply a motion filter for a certain number of iterations on different images; each image will be divided into blocks of different sizes. For example, if the image size is 1024x870, how do I divide this image into different block sizes (8x8, 16x16, 64x64, etc.) using MATLAB? Answer 1: It's not perfect, but I would do: A=rand(128); Apatch=im2col(A,[64 64],'distinct'); Apatch=gpuArray(Apatch); Otherwise you can try (I am not sure it speeds things up): A=rand(128); A=gpuArray(A); Apatch=im2col(A,[64 64],
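For readers without MATLAB, the same 'distinct' (non-overlapping) blocking can be sketched in Python/NumPy; split_blocks is our own helper, not part of any library:

```python
import numpy as np

def split_blocks(img, bh, bw):
    """Split img into non-overlapping bh x bw blocks,
    analogous to MATLAB's im2col(A, [bh bw], 'distinct')."""
    h, w = img.shape
    assert h % bh == 0 and w % bw == 0, "image size must divide evenly into blocks"
    return (img.reshape(h // bh, bh, w // bw, bw)
               .swapaxes(1, 2)          # group the two block indices together
               .reshape(-1, bh, bw))    # one block per leading index

img = np.arange(128 * 128, dtype=np.float32).reshape(128, 128)
blocks = split_blocks(img, 64, 64)      # 4 blocks of 64x64
```

Note that 1024x870 does not divide evenly by 64, so in practice the image would need padding (or a trailing partial block) before this scheme applies.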

How to turn OpenCV_GPUMat into CUdeviceptr?

Submitted by 别说谁变了你拦得住时间么 on 2019-12-11 07:08:18
Question: I was modifying the NVTranscoder project from the Video_Codec_SDK_8.0.14 in order to add some signal-processing work to the video frames. However, I encountered some problems when turning the GpuMat into a CUdeviceptr. I was wondering how I can turn the GpuMat into a CUdeviceptr. After I perform the blurring function, which I have emphasized below, I want to turn the processed Mat into a CUdeviceptr. Besides, the part converting the CUdeviceptr into a GpuMat is also wrong, as it shows the

Shared memory and streams when launching kernel

Submitted by 对着背影说爱祢 on 2019-12-11 04:08:54
Question: I'm new to CUDA and working on a personal project. I know that if you want to specify the amount of shared memory at launch: kernel<<<grid_size,block_size,shared_mem_size>>>(parameters); On the other hand, if I want to put a kernel into a stream: kernel<<<grid_size,block_size,0,stream_being_used>>>(parameters); I don't understand why the third parameter is 0 in the case of a stream. (I'm getting this from chapter 10 of "CUDA by Example" by Sanders and Kandrot.) If I want to specify the shared