gpu-programming

Sharing roots and weights for many Gauss-Legendre Quadrature in GPUs

Submitted by 我的未来我决定 on 2019-12-12 03:13:12
Question: I am intending to compute, in a parallel fashion, a lot of numerical quadratures that at the end of the day use a common set of data for all the computations (fairly large arrays of roots and weights occupying about 25 KB of memory). The Gauss-Legendre quadrature method is simple enough to start with. I want to make the roots and weights available to all the threads in the device through the declaration __device__ double *d_droot, *d_dweight. But I am missing something, because I have to pass
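As a reference point for the data being shared, the roots and weights themselves can be generated host-side before being copied to the device. A minimal sketch in Python/NumPy (not the poster's CUDA code; the helper name gauss_legendre is ours):

```python
import numpy as np

# Gauss-Legendre roots and weights of order n
# (the kind of data d_droot / d_dweight would hold on the device)
n = 16
roots, weights = np.polynomial.legendre.leggauss(n)

def gauss_legendre(f, a, b):
    # Map the nodes from [-1, 1] onto [a, b] and apply the weighted sum.
    x = 0.5 * (b - a) * roots + 0.5 * (b + a)
    return 0.5 * (b - a) * np.sum(weights * f(x))

approx = gauss_legendre(np.sin, 0.0, np.pi)  # exact value is 2
```

Note that one order-n rule (two arrays of n doubles) is tiny; the ~25 KB figure in the question corresponds to a much larger set of nodes, but the sharing pattern is the same.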

Using CUDA Profiler nvprof for memory accesses

Submitted by 强颜欢笑 on 2019-12-12 01:59:21
Question: I'm using nvprof to get the number of global memory accesses for the following CUDA code. The number of loads in the kernel is 36 (accessing the d_In array) and the number of stores in the kernel is 36+36 (accessing the d_Out array and the d_rows array). So the total number of global memory loads is 36 and the number of global memory stores is 72. However, when I profile the code with the nvprof CUDA profiler, it reports the following: (Basically I want to compute the Compute to Global Memory Access
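The hand count above can be written out explicitly. A tiny Python sketch of that bookkeeping (the CGMA ratio itself would additionally need the kernel's FLOP count, which the excerpt does not give):

```python
# Hand-counted global memory traffic from the question
loads = 36            # reads of d_In
stores = 36 + 36      # writes to d_Out plus writes to d_rows
total_accesses = loads + stores

print(loads, stores, total_accesses)  # 36 72 108
```

nvprof typically reports these as memory *transactions*, which are counted per coalesced warp-level access rather than per thread, so its numbers can legitimately differ from a per-element hand count.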

running NVENC sdk sample get error because there is not libnvidia-encode

Submitted by ぐ巨炮叔叔 on 2019-12-11 21:26:56
Question: When I try to make the nvEncodeApp NVENC SDK sample on CentOS 6.4, I get this error: /usr/bin/ld: cannot find -lnvidia-encode. When I checked the Makefile, the path to this library was there: -L/usr/lib64 -lnvidia-encode -ldl. I checked /usr/lib64, but there is no libnvidia-encode there. How does this library get added to that path, and what is this library? Answer 1: Using nvidia-smi should tell you that: nvidia-smi Tue Jul 16 20:19:20 2013 +------------------------------------------------------+ | NVIDIA-SMI 4.304

NVIDIA CUDA SDK Examples Compilation Unsupported Architecture 'compute_20'

Submitted by 旧城冷巷雨未停 on 2019-12-11 12:32:28
Question: On compilation of the CUDA SDK, I'm getting nvcc fatal : Unsupported gpu architecture 'compute_20'. My toolkit is 2.3 on a shared system (i.e. I can't really upgrade) and the driver version is also 2.3, running on 4 Tesla C1060s. If it helps, the problem arises in radixsort. It appears that a few people online have had this problem, but I haven't found anywhere that actually gives a solution. Answer 1: I believe compute_20 is targeting Fermi hardware, which you do not have. Also, CUDA 2.3

nvEncodeApp successfully make but in running it : NVENC error at CNVEncoder.cpp:1282 code=15 ( invalid struct version was used ) “nvStatus”

Submitted by [亡魂溺海] on 2019-12-11 09:28:00
Question: I built nvEncodeApp successfully, but when I run it my output is like this: ./nvEncoder -infile=HeavyHandIdiot.3sec.yuv -outfile=outh.264 -width=1080 -height=1080 > NVEncode configuration parameters for Encoder[0] > GPU Device ID = 0 > Input File = HeavyHandIdiot.3sec.yuv > Output File = outh.264 > Frames [000--01] = 0 frames > Multi-View Codec = No > Width,Height = [1080,1080] > Video Output Codec = 4 - H.264 Codec > Average Bitrate = 0 (bps/sec) > Peak Bitrate = 0 (bps/sec) > BufferSize = 0 >

Elementwise operations in OpenCL (Cuda)

Submitted by 独自空忆成欢 on 2019-12-11 08:57:24
Question: I built a kernel for elementwise multiplication of two matrices, but at least with my configuration my OpenCL kernel is only faster when each matrix is larger than 2 GB. So I was wondering whether this is because of my naive kernel (see below) or because of the nature of elementwise operations, meaning that elementwise operations don't gain from using GPUs. Thanks for your input! kernel: KERNEL_CODE = """ // elementwise multiplication: C = A .* B. __kernel void matrixMul( __global float* C, _
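Whether such a kernel can win is mostly a bandwidth question: elementwise multiply performs one FLOP per three float accesses, so it is strongly memory-bound on any device. A NumPy sketch of that arithmetic-intensity estimate (variable names are ours, not from the poster's kernel):

```python
import numpy as np

# Elementwise C = A * B: each output element reads 2 floats and writes 1
# for a single multiply, so arithmetic intensity is ~1 FLOP per 12 bytes
# at float32. That is deep in memory-bound territory on any GPU.
a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)
c = a * b

flops = a.size                          # one multiply per element
bytes_moved = 3 * a.size * a.itemsize   # read A, read B, write C
intensity = flops / bytes_moved         # 1/12 FLOP per byte
```

At that intensity, the GPU's advantage comes only from its higher memory bandwidth, and the PCIe transfer of A, B, and C can easily dominate, which is consistent with the poster only seeing a speedup on very large inputs.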

compiling opencv with gpu cuda support

Submitted by 风流意气都作罢 on 2019-12-11 08:22:35
Question: I am using OpenCV 2.3.1 with CUDA 4.0. I installed OpenCV 2.3.1 via CMake with the WITH_CUDA flag on, and then compiled the OpenCV solution in release and debug mode, but when I use the getCudaEnabledDevice function of cv::gpu it still returns 0. This means it is not detecting the CUDA-enabled device. It seems that I have done everything right, so what is happening? Can anybody suggest where the problem might be? Thanks in advance. Answer 1: I had the same problem. I fixed it

Different Image Block Sizes Using the GPU

Submitted by 拈花ヽ惹草 on 2019-12-11 07:53:28
Question: I wish to apply a motion filter for a certain number of iterations on different images; each image will be divided into blocks of different sizes. For example, if the image size is 1024x870, how do I divide this image into different block sizes (8x8, 16x16, 64x64, etc.) using MATLAB? Answer 1: It's not perfect, but I would do: A=rand(128); Apatch=im2col(A,[64 64],'distinct'); Apatch=gpuArray(Apatch); Otherwise you can try (I am not sure it speeds things up): A=rand(128); A=gpuArray(A); Apatch=im2col(A,[64 64],
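For readers without MATLAB, the same 'distinct' (non-overlapping) blocking can be sketched in Python/NumPy; split_blocks is our own helper, not part of any library:

```python
import numpy as np

def split_blocks(img, bh, bw):
    """Split img into non-overlapping bh x bw blocks,
    analogous to MATLAB's im2col(A, [bh bw], 'distinct')."""
    h, w = img.shape
    assert h % bh == 0 and w % bw == 0, "image size must divide evenly into blocks"
    return (img.reshape(h // bh, bh, w // bw, bw)
               .swapaxes(1, 2)          # group the two block indices together
               .reshape(-1, bh, bw))    # one block per leading index

img = np.arange(128 * 128, dtype=np.float32).reshape(128, 128)
blocks = split_blocks(img, 64, 64)      # 4 blocks of 64x64
```

Note that 1024x870 does not divide evenly by 64, so in practice the image would need padding (or a trailing partial block) before this scheme applies.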

How to turn OpenCV_GPUMat into CUdeviceptr?

Submitted by 别说谁变了你拦得住时间么 on 2019-12-11 07:08:18
Question: I was modifying the NVTranscoder project from the Video_Codec_SDK_8.0.14 in order to add some signal-processing work to the video frames. However, I encountered some problems when turning the GpuMat into a CUdeviceptr. I was wondering how I can turn the GpuMat into a CUdeviceptr. After I perform the blurring function, which I have emphasized below, I want to turn the processed Mat into a CUdeviceptr. Besides, the part converting the CUdeviceptr into a GpuMat is also wrong, as it shows the

Shared memory and streams when launching kernel

Submitted by 对着背影说爱祢 on 2019-12-11 04:08:54
Question: I'm new to CUDA and working on a personal project. I know that if you want to specify the amount of shared memory at launch: kernel<<<grid_size,block_size,shared_mem_size>>>(parameters); On the other hand, if I want to put a kernel into a stream: kernel<<<grid_size,block_size,0,stream_being_used>>>(parameters); I don't understand why the third parameter is 0 in the case of a stream. (I'm getting this from chapter 10 of "CUDA by Example" by Sanders and Kandrot.) If I want to specify the shared