nvidia

What is a bank conflict? (Doing Cuda/OpenCL programming)

Submitted by 折月煮酒 on 2019-11-27 02:35:52
I have been reading the programming guides for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. Can anybody help me understand it? I have no preference whether the help is in the context of CUDA/OpenCL or bank conflicts in general in computer science.

Grizzly: For NVIDIA (and AMD, for that matter) GPUs, local memory is divided into memory banks. Each bank can address only one dataset at a time, so if a half-warp tries to load/store data from/to the same bank, the accesses have to be serialized: that serialization is the bank conflict.
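A minimal sketch of the effect described above, assuming hardware with 32 banks of 4-byte words and a 32-thread warp; the kernel names and the 32x32 tile are made up for illustration:

```cuda
__global__ void noConflict(const float *in, float *out)
{
    __shared__ float tile[32][32];
    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 32 + threadIdx.x];
    __syncthreads();
    // Conflict-free: the 32 threads of a warp (consecutive threadIdx.x)
    // read 32 consecutive 4-byte words, one per bank.
    out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
}

__global__ void withConflict(const float *in, float *out)
{
    __shared__ float tile[32][32];
    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 32 + threadIdx.x];
    __syncthreads();
    // 32-way conflict: the threads of a warp read column-wise, words that
    // are 32 floats (128 bytes) apart, so every access maps to the same
    // bank and the warp's reads are serialized.
    out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}
```

Both kernels would be launched with a 32x32 block, e.g. `noConflict<<<1, dim3(32, 32)>>>(d_in, d_out)`. A common fix for the second pattern is padding the tile to `[32][33]` so column accesses fall into different banks.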

External calls are not supported - CUDA

Submitted by 女生的网名这么多〃 on 2019-11-27 02:26:57
Question: The objective is to call a device function defined in another file. When I compile the global kernel it shows the following error: *External calls are not supported (found non-inlined call to _Z6GoldenSectionCUDA)*. Problematic code (not the full code, but where the problem arises):

cat norm.h
    #ifndef NORM_H_
    #define NORM_H_
    #include <stdio.h>
    __device__ double invcdf(double prob, double mean, double stddev);
    #endif

cat norm.cu
    #include <norm.h>
    __device__ double invcdf(double prob, double
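This error typically appears when a `__device__` function is defined in a different translation unit and device-side linking is disabled. A hedged sketch of the usual fix, reusing the file names from the question (the function body and `main.cu` are stand-ins for illustration): compile with relocatable device code so nvcc can link `__device__` calls across files.

```cuda
// norm.h
#ifndef NORM_H_
#define NORM_H_
__device__ double invcdf(double prob, double mean, double stddev);
#endif

// norm.cu
#include "norm.h"
__device__ double invcdf(double prob, double mean, double stddev)
{
    return mean + stddev * prob;  // placeholder body, not the real inverse CDF
}

// main.cu
#include "norm.h"
__global__ void kernel(double *out)
{
    out[0] = invcdf(0.5, 0.0, 1.0);  // cross-file device call
}

// Build with device linking enabled (requires CUDA 5.0+ and
// a compute capability 2.0+ target):
//   nvcc -rdc=true norm.cu main.cu -o app
```

Without `-rdc=true`, nvcc compiles each `.cu` file in whole-program mode and every device call must be resolvable (inlinable) within its own translation unit, which is exactly what the error message complains about.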

Are cuda kernel calls synchronous or asynchronous

Submitted by 我的未来我决定 on 2019-11-27 02:06:12
Question: I read that one can use kernel launches to synchronize different blocks, i.e., if I want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way I can achieve global synchronization between blocks. However, the CUDA C programming guide mentions that kernel calls are asynchronous, i.e. the CPU does not wait for the first kernel call to finish, and thus the CPU could call the second kernel before
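The two statements are compatible: kernel launches are asynchronous with respect to the *host*, but kernels launched into the same stream are executed in order with respect to each other on the *device*. A minimal sketch (kernel names and launch shape are hypothetical):

```cuda
__global__ void operation1(int *d) { /* ... work for phase 1 ... */ }
__global__ void operation2(int *d) { /* ... work for phase 2 ... */ }

int main()
{
    int *d;
    cudaMalloc(&d, 1024 * sizeof(int));

    // Both launches return to the CPU immediately (asynchronous),
    // but both go into the default stream, so operation2 does not
    // start on the GPU until every block of operation1 has finished:
    // that ordering is the global synchronization between blocks.
    operation1<<<128, 256>>>(d);
    operation2<<<128, 256>>>(d);

    cudaDeviceSynchronize();  // only here does the host wait for the GPU
    cudaFree(d);
    return 0;
}
```

So the CPU may indeed *enqueue* the second kernel before the first finishes, but the GPU still runs them back to back.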

Is it possible to run CUDA on AMD GPUs?

Submitted by 淺唱寂寞╮ on 2019-11-27 00:16:30
Question: I'd like to extend my skill set into GPU computing. I am familiar with raytracing and realtime graphics (OpenGL), but the next generation of graphics and high-performance computing seems to be GPU computing or something like it. I currently use an AMD HD 7870 graphics card on my home computer. Could I write CUDA code for this? (My intuition is no, but since Nvidia released the compiler binaries I might be wrong.) A second, more general question is: where do I start with GPU computing? I'm

Cuda Random Number Generation

Submitted by 穿精又带淫゛_ on 2019-11-26 23:35:24
Question: I was wondering what the best way is to generate one pseudo-random number between 0 and 49k that is the same for every thread, using cuRAND or something else. I prefer to generate the random numbers inside the kernel because I will have to generate one at a time, but about 10k times. I could use floats between 0.0 and 1.0, but I've no idea how to make my PRN available to all threads, because most posts and examples show how to get a *different* PRN for each thread. Thanks

Answer 1:
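One hedged approach to the question above (a sketch, not necessarily what the original answer proposed): give every thread the *same* seed, sequence number, and offset in `curand_init`, so each thread's generator produces an identical stream and every call returns the same value in all threads. The kernel name and the ~10k loop are illustrative:

```cuda
#include <curand_kernel.h>

__global__ void samePRNG(unsigned int *out, unsigned long long seed)
{
    // Identical (seed, sequence, offset) for all threads means
    // identical generator state, hence identical outputs everywhere.
    curandState state;
    curand_init(seed, /*sequence=*/0, /*offset=*/0, &state);

    for (int i = 0; i < 10000; ++i) {
        // Same value in [0, 49000) in every thread at iteration i.
        unsigned int r = curand(&state) % 49000;
        // ... use r ...
        if (threadIdx.x == 0 && blockIdx.x == 0 && i == 0)
            out[0] = r;  // expose one sample for checking
    }
}
```

Note that `% 49000` introduces a small modulo bias; an alternative is to have one thread generate the value and broadcast it via shared memory and `__syncthreads()`.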

[C] NVIDIA dual-GPU switching on Ubuntu 13.04

Submitted by 我们两清 on 2019-11-26 22:09:35
First, add the PPA source. Open a terminal and enter:
sudo add-apt-repository ppa:bumblebee/stable
Then enter:
sudo apt-get update
====================================================
Begin the installation. In the terminal, enter:
sudo apt-get install bumblebee (this automatically installs Bumblebee 3)
sudo apt-get install bumblebee-nvidia (this installs the official NVIDIA driver)
When the installation completes, reboot!
====================================================
After rebooting, verify that it worked. Open a terminal and enter:
lspci | grep -i vga
This lists the state of the Intel integrated GPU and the NVIDIA discrete GPU. If the NVIDIA entry ends with "rev ff", the discrete GPU is powered off.
Then enter:
sudo optirun glxgears
A window named "glxgears" showing spinning 3D gears will appear; do not close it.
Open another terminal and enter:
lspci | grep -i vga
The Intel and NVIDIA GPUs are listed again; if the NVIDIA entry now ends with "rev" followed by a number, the discrete GPU is powered on and working. Then

Horrible redraw performance of the DataGridView on one of my two screens

Submitted by 痴心易碎 on 2019-11-26 21:36:35
I've actually solved this, but I'm posting it for posterity. I ran into a very odd issue with the DataGridView on my dual-monitor system. The issue manifests itself as an EXTREMELY slow repaint of the control (around 30 seconds for a full repaint), but only when it is on one of my screens. When it is on the other, the repaint speed is fine. I have an Nvidia 8800 GT with the latest non-beta drivers (175.something). Is it a driver bug? I'll leave that up in the air, since I have to live with this particular configuration. (It does not happen on ATI cards, though...) The paint speed has nothing to do

Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed]

Submitted by 旧城冷巷雨未停 on 2019-11-26 19:16:33
How are threads organized to be executed by a GPU?

cibercitizen1: Hardware: if a GPU device has, for example, 4 multiprocessing units, and each can run 768 threads, then at any given moment no more than 4*768 threads will really be running in parallel (if you scheduled more threads, they will be waiting their turn). Software: threads are organized into blocks. A block is executed by one multiprocessing unit. The threads of a block can be identified (indexed) using a 1-dimensional (x), 2-dimensional (x,y) or 3-dimensional (x,y,z) index, but in any case x*y*z <= 768 for our example (other restrictions apply to x, y, z
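The organization described above is what the built-in index variables expose inside a kernel. A small sketch computing each thread's global 1D index, the standard CUDA idiom (kernel name and launch shape are illustrative):

```cuda
__global__ void whoAmI(int *out, int n)
{
    // blockIdx selects the block within the grid, threadIdx the thread
    // within the block, and blockDim gives the block's size.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = gid;  // each of the n threads records its own index
}

// Matching the 4-multiprocessor / 768-thread example above,
// a grid of 4 blocks of 768 threads covers 3072 threads:
//   whoAmI<<<4, 768>>>(d_out, 4 * 768);
```

The guard `if (gid < n)` is the usual way to handle grids that are rounded up past the problem size.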

How do CUDA blocks/warps/threads map onto CUDA cores?

Submitted by 淺唱寂寞╮ on 2019-11-26 19:16:24
I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads. I am studying the architecture from a didactic point of view (university project), so reaching peak performance is not my concern. First of all, I would like to check that I have these facts straight: the programmer writes a kernel and organizes its execution in a grid of thread blocks. Each block is assigned to a Streaming Multiprocessor (SM); once assigned, it cannot migrate to another SM. Each SM splits its own blocks into warps (currently with a maximum size of 32 threads). All the
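The block-to-warp split described above can be observed from inside a kernel. A short sketch (the kernel name is made up; `warpSize` is a built-in variable, 32 on all current hardware):

```cuda
__global__ void warpInfo(int *warpIds)
{
    // Threads 0..31 of a block form warp 0, threads 32..63 warp 1,
    // and so on: consecutive threads, in threadIdx order.
    int lane = threadIdx.x % warpSize;  // this thread's position in its warp
    int warp = threadIdx.x / warpSize;  // this thread's warp within the block

    warpIds[blockIdx.x * blockDim.x + threadIdx.x] = warp;
    (void)lane;  // lane shown only to illustrate the decomposition
}
```

Launching `warpInfo<<<1, 128>>>(d_out)` would record warp ids 0,0,...,0,1,...,3: four warps of 32 threads for a 128-thread block.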

How to measure the inner kernel time in NVIDIA CUDA?

Submitted by 拈花ヽ惹草 on 2019-11-26 17:45:32
I want to measure time inside a kernel on the GPU. How do I measure it in NVIDIA CUDA? e.g.

    __global__ void kernelSample()
    {
        // some code here
        // get start time
        // some code here
        // get stop time
        // some code here
    }

Answer: Try this; it measures the time between two events in milliseconds.

    cudaEvent_t start, stop;
    float elapsedTime;
    cudaEventCreate(&start);
    cudaEventRecord(start, 0);
    // Do kernel activity here
    cudaEventCreate(&stop);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Elapsed time : %f ms\n", elapsedTime);

You can do something like this: __global__ void
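The cudaEvent approach times the whole launch from the host. To time a region *inside* the kernel, one common sketch uses the per-SM cycle counter `clock64()`; this is illustrative only (cycle counts, not wall time; convert via the SM clock rate, and results differ per thread and per SM):

```cuda
__global__ void kernelSample(long long *cycles)
{
    // some code here
    long long start = clock64();   // get start time (SM cycle counter)
    // some code here: the region being measured
    long long stop = clock64();    // get stop time
    // some code here
    if (threadIdx.x == 0 && blockIdx.x == 0)
        cycles[0] = stop - start;  // elapsed cycles seen by one thread
}
```

Because each SM has its own counter, differences are only meaningful between readings taken by the same thread; reporting one thread's value, as above, is the usual minimal pattern.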