gpgpu

how does nvidia-smi work?

孤者浪人 submitted on 2021-02-19 06:15:34
Question: What is the internal operation that allows nvidia-smi to fetch hardware-level details? The tool runs even while other processes are using the GPU device and still reports utilization, the name and ID of each process, etc. Is it possible to develop such a tool at the user level? How is NVML related? Answer 1: nvidia-smi is a thin wrapper around NVML. You can program against NVML using the SDK contained in the Tesla Deployment Kit. Everything that can be done with nvidia-smi can be queried
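
A minimal sketch, not from the original answer, of how such a user-level tool might query NVML directly. It assumes nvml.h is available, the program is linked against the nvidia-ml library, only device 0 is inspected, and 32 process slots are enough:

    // Query name, utilization, and running compute processes via NVML,
    // the same library nvidia-smi wraps.
    #include <nvml.h>
    #include <stdio.h>

    #define CHECK(call) do { nvmlReturn_t r_ = (call); \
        if (r_ != NVML_SUCCESS) { \
            printf("NVML error: %s\n", nvmlErrorString(r_)); return 1; } } while (0)

    int main(void) {
        CHECK(nvmlInit());

        nvmlDevice_t dev;
        CHECK(nvmlDeviceGetHandleByIndex(0, &dev));          // first GPU only

        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        CHECK(nvmlDeviceGetName(dev, name, sizeof(name)));

        nvmlUtilization_t util;                              // busy percentages
        CHECK(nvmlDeviceGetUtilizationRates(dev, &util));
        printf("%s: gpu %u%%, mem %u%%\n", name, util.gpu, util.memory);

        unsigned int count = 32;                             // assumed capacity
        nvmlProcessInfo_t procs[32];
        CHECK(nvmlDeviceGetComputeRunningProcesses(dev, &count, procs));
        for (unsigned int i = 0; i < count; ++i)
            printf("pid %u uses %llu bytes\n", procs[i].pid,
                   (unsigned long long)procs[i].usedGpuMemory);

        return nvmlShutdown() == NVML_SUCCESS ? 0 : 1;
    }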

Why is there a warp-level synchronization primitive in CUDA?

你。 submitted on 2021-02-17 06:18:07
Question: I have two questions regarding __syncwarp() in CUDA: If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized? If so, what exactly does __syncwarp() do, and why is it necessary? Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with the other threads in its group via shared memory, but does not
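
A minimal sketch of why the primitive matters, assuming a block of exactly one warp (32 threads): since Volta's independent thread scheduling, the lanes of a warp are no longer guaranteed to run in lockstep, so shared-memory steps must be ordered explicitly with __syncwarp():

    __global__ void warpReduce(const float *in, float *out) {
        __shared__ float s[32];
        unsigned lane = threadIdx.x;            // blockDim.x == 32 assumed
        s[lane] = in[blockIdx.x * 32 + lane];
        __syncwarp();                           // every store visible before any read

        for (int offset = 16; offset > 0; offset >>= 1) {
            float v = (lane < offset) ? s[lane + offset] : 0.0f;
            __syncwarp();                       // reads complete before the writes
            if (lane < offset) s[lane] += v;
            __syncwarp();                       // writes visible to the next step
        }
        if (lane == 0) out[blockIdx.x] = s[0];  // one sum per 32-element block
    }

Without the calls inside the loop, one lane could race ahead to the next iteration and read s[lane + offset] before the owning lane has written it; pre-Volta lockstep execution happened to hide that race, but the current execution model does not.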

Getting started with shared memory on PyCUDA

大城市里の小女人 submitted on 2021-02-08 10:35:59
Question: I'm trying to understand shared memory by playing with the following code:

import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule

src='''
__global__ void reduce0(float *g_idata, float *g_odata) {
    extern __shared__ float sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // do
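
For reference, a sketch of what the complete kernel and its launch look like in plain CUDA C++, assuming the excerpt is NVIDIA's classic reduce0 example; the part that involves shared memory is the extern __shared__ declaration plus the byte count supplied at launch (PyCUDA exposes that count as the shared= keyword of the kernel call):

    __global__ void reduce0(const float *g_idata, float *g_odata) {
        extern __shared__ float sdata[];              // sized at launch time
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + tid;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // tree reduction in shared memory
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) g_odata[blockIdx.x] = sdata[0]; // one partial sum per block
    }

    // host side: the third <<<>>> argument is the dynamic shared-memory size
    // in bytes, e.g. reduce0<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out);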

GPU-based search for all possible paths between two nodes on a graph

牧云@^-^@ submitted on 2021-02-07 17:23:31
Question: My work makes extensive use of the algorithm by Migliore, Martorana and Sciortino for finding all possible simple paths, i.e. ones in which no node is encountered more than once, in a graph, as described in: An Algorithm to find All Paths between Two Nodes in a Graph. (Although this algorithm is essentially a depth-first search and intuitively recursive in nature, the authors also present a non-recursive, stack-based implementation.) I'd like to know if such an algorithm can be implemented on
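
A sketch, with hypothetical names and under strong assumptions, of how a non-recursive, stack-based DFS of this kind could map to CUDA: each thread keeps its DFS stack in local memory and explores only the paths whose first hop is "its" neighbor of the source. It assumes a CSR adjacency (rowPtr/colIdx), at most MAX_NODES nodes, silently drops paths longer than MAX_DEPTH nodes, and only counts paths instead of storing them:

    #define MAX_DEPTH 16    // longest explored path, in nodes (assumption)
    #define MAX_NODES 1024  // cap for per-thread visited flags (assumption)

    __global__ void countSimplePaths(const int *rowPtr, const int *colIdx,
                                     int src, int dst, int numNodes,
                                     unsigned long long *pathCount)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;  // k-th neighbor of src
        if (k >= rowPtr[src + 1] - rowPtr[src] || numNodes > MAX_NODES) return;

        bool onPath[MAX_NODES];
        for (int i = 0; i < numNodes; ++i) onPath[i] = false;

        int path[MAX_DEPTH];      // explicit DFS stack of nodes
        int nextEdge[MAX_DEPTH];  // next CSR edge to try at each depth

        int first = colIdx[rowPtr[src] + k];
        if (first == dst) {                    // the one-edge path src -> dst
            atomicAdd(pathCount, 1ULL);
            return;
        }
        path[0] = src;   onPath[src]   = true;
        path[1] = first; onPath[first] = true;
        nextEdge[1] = rowPtr[first];
        int depth = 1;

        while (depth >= 1) {
            int v = path[depth];
            if (nextEdge[depth] < rowPtr[v + 1] && depth + 1 < MAX_DEPTH) {
                int w = colIdx[nextEdge[depth]++];
                if (onPath[w]) continue;       // revisit: path not simple
                if (w == dst) {                // complete simple path found
                    atomicAdd(pathCount, 1ULL);
                    continue;
                }
                ++depth;                       // push w and descend
                path[depth] = w;
                onPath[w] = true;
                nextEdge[depth] = rowPtr[w];
            } else {
                onPath[v] = false;             // neighbors exhausted: backtrack
                --depth;
            }
        }
    }

The per-thread stacks live in local memory, so occupancy rather than correctness is the main cost; load imbalance between first hops is the usual objection to this naive mapping.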

Retrieving values from arrayfire array as standard types and serialization

不想你离开。 submitted on 2021-02-07 13:37:52
Question: I recently saw ArrayFire demonstrated at GTC and thought I would try it. Here are some questions I have run into while using it. I am running Visual Studio 2013 on a Windows 7 system with OpenCL from the AMD APP SDK 2.9-1. The biggest frustration is that I cannot view the state of array objects in the debugger to see what data is in them; I must rely on the af_print statement, which is very annoying. Is there any way to configure the debugger to let me see the data in the array without
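
A minimal sketch, not from the original thread, of pulling ArrayFire data back into ordinary host types, which gives both the debugger and any serializer something they can inspect; host() copies device data into a caller-owned buffer, and scalar<T>() fetches a single value:

    #include <arrayfire.h>
    #include <vector>

    int main() {
        af::array a = af::randu(4, 4);     // 4x4 array of float on the device

        std::vector<float> h(a.elements());
        a.host(h.data());                  // device -> host copy
        // h can now be watched in the debugger or written to disk

        float first = a.scalar<float>();   // copy back just the first element
        af_print(a);                       // ArrayFire's own textual dump
        (void)first;
        return 0;
    }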

What is the difference between OpenCL and OpenGL's compute shader?

可紊 submitted on 2021-02-05 12:54:07
Question: I know OpenCL gives control over the GPU's memory architecture and thus allows better optimization but, leaving this aside, can we use compute shaders for vector operations (addition, multiplication, inversion, etc.)? Answer 1: In contrast to the other OpenGL shader types, compute shaders are not directly related to computer graphics; they provide a much more direct abstraction of the underlying hardware, similar to CUDA and OpenCL. They provide a customizable work group size, shared memory, intra-group
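
To make the comparison concrete, a minimal sketch of a GLSL compute shader for element-wise vector addition, held in a C++ string for later compilation; it assumes the two inputs and the output are SSBOs bound at bindings 0-2 and that the dispatch covers ceil(n / 64) work groups:

    const char *vecAddSrc = R"(
    #version 430
    layout(local_size_x = 64) in;                  // the work group size

    layout(std430, binding = 0) readonly  buffer A { float a[]; };
    layout(std430, binding = 1) readonly  buffer B { float b[]; };
    layout(std430, binding = 2) writeonly buffer C { float c[]; };

    void main() {
        uint i = gl_GlobalInvocationID.x;
        if (i < uint(a.length()))                  // guard the tail
            c[i] = a[i] + b[i];
    }
    )";
    // host side (sketch): glDispatchCompute((n + 63) / 64, 1, 1);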

CUDA error identifier “__stcs” is undefined

拜拜、爱过 submitted on 2021-01-29 16:52:35
Question: I want to use a store function with the cache hint __stcs on a Pascal GPU with CUDA 10.0. The CUDA C++ Programming Guide mentions no header for the data type unsigned long long, and the compiler returns the error identifier "__stcs" is undefined. How can I fix this compilation error? Answer 1: These intrinsics require CUDA 11.0; they are new in CUDA 11.0. If you look at the CUDA 10.0 programming guide you will see that they are not mentioned. You can also see that they are mentioned in the "changes from
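
A minimal sketch of the two options, assuming the goal is a cache-streaming store of an unsigned long long: the intrinsic on CUDA 11.0+, or inline PTX (st.global.cs) as a fallback on CUDA 10.x. The fallback is my own suggestion, not part of the original answer:

    __device__ void store_streaming(unsigned long long *p, unsigned long long v) {
    #if defined(__CUDACC_VER_MAJOR__) && __CUDACC_VER_MAJOR__ >= 11
        __stcs(p, v);   // cache-streaming store intrinsic, new in CUDA 11.0
    #else
        // same hint expressed directly in PTX; p must point to global memory
        asm volatile("st.global.cs.u64 [%0], %1;" :: "l"(p), "l"(v) : "memory");
    #endif
    }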

Process strings from an OpenCL kernel

喜欢而已 submitted on 2021-01-29 07:22:46
Question: There are several strings like std::string first, second, third; ... My plan was to collect their addresses into a char* array: char *addresses[] = {&first[0], &second[0], &third[0]}; ... and pass the resulting char **addresses to the OpenCL kernel. There are several problems and questions: The main issue is that I cannot pass an array of pointers. Is there any good way to use many strings from the kernel code without copying them, leaving them in shared memory instead? I'm using NVIDIA on Windows. So, I
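
A sketch of the usual workaround, which is my own suggestion rather than anything from the thread: since device code cannot dereference host pointers, concatenate the strings into one flat character buffer and pass a second buffer of offsets, so the kernel reconstructs string i as &chars[offsets[i]]:

    #include <cstdint>
    #include <string>
    #include <vector>

    struct FlatStrings {
        std::vector<char>     chars;    // all strings back to back, NUL-separated
        std::vector<uint32_t> offsets;  // offsets[i] = start of string i
    };

    FlatStrings flatten(const std::vector<std::string> &strs) {
        FlatStrings f;
        for (const std::string &s : strs) {
            f.offsets.push_back(static_cast<uint32_t>(f.chars.size()));
            f.chars.insert(f.chars.end(), s.begin(), s.end());
            f.chars.push_back('\0');    // terminator so the kernel can scan
        }
        return f;
    }

    // The host then uploads f.chars and f.offsets as two cl_mem buffers
    // (clCreateBuffer with CL_MEM_COPY_HOST_PTR) and passes both to the kernel.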