gpgpu

Does Apache Mesos recognize GPU cores?

偶尔善良 submitted on 2020-01-13 02:42:10
Question: In slide 25 of this talk by Twitter's Head of Open Source office, the presenter says that Mesos allows one to track and manage even GPU (I assume he meant GPGPU) resources. But I can't find any information on this anywhere else. Can someone please help? Besides Mesos, are there other cluster managers that support GPGPU?

Answer 1: Mesos does not yet provide direct support for (GP)GPUs, but does support custom resource types. If you specify --resources="gpu(*):8" when starting the mesos-slave, then

GPU Shared Memory Bank Conflict

此生再无相见时 submitted on 2020-01-12 01:54:10
Question: I am trying to understand how bank conflicts take place. If I have an array of size 256 in global memory and 256 threads in a single block, and I want to copy the array to shared memory so that every thread copies one element:

shared_a[threadIdx.x] = global_a[threadIdx.x]

does this simple action result in a bank conflict? Now suppose that the size of the array is larger than the number of threads, so I am using this to copy the global memory to the shared memory: tid = threadIdx.x
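
Below is a minimal CUDA sketch (array length and names are hypothetical) of the two copy patterns discussed above. With 4-byte elements, thread k of a warp touches shared-memory bank k % 32, so consecutive threads hit distinct banks and neither pattern by itself causes a bank conflict.

#define N 1024  // assumed total array length

__global__ void copyToShared(const int *global_a, int *out)
{
    __shared__ int shared_a[N];

    // Array length > blockDim.x: each thread copies elements strided by the
    // block size. Every iteration reproduces the one-element-per-thread pattern
    // (shared_a[threadIdx.x] = global_a[threadIdx.x]) at a different offset,
    // so it is equally conflict-free, and the global loads remain coalesced.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        shared_a[i] = global_a[i];

    __syncthreads();
    out[threadIdx.x] = shared_a[threadIdx.x];  // placeholder use of the data
}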

nvidia-smi Volatile GPU-Utilization explanation?

时光总嘲笑我的痴心妄想 submitted on 2020-01-09 03:03:48
Question: I know that nvidia-smi -l 1 will give the GPU usage every second (similarly to the following). However, I would appreciate an explanation of what Volatile GPU-Util really means. Is that the number of used SMs over total SMs, or the occupancy, or something else?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M|
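
The figure nvidia-smi reports comes from NVML's utilization counter: the percentage of time over the last sample period during which at least one kernel was executing, not an SM count or occupancy. A sketch (not from the original question) of reading the same counter programmatically through NVML:

// Build (paths may vary): gcc util.c -lnvidia-ml
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlUtilization_t util;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
        // util.gpu: % of the sample period with a kernel running
        // util.memory: % of the sample period with memory traffic
        printf("GPU-Util: %u%%  Memory-Util: %u%%\n", util.gpu, util.memory);

    nvmlShutdown();
    return 0;
}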

Implement sleep() in OpenCL C [duplicate]

点点圈 submitted on 2020-01-08 04:18:16
Question: This question already has an answer here: Calculate run time of kernel code in OpenCL C (1 answer). Closed 4 years ago.

I want to measure the performance of different devices, viz. CPU and GPUs. This is my kernel code:

__kernel void dataParallel(__global int* A)
{
    sleep(10);
    A[0]=2; A[1]=3; A[2]=5;
    int pnp;     // pnp = probable next prime
    int pprime;  // previous prime
    int i,j;
    for(i=3;i<10;i++) {
        j=0;
        pprime=A[i-1];
        pnp=pprime+2;
        while((j<i) && A[j]<=sqrt((float)pnp)) {
            if(pnp%A[j]==0) { pnp+=2; j=0; }
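
OpenCL C has no sleep() for device code; the linked duplicate measures kernel run time from the host with event profiling instead. A host-side sketch of that approach (a hypothetical helper, assuming the context, device and kernel have already been created; not the asker's code):

#include <stdio.h>
#include <CL/cl.h>

// Returns the kernel's device execution time in nanoseconds.
cl_ulong time_kernel(cl_context ctx, cl_device_id dev,
                     cl_kernel kernel, size_t global_size)
{
    cl_int err;
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
    if (err != CL_SUCCESS) return 0;

    cl_event evt;
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    clReleaseEvent(evt);
    clReleaseCommandQueue(q);
    return end - start;   // nanoseconds of device execution
}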

Why very simple Renderscript runs 3 times slower in GPU than in CPU

谁说胖子不能爱 submitted on 2020-01-07 04:07:26
Question: My test platform:

Development OS: Windows 7 32-bit
Phone: Nexus 5
Phone OS version: Android 4.4
SDK bundle: adt-bundle-windows-x86-20131030
Build-tool version: 19
SDK tool version: 22.3
Platform tool version: 19

I wrote a very simple RenderScript as follows:

#pragma rs_fp_relaxed
uchar4 __attribute__((kernel)) someKernel(uchar4 in, uint32_t x, uint32_t y) {
    return in;
}

I also used adb shell setprop debug.rs.default-CPU-driver 1 to force the script to run on the CPU for performance comparison. I

Dependencies on cutil when using CUDA 5.0

流过昼夜 submitted on 2020-01-05 05:51:13
Question: When I run the make command to compile a CUDA program under 64-bit Linux, I receive the following error message:

error: cutil.h: No such file or directory

I found some answers, but none of them were useful. In the makefile there is a CUDA_SDK_PATH, but I cannot find anything useful about the SDK in the CUDA Getting Started Guide: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html How should I set CUDA_SDK_PATH?

Answer 1: If you are planning on using CUDA 5 or later,
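
cutil.h was only a convenience header shipped with the old GPU Computing SDK samples, never part of the toolkit, and the CUDA 5 samples dropped it. One common way (a sketch, not toolkit code) to remove the dependency is to replace its error-checking wrappers with a small macro of your own:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err__ = (call);                                     \
        if (err__ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err__), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

int main()
{
    int *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 256 * sizeof(int)));  // was cutilSafeCall(...)
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}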

How does a GPU group threads into warps/wavefronts?

巧了我就是萌 submitted on 2020-01-04 09:38:27
Question: My understanding is that a warp is a group of threads defined at runtime through the task scheduler, and one performance-critical part of CUDA is the divergence of threads within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block? For instance, if I start a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell that (or at least make a good guess) from the thread index? By doing this, one can minimize
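
A small CUDA sketch of how the warp composition follows from the thread index: the block is linearized x-fastest, and consecutive groups of 32 linear IDs form one warp (warp 0 holds IDs 0..31, warp 1 holds 32..63, and so on); the scheduling order of warps is not guaranteed, but this partitioning is.

#include <cstdio>

__device__ unsigned linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

__global__ void showWarpLayout()
{
    unsigned tid  = linearThreadId();
    unsigned warp = tid / warpSize;   // which warp this thread belongs to
    unsigned lane = tid % warpSize;   // its position inside the warp

    // For a 1024-thread 1-D block this gives warps {0..31}, {32..63}, ...,
    // so branching on (tid / warpSize) diverges between warps, not inside one.
    if (lane == 0)
        printf("thread %4u is lane 0 of warp %2u\n", tid, warp);
}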

Performing many small matrix operations in parallel in OpenCL

旧城冷巷雨未停 submitted on 2020-01-04 06:52:55
Question: I have a problem that requires me to do eigendecomposition and matrix multiplication of many (~4k) small (~3x3) square Hermitian matrices. In particular, I need each work item to perform the eigendecomposition of one such matrix and then perform two matrix multiplications. Thus, the work that each thread has to do is rather minimal, and the full job should be highly parallelizable. Unfortunately, it seems all the available OpenCL LAPACKs are for delegating operations on large matrices to the GPU
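
A layout sketch only (hypothetical names, OpenCL C) of the one-work-item-per-matrix batching the question describes: each 3x3 complex Hermitian matrix is stored as nine float2 (re, im) values in a flat buffer, each work item loads its matrix into private memory, and the per-matrix eigendecomposition and products would run there. A trivial trace computation stands in for that work here.

__kernel void per_matrix_op(__global const float2 *matrices,
                            __global float *traces,
                            const uint num_matrices)
{
    uint m = get_global_id(0);
    if (m >= num_matrices) return;

    // Copy this work-item's 3x3 matrix into private memory.
    float2 A[9];
    for (int k = 0; k < 9; ++k)
        A[k] = matrices[m * 9 + k];

    // Placeholder for the real per-matrix work (eigendecomposition, products).
    traces[m] = A[0].x + A[4].x + A[8].x;  // diagonal of a Hermitian matrix is real
}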

Some child grids not being executed with CUDA Dynamic Parallelism

我与影子孤独终老i submitted on 2020-01-04 05:19:08
Question: I'm experimenting with the new Dynamic Parallelism feature in CUDA 5.0 (GK110). I face the strange behavior that my program does not return the expected result for some configurations: not only is the result unexpected, it also differs with each launch. Now I think I have found the source of my problem: it seems that some child grids (kernels launched by other kernels) are sometimes not executed when too many child grids are spawned at the same time. I wrote a little test program to illustrate
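
One cause consistent with this symptom (an assumption here, since the post is truncated) is exhausting the device runtime's pending-launch buffer: child launches then fail, and the failure is only visible if the parent kernel checks for it. A sketch of detecting missing child grids and raising the limit from the host:

// Build (CC >= 3.5): nvcc -arch=sm_35 -rdc=true test.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

__global__ void childKernel(int *ranCount)
{
    if (threadIdx.x == 0)
        atomicAdd(ranCount, 1);            // mark that this child grid ran
}

__global__ void parentKernel(int *ranCount)
{
    childKernel<<<1, 32>>>(ranCount);

    // Device-side check: if the pending-launch buffer is full, the launch
    // fails here instead of silently "skipping" the child.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("child launch failed in thread %d (error %d)\n",
               blockIdx.x * blockDim.x + threadIdx.x, (int)err);
}

int main()
{
    // Allow more outstanding child grids before launches start to fail.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 8192);

    int *ranCount;
    cudaMalloc((void **)&ranCount, sizeof(int));
    cudaMemset(ranCount, 0, sizeof(int));

    parentKernel<<<32, 128>>>(ranCount);   // 4096 parents -> 4096 children
    cudaDeviceSynchronize();

    int ran = 0;
    cudaMemcpy(&ran, ranCount, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d of %d child grids ran\n", ran, 32 * 128);
    cudaFree(ranCount);
    return 0;
}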

How to efficiently gather data from threads in CUDA?

半世苍凉 submitted on 2020-01-03 09:24:13
Question: I have an application that solves a system of equations in CUDA. I know for sure that each thread can find up to 4 solutions, but how can I copy them back to the host? Right now I'm passing a huge array with enough space for all threads to store 4 solutions (4 doubles for each solution), plus another array with the number of solutions per thread. However, that's a naive solution and is the current bottleneck of my kernel. I would really like to optimize this. The main problem is concatenating a variable number of
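
A common alternative (a CUDA sketch with hypothetical names, not necessarily what the eventual answer proposed) is stream compaction with a global atomic counter: each thread reserves exactly as many output slots as solutions it found, so the result array is dense and only the used prefix is copied back.

#include <cuda_runtime.h>

struct Solution { double v[4]; };     // one solution = 4 doubles, as in the post

__global__ void solveAndGather(Solution *out, unsigned int *outCount, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Solution local[4];
    int found = 0;
    // ... solve the equations, filling local[0..found-1], with found <= 4 ...

    if (found > 0) {
        // Reserve 'found' consecutive slots in the global output.
        unsigned int base = atomicAdd(outCount, (unsigned int)found);
        for (int i = 0; i < found; ++i)
            out[base + i] = local[i];
    }
}

// Host side: zero *outCount before the launch, then copy back outCount first
// and only that many Solution entries afterwards.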