GPGPU

Retaining dot product on GPGPU using CUBLAS routine

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-17 20:53:57
Question: I am writing code to compute the dot product of two vectors using the CUBLAS dot-product routine, but it returns the value in host memory. I want to use the dot product for further computation on the GPGPU only. How can I make the value reside on the GPGPU and use it for further computations without an explicit copy from the CPU to the GPGPU?

Answer 1: You can't, exactly, using CUBLAS. As per talonmies' answer, starting with the CUBLAS V2 API (CUDA 4.0) the return value can be a device pointer. Refer to
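
A minimal sketch of that approach, assuming the CUBLAS V2 API (cublas_v2.h) and omitting error checking; the function name and arguments are illustrative:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Keep the dot-product result in device memory instead of copying it to the host.
    void dotOnDevice(const float *d_x, const float *d_y, float *d_result, int n)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        // Ask CUBLAS to write scalar results through device pointers.
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

        // d_result is a device pointer; nothing is copied back to the host.
        cublasSdot(handle, n, d_x, 1, d_y, 1, d_result);

        cublasDestroy(handle);
    }

The value left in d_result can then be consumed directly by a subsequent kernel or CUBLAS call on the same device.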

Can string data types be used in C++ CUDA kernels?

Submitted by 為{幸葍}努か on 2019-12-17 20:41:12
Question: I am writing a CUDA kernel in which I'm using the C++ string data type. However, the compiler throws the following error:

error: calling a host function("std::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator =") from a __device__/__global__ function("doDecompression") is not allowed

Are strings not allowed within a kernel? If not, what is the workaround to allocate space for a char array within a kernel?

Answer 1: You cannot use the C++ string type in a kernel
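
A sketch of the usual workaround, using a plain fixed-size char buffer in device code; the kernel body and buffer size below are illustrative, not taken from the question:

    // std::string needs host-side allocation, so device code works on raw char arrays.
    __global__ void doDecompression(const char *in, char *out, int n)
    {
        char scratch[64];                               // per-thread fixed-size buffer
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            scratch[0] = in[i];                         // plain char operations are fine in device code
            out[i] = scratch[0];
        }
    }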

Double precision floating point in CUDA

Submitted by 喜欢而已 on 2019-12-17 19:13:43
Question: Does CUDA support double-precision floating point numbers? Also, what are the reasons behind this?

Answer 1: If your GPU has compute capability 1.3 then you can do double precision. You should be aware, though, that 1.3 hardware has only one double-precision FP unit per MP, which has to be shared by all the threads on that MP, whereas there are 8 single-precision FPUs, so each active thread has its own single-precision FPU. In other words, you may well see 8x worse performance with double precision
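
A small sketch of how the compute capability can be queried at runtime (error checking omitted); the 1.3 threshold follows from the answer above:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);       // device 0, for illustration

        bool hasDouble = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("%s: compute capability %d.%d, double precision %s\n",
               prop.name, prop.major, prop.minor,
               hasDouble ? "supported" : "not supported");
        return 0;
    }

Note also that with older toolchains the code has to be compiled for an architecture that has double-precision units (e.g. nvcc -arch=sm_13), otherwise doubles are demoted to floats.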

Branch predication on GPU

Submitted by 早过忘川 on 2019-12-17 16:08:03
Question: I have a question about branch predication in GPUs. As far as I know, GPUs handle branches with predication. For example, I have code like this:

    if (C)
        A
    else
        B

If A takes 40 cycles and B takes 50 cycles to finish execution, and assuming that for one warp both A and B are executed, does it take 90 cycles in total to finish this branch? Or do they overlap A and B, i.e. when some instructions of A are executed, then wait for a memory request, then some instructions of B are executed, then
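
An illustrative CUDA kernel (not from the question) showing the divergent case described above; when threads of one warp take different paths, the hardware masks off the inactive threads and runs the two sides one after the other:

    __global__ void divergent(const int *c, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (c[i]) {
            out[i] = out[i] * 2.0f + 1.0f;   // path A
        } else {
            out[i] = out[i] * 0.5f - 1.0f;   // path B
        }
        // If both paths are taken within a warp, the warp's cost is roughly
        // cost(A) + cost(B), since the inactive threads are predicated off
        // rather than skipped.
    }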

How to use OpenCL on Android?

Submitted by 亡梦爱人 on 2019-12-17 15:32:44
Question: For platform independence (desktop, cloud, mobile, ...) it would be great to use OpenCL for GPGPU development when speed matters. I know Google pushes RenderScript as an alternative, but it seems to be available only for Android and is unlikely to ever be included in iOS. Therefore I am looking for a way to execute OpenCL code within Android apps.

Answer 1: The only Android devices I know that support OpenCL are the ones based on the Mali T600 family of chips (article here). They have an
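
A sketch of one common approach, assuming the device ships a vendor libOpenCL.so (the library is not part of the NDK, so it is loaded dynamically and the entry points are resolved by hand):

    #include <dlfcn.h>
    #include <stdio.h>
    #include <CL/cl.h>

    typedef cl_int (*clGetPlatformIDs_fn)(cl_uint, cl_platform_id *, cl_uint *);

    int main(void)
    {
        void *lib = dlopen("libOpenCL.so", RTLD_NOW);
        if (!lib) {
            printf("No OpenCL library found on this device\n");
            return 1;
        }
        clGetPlatformIDs_fn getPlatforms =
            (clGetPlatformIDs_fn) dlsym(lib, "clGetPlatformIDs");
        cl_uint numPlatforms = 0;
        if (getPlatforms && getPlatforms(0, NULL, &numPlatforms) == CL_SUCCESS)
            printf("OpenCL platforms available: %u\n", numPlatforms);
        dlclose(lib);
        return 0;
    }

Only the OpenCL headers are needed at compile time; whether the call succeeds at runtime depends entirely on the device's driver.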

how to use gpu::Stream in OpenCV?

Submitted by 给你一囗甜甜゛ on 2019-12-17 13:28:24
Question: OpenCV has a gpu::Stream class that encapsulates a queue of asynchronous calls, and some functions have overloads with an additional gpu::Stream parameter. Aside from the gpu-basics-similarity.cpp sample code, there is very little information in the OpenCV documentation on how and when to use gpu::Stream. For example, it is not very clear (to me) what exactly gpu::Stream::enqueueConvert or gpu::Stream::enqueueCopy do, or how to use gpu::Stream as the additional overload parameter. I'm looking for some
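
A minimal sketch of the pattern, assuming the OpenCV 2.x gpu module; every call queued on the same stream runs asynchronously with respect to the host until waitForCompletion() is called:

    #include <opencv2/opencv.hpp>
    #include <opencv2/gpu/gpu.hpp>

    void asyncPipeline(const cv::Mat &h_src, cv::Mat &h_dst)
    {
        cv::gpu::Stream stream;
        cv::gpu::GpuMat d_src, d_gray, d_dst;

        stream.enqueueUpload(h_src, d_src);                          // async host -> device copy
        cv::gpu::cvtColor(d_src, d_gray, CV_BGR2GRAY, 0, stream);    // queued on the same stream
        cv::gpu::threshold(d_gray, d_dst, 128, 255, cv::THRESH_BINARY, stream);
        stream.enqueueDownload(d_dst, h_dst);                        // async device -> host copy

        stream.waitForCompletion();                                  // block until the queue drains
    }

enqueueConvert and enqueueCopy are the asynchronous counterparts of GpuMat::convertTo and copyTo, queued on the stream in the same way.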

How does CUDA assign device IDs to GPUs?

Submitted by 余生颓废 on 2019-12-17 07:30:53
Question: When a computer has multiple CUDA-capable GPUs, each GPU is assigned a device ID. By default, CUDA kernels execute on device ID 0. You can use cudaSetDevice(int device) to select a different device. Let's say I have two GPUs in my machine: a GTX 480 and a GTX 670. How does CUDA decide which GPU is device ID 0 and which GPU is device ID 1? Ideas for how CUDA might assign device IDs (just brainstorming):

- descending order of compute capability
- PCI slot number
- date/time when the device was
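
A sketch of how to inspect the ordering and pick a device explicitly rather than relying on the default (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        int best = 0, bestMajor = -1, bestMinor = -1;
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("device %d: %s (compute capability %d.%d)\n",
                   dev, prop.name, prop.major, prop.minor);
            if (prop.major > bestMajor ||
                (prop.major == bestMajor && prop.minor > bestMinor)) {
                best = dev; bestMajor = prop.major; bestMinor = prop.minor;
            }
        }
        cudaSetDevice(best);   // subsequent kernels launch on the chosen device
        return 0;
    }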

Does it make sense to use an unsigned short integer for registers and shared memory?

Submitted by 怎甘沉沦 on 2019-12-14 03:58:30
Question: Does it make sense to use an unsigned short integer for registers (to save register space) and shared memory (for faster access) in CUDA programs? I created a template device function (using registers and shared memory) and specialized it for uint and ushort. The results:

    For uint: 25 registers and a speed of 460 MB/sec.
    For ushort: 26 registers and a speed of 420 MB/sec.

So there seems to be no reason to use unsigned short int.

Answer 1: I don't have much experience with CUDA, but I've read that we should avoid using
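
An illustrative sketch of such a templated kernel (the original code is not shown in the question); note that GPU registers are 32 bits wide, so a ushort value still occupies a full register, which matches the measurements above:

    #include <cuda_runtime.h>

    template <typename T>
    __global__ void scaleKernel(const T *in, T *out, int n)
    {
        __shared__ T tile[256];                      // blockDim.x assumed to be 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            tile[threadIdx.x] = in[i];
        __syncthreads();                             // every thread reaches the barrier
        if (i < n)
            out[i] = static_cast<T>(tile[threadIdx.x] * T(2));
    }

    // The two instantiations compared in the question.
    template __global__ void scaleKernel<unsigned int>(const unsigned int *, unsigned int *, int);
    template __global__ void scaleKernel<unsigned short>(const unsigned short *, unsigned short *, int);

Narrower types mainly pay off by halving global-memory traffic, not by reducing register pressure.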

Read float values from RGBAFloat texture in Unity 3D

Submitted by 北慕城南 on 2019-12-14 03:58:24
Question: It seems people aren't discussing floating point textures much. I used them to do some computations and then forward the result to another surface shader (to obtain some specific deformations), and that works well as long as I consume the results in a shader. This time, however, I need to get those values on the CPU side as a float[] array with the results (just after calling Graphics.Blit, which fills the floating point texture). How can this be achieved? On a side note: the only guy