gpgpu

Fast Fourier transforms on GPU on iOS

£可爱£侵袭症+ submitted on 2019-12-05 05:18:00
I am implementing compute-intensive applications for iOS (i.e., iPhone or iPad) that heavily use fast Fourier transforms (and some signal-processing operations such as interpolation and resampling). What are the best libraries and APIs for running FFTs on iOS? I have briefly looked into Apple Metal as well as Apple vDSP. I wasn't sure whether vDSP utilizes the GPU, although it appears to be highly parallelized and to use SIMD. Metal seems to allow access to the GPU for compute-intensive apps, but I was not able to find libraries for FFT and basic signal-processing operations (something like

OpenCL dynamic parallelism / GPU-spawned threads?

做~自己de王妃 submitted on 2019-12-05 03:27:09
Question: CUDA 5 has just been released, and with it the ability to spawn GPU threads from within another GPU (main?) thread, minimising the round-trips between CPU and GPU that we have seen thus far. What plans are there to support GPU-spawned threads in the OpenCL arena? As I cannot afford to opt for a closed standard (my user base is "everygamer"), I need to know when OpenCL will be ready for prime time in this regard. Answer 1: The OpenCL standard usually lags a step behind CUDA (except for the device partitioning feature)
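For reference, the CUDA 5 feature the question alludes to is "dynamic parallelism": on compute capability 3.5+ devices, a kernel may launch child grids itself. A hedged sketch (kernel names invented; builds with something like `nvcc -arch=sm_35 -rdc=true` and links against `cudadevrt`):

```cuda
#include <cstdio>

// Child grid, launched from the GPU rather than from the host.
__global__ void child(int parent_block) {
    printf("child thread %d spawned by parent block %d\n",
           threadIdx.x, parent_block);
}

// Parent kernel: with dynamic parallelism, each block can enqueue
// further work without a round-trip through the CPU.
__global__ void parent() {
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(blockIdx.x);     // device-side kernel launch
        cudaDeviceSynchronize();         // wait for the child grid
    }
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

On the OpenCL side, the comparable capability (device-side enqueue via `enqueue_kernel`) eventually arrived with OpenCL 2.0 in 2013.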

Printing from CUDA kernels

。_饼干妹妹 submitted on 2019-12-05 02:28:18
I am writing a CUDA program and trying to print something inside the CUDA kernels using the printf function, but when I compile the program I get the errors: error : calling a host function("printf") from a __device__/__global__ function("agent_movement_top") is not allowed error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2008 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing
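The error comes from building for compute_10/sm_10 (visible in the nvcc command line above): device-side printf only exists on compute capability 2.0 and higher. A minimal sketch of the fix, reusing the kernel name from the error message and assuming a Fermi-class or newer GPU:

```cuda
#include <cstdio>  // nvcc maps printf inside kernels to the device printf

__global__ void agent_movement_top() {
    // In-kernel printf requires compute capability 2.0+ (Fermi or later).
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Build for a printf-capable architecture, e.g.:
    //   nvcc -gencode=arch=compute_20,code=sm_20 agents.cu
    // (the failing build above targeted compute_10/sm_10)
    agent_movement_top<<<2, 4>>>();
    cudaDeviceSynchronize();  // flushes the device-side printf buffer
    return 0;
}
```

On pre-Fermi hardware (sm_1x) there is no in-kernel printf at all; the usual fallbacks are the cuPrintf helper from the SDK or copying values back to the host for inspection.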

In OpenCL, what does mem_fence() do, as opposed to barrier()?

邮差的信 submitted on 2019-12-05 00:50:14
Unlike barrier() (which I think I understand), mem_fence() does not affect all items in the work-group. The OpenCL spec says (section 6.11.10) of mem_fence(): "Orders loads and stores of a work-item executing a kernel" (so it applies to a single work-item). But at the same time, section 3.3.1 says: "Within a work-item memory has load / store consistency" — so within a work-item, memory is already consistent. So what kind of thing is mem_fence() useful for? It doesn't work across items, yet isn't needed within an item... Note that I haven't used atomic operations (section 9.5 etc.). Is
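The resolution is that "consistency within a work-item" only constrains what that work-item reads back itself; mem_fence() constrains the order in which its stores become visible to *other* work-items, e.g. in a data-then-flag handshake. A hedged sketch of that pattern in CUDA terms, where __threadfence_block() plays the role of mem_fence(CLK_GLOBAL_MEM_FENCE) (producer and consumer are deliberately placed in different warps to avoid SIMT lockstep issues):

```cuda
// Sketch only: a one-shot data-then-flag handshake inside one block.
__global__ void handshake(volatile int *data, volatile int *flag) {
    if (threadIdx.x == 0) {                  // producer (warp 0)
        *data = 42;                          // (1) publish the payload
        __threadfence_block();               // order (1) before (2)
        *flag = 1;                           // (2) raise the flag
    } else if (threadIdx.x == 32) {          // consumer (warp 1)
        while (*flag == 0) { }               // spin until flagged
        int v = *data;                       // now sees the payload
        (void)v;
    }
}
```

Without the fence, nothing stops the flag store from becoming visible before the data store; the consumer's own loads remain perfectly consistent with each other and can still observe stale data.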

Is there any guarantee that all threads in a WaveFront (OpenCL) are always synchronized?

雨燕双飞 submitted on 2019-12-04 19:52:40
As is well known, there are warps (in CUDA) and wavefronts (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf Warps in CUDA: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture 4.1. SIMT Architecture ... A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete
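In practice the safe answer is no: lockstep execution is a hardware behaviour, not a programming-model guarantee (and NVIDIA's later Volta architecture made warp threads independently schedulable, breaking it outright). A hedged CUDA sketch of the defensive pattern for an intra-warp reduction, assuming a single 32-thread block:

```cuda
// Tree reduction across one warp without relying on implicit lockstep.
__global__ void warp_reduce(float *out, const float *in) {
    __shared__ float s[32];
    int lane = threadIdx.x & 31;
    s[lane] = in[threadIdx.x];

    for (int offset = 16; offset > 0; offset >>= 1) {
        __syncwarp();                     // CUDA 9+: explicit warp sync
        float v = (lane < offset) ? s[lane + offset] : 0.0f;
        __syncwarp();                     // reads finish before writes
        if (lane < offset) s[lane] += v;
    }
    if (lane == 0) *out = s[0];
}
```

On older toolkits the conservative choice is __syncthreads(); OpenCL 1.x has no warp-level primitive at all, so barrier(CLK_LOCAL_MEM_FENCE) is the portable equivalent.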

Manually set a 1D Texture in Metal

佐手、 submitted on 2019-12-04 19:17:52
I'm trying to fill a 1D texture with values manually and pass that texture to a compute shader (these are 2 pixels that I want to set via code; they don't represent any image). Because there are currently few Metal examples, all the examples I could find deal with 2D textures and load the texture by converting a loaded UIImage to raw bytes, but creating a dummy UIImage felt like a hack to me. This is the "naive" way I started with - ... var manualTextureData: [Float] = [ 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0 ]; let region: MTLRegion = MTLRegionMake1D(0, textureDescriptor.width);

Why should I use the CUDA Driver API instead of CUDA Runtime API?

落爺英雄遲暮 submitted on 2019-12-04 14:29:16
Question: Why should I use the CUDA Driver API, and in which cases can I not use the CUDA Runtime API (which is more convenient than the Driver API)? Answer 1: The runtime API is a higher-level abstraction over the driver API and is usually easier to use (the performance gap should be minimal). The driver API is handle-based and provides a higher degree of control; the runtime API, by contrast, is easier to use (e.g. you can use the kernel<<<>>> launch syntax). That "higher degree of control" means
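To make the trade-off concrete, here is a hedged sketch of the driver-API boilerplate that the runtime API hides (module and kernel names invented; error checking elided):

```cuda
#include <cuda.h>

int main() {
    cuInit(0);                               // explicit initialisation

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;                           // contexts managed by hand
    cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;                            // PTX/cubin loaded explicitly
    cuModuleLoad(&mod, "kernels.ptx");

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "my_kernel");

    int n = 1024;
    void *args[] = { &n };                   // arguments marshalled by hand
    cuLaunchKernel(fn, 4, 1, 1,              // grid dimensions
                       256, 1, 1,            // block dimensions
                       0, NULL, args, NULL); // smem, stream, params, extra

    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}
```

The runtime-API equivalent of all of the above is a single my_kernel<<<4, 256>>>(n); the price of the extra control is loading modules, managing contexts, and marshalling arguments yourself.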

How to configure OpenCL in Visual Studio 2010 for NVIDIA's GPU on Windows?

六月ゝ 毕业季﹏ submitted on 2019-12-04 14:23:23
I am using NVIDIA's GeForce GTX 480 GPU on the Windows 7 operating system on my ASUS laptop. I have already configured Visual Studio 2010 for CUDA 4.2. How do I configure OpenCL for NVIDIA's GPU in Visual Studio 2010? I have tried every possible way. Is there any way to use the CUDA Toolkit (CUDA 4.2) and NVIDIA's GPU Computing SDK to program OpenCL? If yes, then how? If not, what is the alternative? KLee1: Yes, you should be able to use Visual Studio 2010 to program for OpenCL. It should simply be a case of making sure that you have the right include directories and libraries set up. Take a look
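In outline, the project settings look roughly like the following (paths hedged against the CUDA 4.2-era layout; verify them on your install, since the toolkit and the GPU Computing SDK both ship the OpenCL headers and OpenCL.lib):

```
C/C++ → General → Additional Include Directories:
    $(CUDA_PATH)\include            (contains CL\cl.h)
Linker → General → Additional Library Directories:
    $(CUDA_PATH)\lib\Win32          (or lib\x64 for 64-bit builds)
Linker → Input → Additional Dependencies:
    OpenCL.lib
```

No CUDA-specific build customisation is needed for a pure OpenCL project; as far as Visual Studio is concerned, the OpenCL API is an ordinary C library.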

Why do those Google image-processing Renderscript samples run slower on the GPU on the Nexus 5?

三世轮回 submitted on 2019-12-04 13:16:34
Question: I'd like to thank Stephen for the very quick reply in a previous post. This is a follow-up question to the post "Why very simple Renderscript runs 3 times slower in GPU than in CPU". My dev platform is as follows: development OS: Windows 7 32-bit; phone: Nexus 5; phone OS version: Android 4.4; SDK bundle: adt-bundle-windows-x86-20131030; build-tool version: 19; SDK tool version: 22.3; platform tool version: 19. In order to evaluate the performance of Renderscript GPU compute and to grasp the general

How many 'CUDA cores' does each multiprocessor of a GPU have?

柔情痞子 submitted on 2019-12-04 12:12:07
Question: I know that devices before the Fermi architecture had 8 SPs in a single multiprocessor. Is the count the same in the Fermi architecture? Answer 1: The number of multiprocessors (MPs) and the number of cores per MP can be found by executing DeviceQuery.exe. It is found in the %NVSDKCOMPUTE_ROOT%/C/bin directory of the GPU Computing SDK installation. A look at the code of DeviceQuery (found in %NVSDKCOMPUTE_ROOT%/C/src/DeviceQuery) reveals that the number of cores is calculated by passing the x.y CUDA