gpgpu

What is the current status of C++ AMP [closed]

核能气质少年 submitted on 2019-12-20 09:29:20
Question (closed as not focused enough): I am working on high-performance code in C++ and have been using both CUDA and OpenCL, and more recently C++ AMP, which I like very much. I am, however, a little worried that it is not being developed and extended and will die out. What leads me to this thought is that even the MS C+

CUDA model - what is warp size?

筅森魡賤 submitted on 2019-12-20 08:25:20
Question: What's the relationship between maximum work group size and warp size? Let's say my device has 240 CUDA streaming processors (SPs) and returns the following information: CL_DEVICE_MAX_COMPUTE_UNITS: 30, CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64, CL_DEVICE_MAX_WORK_GROUP_SIZE: 512, CL_NV_DEVICE_WARP_SIZE: 32. This means it has eight SPs per streaming multiprocessor (that is, per compute unit). Now how is warp size = 32 related to these numbers? Answer 1: Direct answer: Warp size is the number of
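On an NVIDIA device the same limits can also be read straight from the CUDA runtime. A minimal sketch (device 0 assumed) where warpSize, multiProcessorCount, and maxThreadsPerBlock correspond to CL_NV_DEVICE_WARP_SIZE, CL_DEVICE_MAX_COMPUTE_UNITS, and CL_DEVICE_MAX_WORK_GROUP_SIZE respectively:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        // warpSize matches CL_NV_DEVICE_WARP_SIZE, multiProcessorCount matches
        // CL_DEVICE_MAX_COMPUTE_UNITS, maxThreadsPerBlock matches CL_DEVICE_MAX_WORK_GROUP_SIZE.
        printf("warp size            : %d\n", prop.warpSize);
        printf("multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }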

OpenGL vs. OpenCL, which to choose and why?

六月ゝ 毕业季﹏ submitted on 2019-12-20 07:58:49
Question: What features make OpenCL worth choosing over OpenGL with GLSL for computation? Despite the graphics-related terminology and impractical data types, is there any real caveat to OpenGL? For example, parallel function evaluation can be done by rendering to a texture using other textures as input. Reduction operations can be done by iteratively rendering to smaller and smaller textures. On the other hand, random write access is not possible in any efficient manner (the only way to do it is rendering
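As a contrast to the render-to-texture techniques described above, the snippet below shows the kind of scattered (random-location) write that OpenCL expresses directly and that a GLSL fragment shader cannot do, because a fragment can only write to its own output position. This is a minimal sketch with the kernel source held in a C++ raw string literal; the names scatter, values, indices, and out are illustrative.

    // OpenCL C kernel source embedded in C++: each work-item writes to an
    // arbitrary output index, something a fragment shader cannot express.
    static const char* scatter_src = R"CLC(
    __kernel void scatter(__global const float* values,
                          __global const int*   indices,
                          __global float*       out) {
        size_t i = get_global_id(0);
        out[indices[i]] = values[i];   // random write location per work-item
    }
    )CLC";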

error: asm operand type size(1) does not match type/size implied by constraint 'r'. On Duane Merrill's GPU radix sort

安稳与你 submitted on 2019-12-20 07:22:40
Question: I get an error when trying to compile Merrill's radix sort under Windows XP + VS2005: error: asm operand type size(1) does not match type/size implied by constraint 'r'. It occurs in the following code: #define B40C_DEFINE_GLOBAL_LOAD(base_type, dest_type, short_type, ptx_type, reg_mod)\ asm("ld.global.cg."#ptx_type" %0, [%1];" : "="#reg_mod(dest) : _B40C_ASM_PTR_(d_ptr + offset));\ ... B40C_DEFINE_GLOBAL_LOAD(char, signed char, char, s8, r) Thanks. Answer 1: This would appear to be caused by

Calculate run time of kernel code in OpenCL C

梦想与她 submitted on 2019-12-20 06:17:15
Question: I want to measure the performance (that is, the run time) of my kernel code on various devices, viz. CPUs and GPUs. The kernel code that I wrote is: __kernel void dataParallel(__global int* A) { sleep(10); A[0]=2; A[1]=3; A[2]=5; int pnp;//pnp=probable next prime int pprime;//previous prime int i,j; for(i=3;i<500;i++) { j=0; pprime=A[i-1]; pnp=pprime+2; while((j<i) && A[j]<=sqrt((float)pnp)) { if(pnp%A[j]==0) { pnp+=2; j=0; } j++; } A[i]=pnp; } } However I have been told that it is not possible to use
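A common way to measure kernel run time on any OpenCL device is to create the command queue with profiling enabled and read the start/end timestamps from the kernel's event. The sketch below is host-side only and assumes the context, device, and kernel (dataParallel from the question) already exist; time_kernel_ms is an illustrative helper name.

    #include <CL/cl.h>

    // Enqueues the kernel once and returns its device-side run time in milliseconds.
    double time_kernel_ms(cl_context ctx, cl_device_id device,
                          cl_kernel kernel, size_t global_size) {
        cl_int err;
        cl_command_queue queue =
            clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, &err);

        cl_event evt;
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                     NULL, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong start = 0, end = 0;   // timestamps in nanoseconds
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

        clReleaseEvent(evt);
        clReleaseCommandQueue(queue);
        return (end - start) * 1e-6;   // nanoseconds to milliseconds
    }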

Any good resources on design patterns for parallel architectures?

时光怂恿深爱的人放手 submitted on 2019-12-20 04:13:17
Question: A bit of background: I am getting started with GPGPU (OpenCL), and I am using a Java wrapper (jogamp.jocl) hoping that it will give me a way to abstract away the low-level nitty-gritty and use standard OOP at higher levels. I can see already from the various Hello World examples that I'll have to manage the queues myself. My question: are there any known patterns for GPGPU, or good resources (as in books) on design patterns for massively parallel architectures in general? My focus is on

OpenCL kernel error on Mac OS X

穿精又带淫゛_ submitted on 2019-12-20 03:28:18
Question: I wrote some OpenCL code which works fine on Linux, but it fails with errors on Mac OS X. Can someone please help me identify why these occur? The kernel code is shown after the error. My kernel uses double, so I have the corresponding pragma at the top, but I don't know why the error refers to a float data type: inline float8 __OVERLOAD__ _name(float8 x) { return _default_name(x); } \ ^ /System/Library/Frameworks/OpenCL.framework/Versions/A/lib/clang/3.2/include/cl_kernel.h:4606:30:
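Since the kernel depends on double, one useful first check (a host-side sketch; device_supports_fp64 is an illustrative helper name) is whether the Mac OS X device actually advertises the cl_khr_fp64 extension that the pragma enables. If it does not, a double kernel that builds on Linux can still fail to build there, which helps separate an fp64-support problem from, say, a clash with a built-in function name.

    #include <CL/cl.h>
    #include <cstring>

    // Returns 1 if the device advertises cl_khr_fp64, 0 otherwise.
    int device_supports_fp64(cl_device_id device) {
        char extensions[4096] = {0};
        clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                        sizeof(extensions), extensions, NULL);
        return strstr(extensions, "cl_khr_fp64") != NULL;
    }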

Do the threads in a CUDA warp execute in parallel on a multiprocessor?

时光毁灭记忆、已成空白 submitted on 2019-12-20 03:26:08
Question: A warp is 32 threads. Do the 32 threads execute in parallel on a multiprocessor? If the 32 threads are not executing in parallel, then there is no race condition within the warp. I got this doubt after going through some examples. Answer 1: In the CUDA programming model, all the threads within a warp run in parallel. But the actual execution in hardware may not be parallel, because the number of cores within an SM (streaming multiprocessor) can be less than 32. For example, the GT200 architecture has 8 cores
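Whether or not the hardware issues all 32 lanes in the same clock cycle, the portable way to exchange data between threads of a warp is still to synchronize explicitly. A minimal sketch (hypothetical kernel name warp_neighbor; __syncwarp() requires CUDA 9 or later; launched as a single 32-thread block):

    __global__ void warp_neighbor(int *out) {
        __shared__ int buf[32];
        int lane = threadIdx.x;            // assumes blockDim.x == 32, i.e. one warp
        buf[lane] = lane * lane;           // each lane writes its own slot
        __syncwarp();                      // make the writes visible to the whole warp
        out[lane] = buf[(lane + 1) % 32];  // read the neighbor's value without a race
    }
    // launch example: warp_neighbor<<<1, 32>>>(d_out);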

What device number should I use (0 or 1) to copy P2P (GPU0->GPU1)?

一笑奈何 submitted on 2019-12-19 11:48:56
Question: Which device number, 0 or 1, must I pass to cudaSetDevice() in order to copy P2P (GPU0->GPU1) using cudaStreamCreate(stream); cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);? Code: // Set device 0 as current cudaSetDevice(0); float* p0; size_t size = 1024 * sizeof(float); // Allocate memory on device 0 cudaMalloc(&p0, size); // Set device 1 as current cudaSetDevice(1); float* p1; // Allocate memory on device 1 cudaMalloc(&p1, size); // Set device 0 as current cudaSetDevice(0); // Launch
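Note that cudaMemcpyPeerAsync() already names the source and destination devices explicitly, and the device that is current when cudaStreamCreate() is called is the one the stream belongs to. A minimal sketch of the whole flow, including enabling peer access first (copy_p2p is an illustrative helper name; error checking omitted):

    #include <cuda_runtime.h>

    // Copies 'size' bytes from p0 (allocated on device 0) to p1 (allocated on device 1).
    void copy_p2p(float* p1, float* p0, size_t size) {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);        // can device 0 map device 1's memory?
        cudaDeviceCanAccessPeer(&can10, 1, 0);        // can device 1 map device 0's memory?

        cudaSetDevice(0);
        if (can01) cudaDeviceEnablePeerAccess(1, 0);  // direct access; otherwise the copy is staged via the host
        cudaSetDevice(1);
        if (can10) cudaDeviceEnablePeerAccess(0, 0);

        cudaSetDevice(0);                             // the stream below is created on device 0
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);  // dst on device 1, src on device 0
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }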

In a GLSL fragment shader, how do I access a texel at a specific mipmap level?

爱⌒轻易说出口 submitted on 2019-12-19 07:23:47
Question: I am using OpenGL to do some GPGPU computations through the combination of one vertex shader and one fragment shader. I need to do computations on an image at different scales. I would like to use mipmaps, since their generation can be automatic and hardware accelerated. However, I can't manage to access the mipmap levels in the fragment shader. I enabled automatic mipmap generation: glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE); I tried using texture2DLod in the shader
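One host-side note on the generation step (this does not by itself answer the per-level access question): GL_GENERATE_MIPMAP is the legacy automatic mechanism, and on OpenGL 3.0 or newer the usual pattern is to rebuild the chain explicitly after rendering into the base level. A minimal sketch, assuming a GL 3.0+ context and that tex is the texture object from the setup above:

    // Rebuild the mipmap chain of 'tex' after its base level has been updated.
    glBindTexture(GL_TEXTURE_2D, tex);
    glGenerateMipmap(GL_TEXTURE_2D);   // explicit replacement for GL_GENERATE_MIPMAP
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                    GL_LINEAR_MIPMAP_LINEAR);  // a mipmapped min filter, so levels > 0 are actually sampled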