gpgpu

CUDA-parallelized raytracer: very low speedup

Submitted by ぐ巨炮叔叔 on 2019-12-23 03:48:06
Question: I'm coding a raytracer using (py)CUDA and I'm getting a really low speedup; for example, in a 1000x1000 image, the GPU-parallelized code is just 4 times faster than the sequential code executed on the CPU. For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, if I want to
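A minimal CUDA sketch of the launch configuration described above (one block per ray, 5 threads per block, shared memory for intermediate results); the kernel body and names are placeholders, not the asker's actual code:

    __global__ void trace_ray(float *out)
    {
        // One block per ray; 5 threads cooperate on the 5 equations.
        __shared__ float eq[5];          // shared scratch space for this ray
        int ray = blockIdx.x;            // ray index == block index
        int k   = threadIdx.x;           // which of the 5 equations this thread solves

        eq[k] = 0.0f;                    // placeholder: solve equation k here
        __syncthreads();                 // wait until all 5 results are ready

        if (k == 0)                      // one thread combines the results for this ray
            out[ray] = eq[0] + eq[1] + eq[2] + eq[3] + eq[4];
    }

    // Launch for a 1000x1000 image: one block per pixel/ray, 5 threads each.
    // trace_ray<<<1000 * 1000, 5>>>(d_out);

Note that with only 5 active threads per block, most lanes of each 32-thread warp sit idle, which by itself limits the achievable speedup.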

Can consecutive CUDA atomic operations on global memory benefit from L2 cache?

Submitted by 走远了吗. on 2019-12-23 03:38:25
Question: In a cache-enabled CUDA device, does locality of reference across consecutive atomic operations on global memory addresses by one thread benefit from the L2 cache? For example, I have an atomic operation in a CUDA kernel that uses the returned value: uint a = atomicAnd( &(GM_addr[index]), b ); I'm wondering, if I'm about to use an atomic from the same thread in the same kernel again, whether I can confine the address of the new atomic operation to the 32-byte-long range [ &(GM_addr[index&0xFFFFFFF8]), &(GM_addr[index|7]) ]
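A minimal sketch of the pattern described in the question (two atomicAnd calls from the same thread, with the second address confined to the same 32-byte segment of 4-byte words); GM_addr, index and b follow the excerpt, everything else is assumed:

    __global__ void atomic_locality(unsigned int *GM_addr, unsigned int b)
    {
        unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

        // First atomic on GM_addr[index]; the returned value is used afterwards.
        unsigned int a = atomicAnd(&GM_addr[index], b);

        // Second atomic kept inside the same 32-byte segment
        // [index & 0xFFFFFFF8, index | 7] that the question proposes.
        unsigned int neighbor = (index & 0xFFFFFFF8u) + (a & 7u);
        atomicAnd(&GM_addr[neighbor], b);
    }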

How to generate pseudo-random numbers in CUDA

Submitted by 萝らか妹 on 2019-12-23 02:36:24
Question: I am attempting to build a particle system that uses CUDA to do the heavy lifting. I want to randomize some of the particles' initial values, like velocity and life span. The random numbers don't have to be super random since it's just for visual effect. I found this post that addresses the same subject: Random Number Generation in CUDA. It suggests a linear congruential generator is the way to go. It seems like it should be simple to implement, but I am having trouble getting anything useful out of my
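A minimal device-side linear congruential generator along the lines the linked post suggests; the constants are the common Numerical Recipes LCG parameters, and the per-thread seeding scheme is an assumption:

    __device__ unsigned int lcg_next(unsigned int &state)
    {
        // LCG step: state = state * 1664525 + 1013904223 (mod 2^32).
        state = 1664525u * state + 1013904223u;
        return state;
    }

    __device__ float lcg_uniform(unsigned int &state)
    {
        // Map the 32-bit state to a float in [0, 1).
        return lcg_next(state) * (1.0f / 4294967296.0f);
    }

    __global__ void init_particles(float *velocity, float *life, unsigned int seed, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned int state = seed ^ (i * 2654435761u);    // assumed per-thread seed
        velocity[i] = lcg_uniform(state) * 2.0f - 1.0f;   // e.g. a velocity component in [-1, 1)
        life[i]     = lcg_uniform(state) * 5.0f;          // e.g. a life span in [0, 5)
    }

The statistical quality is nowhere near curand, but for purely visual effects such as particle velocities and life spans it is usually good enough.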

Compute shader not writing to SSBO

Submitted by a 夏天 on 2019-12-23 02:36:07
Question: I'm writing a simple test compute shader that writes a value of 5.0 to every element in a buffer. The buffer's values are initialized to -1, so that I know whether or not creating the buffer and reading the buffer are the problem.

    class ComputeShaderWindow : public QOpenGLWindow {
    public:
        void initializeGL() {
            // Create the OpenGL functions object
            gl = context()->versionFunctions<QOpenGLFunctions_4_3_Core>();
            m_compute_program = new QOpenGLShaderProgram(this);
            auto compute_shader_s = fs:

The impact of goto instruction at intra-warp divergence in CUDA code

Submitted by ぐ巨炮叔叔 on 2019-12-22 13:58:16
Question: For simple intra-warp thread divergence in CUDA, what I know is that the SM selects a re-convergence point (PC address) and executes the instructions on both/multiple paths while disabling the effects of execution for the threads that haven't taken the path. For example, in the piece of code below:

    if( threadIdx.x < 16 ) {
    A:  // do something.
    } else {
    B:  // do something else.
    }
    C:  // rest of code.

C is the re-convergence point; the warp scheduler schedules instructions at both A and B, while disabling

How to profile PyCuda code in Linux?

Submitted by 末鹿安然 on 2019-12-22 12:38:19
Question: I have a simple (tested) PyCUDA app and am trying to profile it. I've tried NVIDIA's Compute Visual Profiler, which runs the program 11 times, then emits this error:

    NV_Warning: Ignoring the invalid profiler config option: fb0_subp0_read_sectors
    Error: Profiler data file '/home/jguy/proj/gpu/tdbp/pyArch/temp_compute_profiler_0_0.csv' does not contain profiler output.
    This can happen when:
    a) Profiling is disabled during the entire run of the application.
    b) The application does not invoke any

Nsight skips (ignores) breakpoints in VS10; CUDA works fine, but Nsight consistently skips over several breakpoints

Submitted by 喜夏-厌秋 on 2019-12-22 06:39:46
Question: I'm using Nsight 2.2, Toolkit 4.2, and the latest NVIDIA driver, with a couple of GPUs in my computer. The build customization is 4.2. I have set "generate GPU output" in the CUDA project properties, and the Nsight monitor is on (everything looks great). I set several breakpoints in my global (kernel) function. Nsight stops at the declaration of the function, but skips over several breakpoints. It's as if Nsight decides whether to hit a breakpoint or skip over it. The funny thing is that Nsight

How do I use the GPU available with OpenMP?

Submitted by 自古美人都是妖i on 2019-12-22 05:57:11
Question: I am trying to get some code to run on the GPU using OpenMP, but I am not succeeding. In my code, I am performing a matrix multiplication using for loops: once using OpenMP pragmas and once without. (This is so that I can compare the execution times.) After the first loop I call omp_get_num_devices() (this is my main test to see if I'm actually connecting to a GPU). No matter what I try, omp_get_num_devices() always returns 0. The computer I am using has two NVIDIA Tesla K40M GPUs. CUDA 7
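A minimal sketch of the offloaded matrix multiplication the question describes, assuming a compiler built with GPU offloading support (e.g. clang or gcc with an NVPTX-enabled OpenMP runtime); the loop structure and names are assumptions, not the asker's code:

    #include <omp.h>
    #include <cstdio>

    void matmul(const float *A, const float *B, float *C, int n)
    {
        // Offload the loop nest to the default device, mapping the data explicitly.
        #pragma omp target teams distribute parallel for collapse(2) \
                map(to: A[0:n*n], B[0:n*n]) map(from: C[0:n*n])
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }

    int main()
    {
        // The question's sanity check: a non-zero count means the runtime can see a GPU.
        printf("devices: %d\n", omp_get_num_devices());
        return 0;
    }

If omp_get_num_devices() returns 0, the target region above silently falls back to the host, so this check is worth doing before timing anything.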

Is there any guarantee that all threads in a WaveFront (OpenCL) are always synchronized?

Submitted by 夙愿已清 on 2019-12-22 01:36:42
Question: As is known, there are warps (in CUDA) and WaveFronts (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf Warps in CUDA: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture 4.1. SIMT Architecture ... A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially

How to configure OpenCL in Visual Studio 2010 for NVIDIA's GPU on Windows?

Submitted by 折月煮酒 on 2019-12-21 20:44:59
Question: I am using an NVIDIA GeForce GTX 480 GPU on the Windows 7 operating system on my ASUS laptop. I have already configured Visual Studio 2010 for CUDA 4.2. How do I configure OpenCL for NVIDIA's GPU in Visual Studio 2010? I have tried every possible way. Is it possible in any way to use the 'CUDA toolkit (CUDA 4.2)' and 'NVIDIA's GPU Computing SDK' to program OpenCL? If yes, then how? If no, then what is the other way? Answer 1: Yes. You should be able to use Visual Studio 2010 to program for OpenCL. It should simply
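Once the project's include path points at the SDK's CL headers and the linker input includes OpenCL.lib, a minimal host-side platform query is a quick way to confirm the configuration works (a sketch, not part of the original answer):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_uint num_platforms = 0;
        clGetPlatformIDs(0, NULL, &num_platforms);     // count the available OpenCL platforms
        printf("OpenCL platforms found: %u\n", num_platforms);

        if (num_platforms > 0) {
            cl_platform_id platform;
            clGetPlatformIDs(1, &platform, NULL);

            char name[256];
            clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
            printf("First platform: %s\n", name);      // should report the NVIDIA platform
        }
        return 0;
    }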