gpgpu

Do-while doesn't work inside CUDA kernel

怎甘沉沦 submitted on 2019-12-13 04:37:36
Question: Ok, I'm pretty new to CUDA, and I'm kind of lost, really lost. I'm trying to calculate pi using the Monte Carlo method, and at the end I get just one add instead of 50. I don't want to use a "do while" for calling the kernel, since that's too slow. My issue is that my code doesn't loop; it executes only once in the kernel. I'd also like all the threads to access the same niter and pi, so that when some thread hits the counters all the others would stop. #define SEED 35791246 __shared__ int niter; _

Concurrently running two for loops with the same number of loop cycles involving GPU and CPU tasks on two GPUs

ε祈祈猫儿з submitted on 2019-12-13 04:33:23
Question: I have two for loops in my code running the same number of loop cycles. The two loops are independent (each loop works on different input data). Within one loop, there are CPU functions and several kernels that do not run concurrently. Can I run these iterations on separate GPUs? Answer 1: You can run the involved kernels separately on two different GPUs. You have to take care of synchronizing the CPU processing of the partial outcomes of the two GPUs. Due to the presence of a sequential

Generate Index using CUDA-C

只谈情不闲聊 submitted on 2019-12-13 04:31:50
Question: I am trying to generate the set of indices below. I have a CUDA grid that consists of 20 blocks (blockIdx: from 0 to 19), with each block subdivided into 4 sub-blocks (sub-block idx: 0, 1, 2 and 3). I am trying to generate an index pattern like this: threadIdx (tid), SubBlockIdxA (SBA), SubBlockIdxB (SBB), BlockIdxA (BA), BlockIdxB (BB):

             Required             Obtained
tid     SBA  SBB  BA  BB     SBA  SBB  BA  BB
 0       0    1   0   0       0    1   0   0
 1       1    0   0   1       1    0   0   1
 2       0    1   1   1       0    1   1   1
 3       1    0   1   2       1    0   1   2
 4       0    1   2   2       0    1   2   2
 5       1    0   2   3       1    0   2   3
 6       0    1   3   3       0    1   3   3
 7       1    0   3

Simple Thrust code performs about half as fast as my naive CUDA kernel. Am I using Thrust wrong?

天涯浪子 submitted on 2019-12-13 03:51:34
Question: I'm pretty new to CUDA and Thrust, but my impression was that Thrust, when used well, is supposed to offer better performance than naively written CUDA kernels. Am I using Thrust in a sub-optimal way? Below is a complete, minimal example that takes an array u of length N+2 and, for each i between 1 and N, computes the average 0.5*(u[i-1] + u[i+1]) and puts the result in uNew[i]. (uNew[0] is set to u[0] and uNew[N+1] is set to u[N+1] so that the boundary terms don't change.) The code performs
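For reference, the computation being benchmarked is just a three-point stencil. A sequential sketch of it (not the asker's Thrust or kernel code):

```cpp
#include <vector>

// Sequential version of the stencil from the question: uNew[i] is the
// average of u's two neighbours, and the boundary entries are copied.
std::vector<double> average_neighbors(const std::vector<double>& u) {
    const std::size_t N = u.size() - 2;  // u has length N+2
    std::vector<double> uNew(u.size());
    uNew[0] = u[0];          // boundary terms don't change
    uNew[N + 1] = u[N + 1];
    for (std::size_t i = 1; i <= N; ++i)
        uNew[i] = 0.5 * (u[i - 1] + u[i + 1]);
    return uNew;
}
```

In Thrust this maps naturally onto thrust::transform over a pair of shifted iterators; the question is whether that formulation can match a hand-written kernel.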

How to reduce nonconsecutive segments of numbers in array with Thrust

喜你入骨 submitted on 2019-12-13 03:06:03
Question: I have a 1D array "A" which is composed of many arrays "a", like this: I'm implementing code to sum up non-consecutive segments (sum up the numbers in the segments of the same color of each array "a" in "A"), as follows: Any ideas on how to do that efficiently with Thrust? Thank you very much. Note: the pictures represent only one array "a"; the big array "A" contains many arrays "a". Answer 1: In the general case, where the ordering of the data and grouping by segments is not known in advance, the
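The general-case approach the answer alludes to is key-based: tag each element with its segment id ("color"), then reduce per id. A sequential sketch of that idea (with Thrust this is typically sort_by_key followed by reduce_by_key; the example ids and values are illustrative):

```cpp
#include <map>
#include <vector>

// Sum the elements of A grouped by a per-element segment id ("color"),
// regardless of where the segment's pieces appear in the array.
std::map<int, int> sum_by_segment(const std::vector<int>& ids,
                                  const std::vector<int>& A) {
    std::map<int, int> sums;
    for (std::size_t i = 0; i < A.size(); ++i)
        sums[ids[i]] += A[i];  // order doesn't matter, so segments may be non-consecutive
    return sums;
}
```

For example, with ids {0, 0, 1, 0, 1} and values {1, 2, 3, 4, 5}, segment 0 sums to 7 and segment 1 sums to 8 even though the segments interleave.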

clGetProgramBuildInfo does not print build log

荒凉一梦 submitted on 2019-12-13 02:17:36
Question: I have written code in OpenCL. There is an error while building the kernel program; the error code is -11. I tried printing the build log, but it does not print a proper log and instead generates some random characters. Here is that part:

//these are variable declarations
cl_device_id* devices;
cl_program kernelprgrm;
size_t size;
//these variables have already been assigned properly

//main code
clGetProgramBuildInfo(kernelprgrm, devices[i], CL_PROGRAM_BUILD_LOG, 0, NULL, &size);
char
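clGetProgramBuildInfo follows the standard two-call pattern: query the required size first, then allocate (plus room for a terminator) and fetch. Garbage output usually means the second call got an undersized or unterminated buffer. The idiom itself, shown here with snprintf as a runnable stand-in for the OpenCL call:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Generic query-size-then-fetch idiom.  With OpenCL it looks like:
//   clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &size);
//   std::vector<char> log(size + 1);
//   clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, size, log.data(), NULL);
//   log[size] = '\0';
// snprintf stands in below so the pattern can actually run anywhere.
std::string build_message(int code) {
    int size = std::snprintf(nullptr, 0, "build error %d", code);   // 1st call: size only
    std::vector<char> buf(size + 1);                                // +1 for '\0'
    std::snprintf(buf.data(), buf.size(), "build error %d", code);  // 2nd call: fill buffer
    return std::string(buf.data());
}
```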

When to use volatile with register/local variables

吃可爱长大的小学妹 submitted on 2019-12-12 19:03:14
Question: What is the meaning of declaring register arrays in CUDA with the volatile qualifier? When I tried the volatile keyword with a register array, it reduced the amount of register memory spilled to local memory (i.e., it forced CUDA to use registers instead of local memory). Is this the intended behavior? I did not find any information about the use of volatile with register arrays in the CUDA documentation. Here is the ptxas -v output for both versions. With volatile qualifier: __volatile__

gnupg get_key failed in PHP

社会主义新天地 submitted on 2019-12-12 17:06:51
Question: I am using gnupg to digitally sign files in PHP. It was working fine before; suddenly I am getting this error: PHP Fatal error: Uncaught exception 'Exception' with message 'get_key failed'

putenv("GNUPGHOME=/tmp");
$publicKey = file_get_contents("./media/public.key");
$gpg = new gnupg();
$gpg->seterrormode(gnupg::ERROR_EXCEPTION);
$info = $gpg->import($publicKey);
$gpg->addsignkey($info['fingerprint'], DIGITAL_FILE_PASS);
$signed = $gpg->sign($data_to_sign);

What could be the reason?

How to compile a CUDA kernel without optimizing at all?

隐身守侯 submitted on 2019-12-12 14:44:12
Question: If I compile this:

__global__ void dummy_kernel(float *a, int N, float* b, int N2){
    unsigned int i = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int j = blockIdx.x*blockDim.x + threadIdx.x;
}

I get this empty PTX code:

.entry _Z9dummy_kernelPfiS_i(
    .param .u64 _Z9dummy_kernelPfiS_i_param_0,
    .param .u32 _Z9dummy_kernelPfiS_i_param_1,
    .param .u64 _Z9dummy_kernelPfiS_i_param_2,
    .param .u32 _Z9dummy_kernelPfiS_i_param_3
)
{
    ret;
}

Is there a way to force the compiler to generate PTX without

Are CUDA .ptx files portable?

a 夏天 submitted on 2019-12-12 11:35:44
Question: I'm studying the cudaDecodeD3D9 sample to learn how CUDA works, and at compilation it generates a .ptx file from a .cu file. This .ptx file is, as I understand it so far, an intermediate representation that will be compiled just-in-time for any specific GPU. The sample uses the class cudaModuleMgr to load this file via cuModuleLoadDataEx. The .ptx file is in text format, and I can see that at the top of it there is a bunch of hardcoded paths from my machine, including my user folder, i.e.: .file 1 "C