gpu

Thrust filter by key value

心不动则不痛 submitted on 2019-12-24 05:38:10
Question: In my application I have a class like this:

    class sample {
        thrust::device_vector<int> edge_ID;
        thrust::device_vector<float> weight;
        thrust::device_vector<int> layer_ID;
        /* functions, zip_iterators etc. */
    };

At a given index every vector stores the corresponding data of the same edge. I want to write a function that filters out all the edges of a given layer, something like this:

    void filter(const sample& src, sample& dest, const int& target_layer){
        for(...){
            if( src.layer_ID[x] == target_layer
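A minimal sketch of how such a filter is usually written with Thrust (an assumption about the intent, not the asker's final code, and it assumes C++11): thrust::copy_if over a zip iterator of the three vectors, with layer_ID as the stencil. The struct and the small is_layer functor are restated only to keep the sketch self-contained.

    #include <thrust/device_vector.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/tuple.h>
    #include <thrust/copy.h>
    #include <thrust/count.h>

    struct sample {
        thrust::device_vector<int>   edge_ID;
        thrust::device_vector<float> weight;
        thrust::device_vector<int>   layer_ID;
    };

    // Predicate applied to the stencil (layer_ID) values.
    struct is_layer {
        int target;
        __host__ __device__ bool operator()(int layer) const { return layer == target; }
    };

    void filter(const sample& src, sample& dest, int target_layer)
    {
        // Size the destination to the number of matching edges.
        int n = thrust::count(src.layer_ID.begin(), src.layer_ID.end(), target_layer);
        dest.edge_ID.resize(n);
        dest.weight.resize(n);
        dest.layer_ID.resize(n);

        // Zip the three vectors so one copy_if moves whole edges at once.
        auto first = thrust::make_zip_iterator(thrust::make_tuple(
            src.edge_ID.begin(), src.weight.begin(), src.layer_ID.begin()));
        auto last = thrust::make_zip_iterator(thrust::make_tuple(
            src.edge_ID.end(), src.weight.end(), src.layer_ID.end()));
        auto out = thrust::make_zip_iterator(thrust::make_tuple(
            dest.edge_ID.begin(), dest.weight.begin(), dest.layer_ID.begin()));

        // Stencil overload of copy_if: keep a tuple when its layer_ID matches.
        thrust::copy_if(first, last, src.layer_ID.begin(), out, is_layer{target_layer});
    }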

Different ways to optimize a Python code with GPU/PyOpenCL: extern function inside a PyOpenCL kernel

天涯浪子 submitted on 2019-12-24 03:48:09
Question: I have profiled my Python code with the following command:

    python2.7 -m cProfile -o X2_non_flat_multiprocessing_dummy.prof X2_non_flat.py

From the profile I can then visualize globally how the runtime is split across the different greedy functions. As you can see, a lot of time is spent in Pobs_C and the interpolate routine, which correspond to the following code snippet:

    def Pobs_C(z, zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T, R_T, DG_T_fid, DG_T, WGT_T, WT_T, WIAT_T, cl, P_dd_spec, RT500): cc
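For reference, a minimal sketch of reading that .prof file back with the standard pstats module (an illustration, not part of the question); the file name is the one produced by the command above, and Pobs_C is the hot function named in the question:

    import pstats

    # Load the cProfile output and rank entries by cumulative time.
    stats = pstats.Stats("X2_non_flat_multiprocessing_dummy.prof")
    stats.strip_dirs().sort_stats("cumulative").print_stats(20)   # top 20 entries

    # Show which callers account for the time spent in the hot function.
    stats.print_callers("Pobs_C")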

Will adding GPU cards automatically scale tensorflow usage?

和自甴很熟 submitted on 2019-12-24 02:23:18
Question: Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with TensorFlow. Now suppose I want to train with a larger sample 2N and/or a deeper network 2L and I get an out-of-memory error. Will plugging in additional GPU cards automatically solve this problem (supposing that the total memory of all GPU cards is sufficient to hold the batch and its gradients)? Or is it impossible with pure TensorFlow? I've read that there are bitcoin or ethereum miners that can
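For context, a small TF 1.x-style sketch (an assumption about the setup, not from the question): TensorFlow does not spread a single graph across extra GPUs on its own; ops run on one device unless they are placed explicitly, for example with tf.device.

    import tensorflow as tf

    # Two independent matmuls pinned to two different cards by hand.
    with tf.device('/gpu:0'):
        a = tf.random_normal([4096, 4096])
        b = tf.matmul(a, a)

    with tf.device('/gpu:1'):
        c = tf.random_normal([4096, 4096])
        d = tf.matmul(c, c)

    # allow_soft_placement falls back to an available device if a GPU is missing.
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        sess.run([b, d])

Splitting one model or one batch across the cards (model or data parallelism) has to be written explicitly in the same way, which is why adding cards does not lift the memory limit by itself.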

Information/example on Unified Virtual Addressing (UVA) in CUDA

痞子三分冷 submitted on 2019-12-24 01:43:30
Question: I'm trying to understand the concept of Unified Virtual Addressing (UVA) in CUDA. I have two questions:

1. Is there any sample (pseudo)code available that demonstrates this concept?
2. I read in the CUDA C Programming Guide that UVA can be used only with 64-bit operating systems. Why is that so?

Answer 1: A unified virtual address space combines the pointer (values) and allocation mappings used in device code with the pointer (values) and allocation mappings used in host code into a single unified space. 1
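A small sketch of what UVA enables in practice (an illustration, not part of the quoted answer): because host and device allocations live in one address space, the runtime can deduce the copy direction from the pointer values alone, so cudaMemcpyDefault works without naming HostToDevice or DeviceToHost. The 64-bit requirement follows from needing a virtual address range large enough to hold the host plus all device allocations side by side.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float *h = NULL, *d = NULL;
        cudaMallocHost((void**)&h, 1024 * sizeof(float));   // pinned host buffer
        cudaMalloc((void**)&d, 1024 * sizeof(float));       // device buffer

        // Under UVA the copy direction is inferred from the pointers themselves.
        cudaError_t err = cudaMemcpy(d, h, 1024 * sizeof(float), cudaMemcpyDefault);
        printf("copy with cudaMemcpyDefault: %s\n", cudaGetErrorString(err));

        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }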

Optimal workgroup size for sum reduction in OpenCL

╄→гoц情女王★ submitted on 2019-12-24 00:38:25
Question: I am using the following kernel for sum reduction.

    __kernel void reduce(__global float* input, __global float* output, __local float* sdata)
    {
        // load shared mem
        unsigned int tid = get_local_id(0);
        unsigned int bid = get_group_id(0);
        unsigned int gid = get_global_id(0);
        unsigned int localSize = get_local_size(0);

        unsigned int stride = gid * 2;
        sdata[tid] = input[stride] + input[stride + 1];

        barrier(CLK_LOCAL_MEM_FENCE);

        // do reduction in shared mem
        for(unsigned int s = localSize >> 2; s > 0;
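For comparison, a standard local-memory tree reduction (a sketch of the usual pattern, not the asker's missing loop body): each work-item loads and adds two elements, then the active half of the work-group repeatedly folds the upper half of sdata onto the lower half, leaving one partial sum per work-group.

    __kernel void reduce_sketch(__global const float* input,
                                __global float* output,
                                __local float* sdata)
    {
        unsigned int tid = get_local_id(0);
        unsigned int gid = get_global_id(0);
        unsigned int localSize = get_local_size(0);

        // First addition happens during the load: two global elements per work-item.
        unsigned int stride = gid * 2;
        sdata[tid] = input[stride] + input[stride + 1];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction in local memory.
        for (unsigned int s = localSize / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // One partial sum per work-group; reduce further on the host or in a second pass.
        if (tid == 0)
            output[get_group_id(0)] = sdata[0];
    }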

Cholesky decomposition with CUDA

让人想犯罪 __ submitted on 2019-12-23 19:32:16
Question: I am trying to implement Cholesky decomposition using the cuSOLVER library. I am a beginner CUDA programmer and I have always specified block sizes and grid sizes, but I am not able to find out how these can be set explicitly by the programmer with cuSOLVER functions. Here is the documentation: http://docs.nvidia.com/cuda/cusolver/index.html#introduction The QR decomposition is implemented using the cuSOLVER library (see the example here: http://docs.nvidia.com/cuda/cusolver/index.html#ormqr
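A minimal sketch of a dense Cholesky call with cuSOLVER (the 3x3 matrix is made up for illustration): the library picks its launch configuration internally, which is why no block or grid sizes are exposed to the caller.

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    int main() {
        const int n = 3, lda = 3;
        // Symmetric positive-definite matrix, column-major.
        double hA[lda * n] = { 4, 2, 2,
                               2, 3, 1,
                               2, 1, 3 };

        double *dA; double *dWork; int *dInfo; int lwork = 0;
        cudaMalloc((void**)&dA, sizeof(hA));
        cudaMalloc((void**)&dInfo, sizeof(int));
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

        cusolverDnHandle_t handle;
        cusolverDnCreate(&handle);

        // Query the workspace size, then factor the lower triangle in place.
        cusolverDnDpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, dA, lda, &lwork);
        cudaMalloc((void**)&dWork, lwork * sizeof(double));
        cusolverDnDpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, dA, lda, dWork, lwork, dInfo);

        int info = 0;
        cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);
        printf("Dpotrf info = %d (0 means success)\n", info);

        cusolverDnDestroy(handle);
        cudaFree(dA); cudaFree(dWork); cudaFree(dInfo);
        return 0;
    }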

Simple CUDA program execution without GPU hardware using the NVIDIA GPU Computing SDK 4.0 and Microsoft VC++ 2010 Express

僤鯓⒐⒋嵵緔 submitted on 2019-12-23 19:18:24
Question: I am new to GPU computing, but I've read somewhere that it's possible to execute a CUDA program without a GPU card using a simulator/emulator. I have installed NVIDIA's GPU Computing SDK 4.0 and Visual C++ 2010 Express on Windows Vista. I would like to know:

1. Whether it is feasible or not to run CUDA code without a GPU, using NVIDIA's GPU Computing SDK 4.0 and Visual C++ 2010 Express?
2. Why I get the following error when I try to execute a sample program I have: ------ Build started: Project:
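As a side note, a small sketch (not from the question) of how a CUDA program can check at runtime whether any device is present at all; on a machine without a GPU the call fails or reports a count of zero:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess || count == 0) {
            printf("No CUDA device available: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Found %d CUDA device(s)\n", count);
        return 0;
    }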

cuda: warp divergence overhead vs extra arithmetic

妖精的绣舞 submitted on 2019-12-23 18:21:58
Question: Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs. But what is the overhead of warp divergence (scheduling only some of the threads to execute certain lines) vs. additional useless arithmetic? Consider the following dummy example:

version 1:

    __device__ int get_D (int A, int B, int C)
    {
        // The value A is potentially different for every thread.
        int D = 0;
        if (A < 10)      D = A*6;
        else if (A < 17) D = A*6 + B*2;
        else if (A < 26) D = A*6 + B*2 + C;
        else D
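A hedged guess at what a branch-free alternative ("version 2") could look like, trading divergence for arithmetic that is useless to most threads; the final else branch is cut off in the question, so it is left out here:

    __device__ int get_D_branchless(int A, int B, int C)
    {
        // Comparisons become 0/1 multipliers, so every thread runs the same instructions.
        int D = A * 6;
        D += (A >= 10) * (B * 2);   // contributes only once A reaches 10
        D += (A >= 17) * C;         // contributes only once A reaches 17
        // (whatever the truncated "else" branch adds for A >= 26 is omitted)
        return D;
    }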

How to access GPUs on different nodes in a cluster with Slurm?

徘徊边缘 submitted on 2019-12-23 18:12:20
Question: I have access to a cluster that's run by Slurm, in which each node has 4 GPUs. I have a code that needs 8 GPUs. So the question is how can I request 8 GPUs on a cluster where each node has only 4 GPUs? This is the job that I tried to submit via sbatch:

    #!/bin/bash
    #SBATCH --gres=gpu:8
    #SBATCH --nodes=2
    #SBATCH --mem=16000M
    #SBATCH --time=0-01:00

But then I get the following error:

    sbatch: error: Batch job submission failed: Requested node configuration is not available

Then I changed my
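A sketch of the usual fix (an assumption, not the accepted answer): --gres is counted per node, so the request has to be 4 GPUs on each of 2 nodes rather than 8 on one node; the launcher script name below is hypothetical:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:4
    #SBATCH --mem=16000M
    #SBATCH --time=0-01:00

    # 8 GPUs in total, 4 visible to the task on each node.
    srun python my_multi_node_script.py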

How to force Theano to parallelize an operation on GPU (test case: numpy.bincount)

半城伤御伤魂 submitted on 2019-12-23 17:47:10
Question: I am looking for a way to speed up the computation of bincount using the GPU. Reference code in numpy:

    x_new = numpy.random.randint(0, 1000, 1000000)
    %timeit numpy.bincount(x_new)
    100 loops, best of 3: 2.33 ms per loop

I want to measure only the speed of the operation, not the time spent on passing the array, so I create a shared variable:

    x = theano.shared(numpy.random.randint(0, 1000, 1000000))
    theano_bincount = theano.function([], T.extra_ops.bincount(x))

This operation is of course highly
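One way to phrase the same count so Theano can schedule it on the GPU is a scatter-add with inc_subtensor, which accumulates repeated indices (a sketch under that assumption, not the accepted answer); n_bins matches the 0..999 range used above:

    import numpy
    import theano
    import theano.tensor as T

    n_bins = 1000
    x = theano.shared(numpy.random.randint(0, n_bins, 1000000))

    # Scatter-add: start from zeros and add 1 at every index in x.
    counts = T.inc_subtensor(T.zeros(n_bins, dtype='int64')[x], 1)
    theano_bincount = theano.function([], counts)

    # Matches numpy.bincount(x.get_value(), minlength=n_bins).
    print(theano_bincount()[:10])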