gpu

Thrust filter by key value

心不动则不痛 submitted on 2019-12-24 05:38:10
Question: In my application I have a class like this:

    class sample {
        thrust::device_vector<int> edge_ID;
        thrust::device_vector<float> weight;
        thrust::device_vector<int> layer_ID;
        /* functions, zip_iterators etc. */
    };

At a given index every vector stores the corresponding data of the same edge. I want to write a function that filters out all the edges of a given layer, something like this:

    void filter(const sample& src, sample& dest, const int& target_layer){
        for(...){
            if( src.layer_ID[x] == target_layer
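A minimal sketch of how such a filter is usually written with Thrust (an assumption about the intent, not the asker's final code, and it assumes C++11): thrust::copy_if over a zip iterator of the three vectors, with layer_ID as the stencil. The struct and the small is_layer functor are restated only to keep the sketch self-contained.

    #include <thrust/device_vector.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/tuple.h>
    #include <thrust/copy.h>
    #include <thrust/count.h>

    struct sample {
        thrust::device_vector<int>   edge_ID;
        thrust::device_vector<float> weight;
        thrust::device_vector<int>   layer_ID;
    };

    // Predicate applied to the stencil (layer_ID) values.
    struct is_layer {
        int target;
        __host__ __device__ bool operator()(int layer) const { return layer == target; }
    };

    void filter(const sample& src, sample& dest, int target_layer)
    {
        // Size the destination to the number of matching edges.
        int n = thrust::count(src.layer_ID.begin(), src.layer_ID.end(), target_layer);
        dest.edge_ID.resize(n);
        dest.weight.resize(n);
        dest.layer_ID.resize(n);

        // Zip the three vectors so one copy_if moves whole edges at once.
        auto first = thrust::make_zip_iterator(thrust::make_tuple(
            src.edge_ID.begin(), src.weight.begin(), src.layer_ID.begin()));
        auto last = thrust::make_zip_iterator(thrust::make_tuple(
            src.edge_ID.end(), src.weight.end(), src.layer_ID.end()));
        auto out = thrust::make_zip_iterator(thrust::make_tuple(
            dest.edge_ID.begin(), dest.weight.begin(), dest.layer_ID.begin()));

        // Stencil overload of copy_if: keep a tuple when its layer_ID matches.
        thrust::copy_if(first, last, src.layer_ID.begin(), out, is_layer{target_layer});
    }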

Different ways to optimize a Python code with GPU/PyOpenCL: extern function inside a PyOpenCL kernel

天涯浪子 submitted on 2019-12-24 03:48:09
Question: I have profiled my Python code with the following command:

    python2.7 -m cProfile -o X2_non_flat_multiprocessing_dummy.prof X2_non_flat.py

From the profile I can then visualize globally how the runtime is split across the different greedy functions. As you can see, a lot of time is spent in Pobs_C and the interpolate routine, which correspond to the following code snippet:

    def Pobs_C(z, zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T, R_T, DG_T_fid, DG_T, WGT_T, WT_T, WIAT_T, cl, P_dd_spec, RT500): cc
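For reference, a minimal sketch of reading that .prof file back with the standard pstats module (an illustration, not part of the question); the file name is the one produced by the command above, and Pobs_C is the hot function named in the question:

    import pstats

    # Load the cProfile output and rank entries by cumulative time.
    stats = pstats.Stats("X2_non_flat_multiprocessing_dummy.prof")
    stats.strip_dirs().sort_stats("cumulative").print_stats(20)   # top 20 entries

    # Show which callers account for the time spent in the hot function.
    stats.print_callers("Pobs_C")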

Will adding GPU cards automatically scale tensorflow usage?

和自甴很熟 submitted on 2019-12-24 02:23:18
Question: Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with TensorFlow. Now suppose I want to train with a larger sample 2N and/or a deeper network 2L and I get an out-of-memory error. Will plugging in additional GPU cards automatically solve this problem (supposing that the total memory of all GPU cards is sufficient to hold the batch and its gradients)? Or is it impossible with pure TensorFlow? I've read that there are bitcoin or ethereum miners that can
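For context, a small TF 1.x-style sketch (an assumption about the setup, not from the question): TensorFlow does not spread a single graph across extra GPUs on its own; ops run on one device unless they are placed explicitly, for example with tf.device.

    import tensorflow as tf

    # Two independent matmuls pinned to two different cards by hand.
    with tf.device('/gpu:0'):
        a = tf.random_normal([4096, 4096])
        b = tf.matmul(a, a)

    with tf.device('/gpu:1'):
        c = tf.random_normal([4096, 4096])
        d = tf.matmul(c, c)

    # allow_soft_placement falls back to an available device if a GPU is missing.
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        sess.run([b, d])

Splitting one model or one batch across the cards (model or data parallelism) has to be written explicitly in the same way, which is why adding cards does not lift the memory limit by itself.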

Information/example on Unified Virtual Addressing (UVA) in CUDA

痞子三分冷 submitted on 2019-12-24 01:43:30
Question: I'm trying to understand the concept of Unified Virtual Addressing (UVA) in CUDA. I have two questions:

1. Is there any sample (pseudo)code available that demonstrates this concept?
2. I read in the CUDA C Programming Guide that UVA can be used only with 64-bit operating systems. Why is that so?

Answer 1: A unified virtual address space combines the pointer (values) and allocation mappings used in device code with the pointer (values) and allocation mappings used in host code into a single unified space. 1
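A small sketch of what UVA enables in practice (an illustration, not part of the quoted answer): because host and device allocations live in one address space, the runtime can deduce the copy direction from the pointer values alone, so cudaMemcpyDefault works without naming HostToDevice or DeviceToHost. The 64-bit requirement follows from needing a virtual address range large enough to hold the host plus all device allocations side by side.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float *h = NULL, *d = NULL;
        cudaMallocHost((void**)&h, 1024 * sizeof(float));   // pinned host buffer
        cudaMalloc((void**)&d, 1024 * sizeof(float));       // device buffer

        // Under UVA the copy direction is inferred from the pointers themselves.
        cudaError_t err = cudaMemcpy(d, h, 1024 * sizeof(float), cudaMemcpyDefault);
        printf("copy with cudaMemcpyDefault: %s\n", cudaGetErrorString(err));

        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }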

Optimal workgroup size for sum reduction in OpenCL

╄→гoц情女王★ submitted on 2019-12-24 00:38:25
Question: I am using the following kernel for sum reduction.

    __kernel void reduce(__global float* input, __global float* output, __local float* sdata)
    {
        // load shared mem
        unsigned int tid = get_local_id(0);
        unsigned int bid = get_group_id(0);
        unsigned int gid = get_global_id(0);
        unsigned int localSize = get_local_size(0);

        unsigned int stride = gid * 2;
        sdata[tid] = input[stride] + input[stride + 1];

        barrier(CLK_LOCAL_MEM_FENCE);

        // do reduction in shared mem
        for(unsigned int s = localSize >> 2; s > 0;
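For comparison, a standard local-memory tree reduction (a sketch of the usual pattern, not the asker's missing loop body): each work-item loads and adds two elements, then the active half of the work-group repeatedly folds the upper half of sdata onto the lower half, leaving one partial sum per work-group.

    __kernel void reduce_sketch(__global const float* input,
                                __global float* output,
                                __local float* sdata)
    {
        unsigned int tid = get_local_id(0);
        unsigned int gid = get_global_id(0);
        unsigned int localSize = get_local_size(0);

        // First addition happens during the load: two global elements per work-item.
        unsigned int stride = gid * 2;
        sdata[tid] = input[stride] + input[stride + 1];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction in local memory.
        for (unsigned int s = localSize / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // One partial sum per work-group; reduce further on the host or in a second pass.
        if (tid == 0)
            output[get_group_id(0)] = sdata[0];
    }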

Cholesky decomposition with CUDA

让人想犯罪 __ submitted on 2019-12-23 19:32:16
Question: I am trying to implement Cholesky decomposition using the cuSOLVER library. I am a beginner CUDA programmer and I have always specified block sizes and grid sizes, but I am not able to find out how these can be set explicitly by the programmer with cuSOLVER functions. Here is the documentation: http://docs.nvidia.com/cuda/cusolver/index.html#introduction The QR decomposition is implemented using the cuSOLVER library (see the example here: http://docs.nvidia.com/cuda/cusolver/index.html#ormqr
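A minimal sketch of a dense Cholesky call with cuSOLVER (the 3x3 matrix is made up for illustration): the library picks its launch configuration internally, which is why no block or grid sizes are exposed to the caller.

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    int main() {
        const int n = 3, lda = 3;
        // Symmetric positive-definite matrix, column-major.
        double hA[lda * n] = { 4, 2, 2,
                               2, 3, 1,
                               2, 1, 3 };

        double *dA; double *dWork; int *dInfo; int lwork = 0;
        cudaMalloc((void**)&dA, sizeof(hA));
        cudaMalloc((void**)&dInfo, sizeof(int));
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

        cusolverDnHandle_t handle;
        cusolverDnCreate(&handle);

        // Query the workspace size, then factor the lower triangle in place.
        cusolverDnDpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, dA, lda, &lwork);
        cudaMalloc((void**)&dWork, lwork * sizeof(double));
        cusolverDnDpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, dA, lda, dWork, lwork, dInfo);

        int info = 0;
        cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);
        printf("Dpotrf info = %d (0 means success)\n", info);

        cusolverDnDestroy(handle);
        cudaFree(dA); cudaFree(dWork); cudaFree(dInfo);
        return 0;
    }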

Simple CUDA program execution without GPU hardware using the NVIDIA GPU Computing SDK 4.0 and Microsoft VC++ 2010 Express

僤鯓⒐⒋嵵緔 submitted on 2019-12-23 19:18:24
Question: I am new to GPU computing, but I've read somewhere that it's possible to execute a CUDA program without a GPU card using a simulator/emulator. I have installed NVIDIA's GPU Computing SDK 4.0 and Visual C++ 2010 Express on Windows Vista. I would like to know:

1. Whether it is feasible or not to run CUDA code without a GPU, using NVIDIA's GPU Computing SDK 4.0 and Visual C++ 2010 Express?
2. Why I get the following error when I try to execute a sample program I have: ------ Build started: Project:
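As a side note, a small sketch (not from the question) of how a CUDA program can check at runtime whether any device is present at all; on a machine without a GPU the call fails or reports a count of zero:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess || count == 0) {
            printf("No CUDA device available: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Found %d CUDA device(s)\n", count);
        return 0;
    }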

cuda: warp divergence overhead vs extra arithmetic

妖精的绣舞 submitted on 2019-12-23 18:21:58
Question: Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs. But what is the overhead of warp divergence (scheduling only some of the threads to execute certain lines) vs. additional useless arithmetic? Consider the following dummy example:

version 1:

    __device__ int get_D (int A, int B, int C)
    {
        // The value A is potentially different for every thread.
        int D = 0;
        if (A < 10)      D = A*6;
        else if (A < 17) D = A*6 + B*2;
        else if (A < 26) D = A*6 + B*2 + C;
        else D
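A hedged guess at what a branch-free alternative ("version 2") could look like, trading divergence for arithmetic that is useless to most threads; the final else branch is cut off in the question, so it is left out here:

    __device__ int get_D_branchless(int A, int B, int C)
    {
        // Comparisons become 0/1 multipliers, so every thread runs the same instructions.
        int D = A * 6;
        D += (A >= 10) * (B * 2);   // contributes only once A reaches 10
        D += (A >= 17) * C;         // contributes only once A reaches 17
        // (whatever the truncated "else" branch adds for A >= 26 is omitted)
        return D;
    }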

How to access GPUs on different nodes in a cluster with Slurm?

徘徊边缘 submitted on 2019-12-23 18:12:20
Question: I have access to a cluster that's run by Slurm, in which each node has 4 GPUs. I have a code that needs 8 GPUs. So the question is how can I request 8 GPUs on a cluster where each node has only 4 GPUs? This is the job that I tried to submit via sbatch:

    #!/bin/bash
    #SBATCH --gres=gpu:8
    #SBATCH --nodes=2
    #SBATCH --mem=16000M
    #SBATCH --time=0-01:00

But then I get the following error:

    sbatch: error: Batch job submission failed: Requested node configuration is not available

Then I changed my
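A sketch of the usual fix (an assumption, not the accepted answer): --gres is counted per node, so the request has to be 4 GPUs on each of 2 nodes rather than 8 on one node; the launcher script name below is hypothetical:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:4
    #SBATCH --mem=16000M
    #SBATCH --time=0-01:00

    # 8 GPUs in total, 4 visible to the task on each node.
    srun python my_multi_node_script.py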

How to force Theano to parallelize an operation on GPU (test case: numpy.bincount)

半城伤御伤魂 submitted on 2019-12-23 17:47:10
Question: I am looking for a way to speed up the computation of bincount using the GPU. Reference code in numpy:

    x_new = numpy.random.randint(0, 1000, 1000000)
    %timeit numpy.bincount(x_new)
    100 loops, best of 3: 2.33 ms per loop

I want to measure only the speed of the operation, not the time spent on passing the array, so I create a shared variable:

    x = theano.shared(numpy.random.randint(0, 1000, 1000000))
    theano_bincount = theano.function([], T.extra_ops.bincount(x))

This operation is of course highly
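One way to phrase the same count so Theano can schedule it on the GPU is a scatter-add with inc_subtensor, which accumulates repeated indices (a sketch under that assumption, not the accepted answer); n_bins matches the 0..999 range used above:

    import numpy
    import theano
    import theano.tensor as T

    n_bins = 1000
    x = theano.shared(numpy.random.randint(0, n_bins, 1000000))

    # Scatter-add: start from zeros and add 1 at every index in x.
    counts = T.inc_subtensor(T.zeros(n_bins, dtype='int64')[x], 1)
    theano_bincount = theano.function([], counts)

    # Matches numpy.bincount(x.get_value(), minlength=n_bins).
    print(theano_bincount()[:10])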