gpu

Run-time GPU or CPU execution?

Submitted by 懵懂的女人 on 2021-01-29 03:51:51
Question: I feel like there has to be a way to write code such that it can run on either the CPU or the GPU. That is, I want to write something that has (for example) a CPU FFT implementation that can be executed if there is no GPU, but that defaults to a GPU FFT when a GPU is present. I haven't been able to craft the right question to get the interwebs to offer up a solution. My application target has GPUs available, and we want to write certain functions to use them. However, our development VMs are a…
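
One common pattern, sketched below against the CUDA runtime API (the fft_cpu/fft_gpu names are placeholders, not any real library's API): probe once at startup for a usable device and dispatch accordingly, so the same binary also runs on GPU-less development VMs.

    #include <cuda_runtime.h>

    // Placeholder backends: any CPU FFT (e.g. FFTW) and any GPU FFT
    // (e.g. cuFFT) could sit behind these declarations.
    void fft_cpu(const float* in, float* out, int n);
    void fft_gpu(const float* in, float* out, int n);

    static bool gpu_available() {
        int count = 0;
        // Fails gracefully (non-success status, count 0) when no driver
        // or device is present, e.g. on a development VM.
        return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
    }

    void fft(const float* in, float* out, int n) {
        static const bool use_gpu = gpu_available();  // probe once, cache result
        if (use_gpu) fft_gpu(in, out, n);
        else         fft_cpu(in, out, n);
    }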

Tensorflow-GPU not using GPU with CUDA, cuDNN

Submitted by 不想你离开。 on 2021-01-28 11:15:11
Question: I want to use TensorFlow on the GPU, so I installed all the needed tools as below: CUDA 11.2, cuDNN 11.1, Anaconda 2020.11, Tensorflow-GPU 2.3.0. I tested that my CUDA/cuDNN installation works using the deviceQuery example, but TensorFlow did not use the GPU. I then found that a version-compatibility issue was possible, so I installed cudatoolkit and cudnn in a conda environment, checking version compatibility on the TensorFlow website, which gives: CUDA 10.2.89, cuDNN 7.6.5, Tensorflow-GPU 2.3.0. But after…
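
A quick sanity check before juggling more versions is to ask TensorFlow itself what it sees (standard TF 2.x calls):

    import tensorflow as tf

    # An empty GPU list usually means the CUDA/cuDNN libraries this
    # TensorFlow build was compiled against could not be loaded, which
    # points at a version mismatch rather than a TensorFlow bug.
    print(tf.__version__)
    print(tf.config.list_physical_devices('GPU'))

As far as I know, the tested configuration for Tensorflow-GPU 2.3.0 is CUDA 10.1 with cuDNN 7.6, so even CUDA 10.2 may fail to load.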

Run Snakemake rule one sample at a time

Submitted by 痞子三分冷 on 2021-01-28 10:36:27
Question: I'm creating a Snakemake workflow that will wrap some of the tools in the NVIDIA Clara Parabricks pipelines. Because these tools run on GPUs, they can typically handle only one sample at a time; otherwise the GPU runs out of memory. However, Snakemake shoves all the samples through Parabricks at once, seemingly unaware of the GPU memory limit. One solution would be to tell Snakemake to process one sample at a time, hence the question: how do I get Snakemake to process one…
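
The usual answer is a custom resource: declare that the rule consumes one GPU and cap the resource on the command line, so Snakemake serializes those jobs while everything else stays parallel. A minimal sketch (the rule name, file paths, and pbrun command line are placeholders):

    rule parabricks_step:
        input:
            "fastq/{sample}.fastq.gz",
        output:
            "bam/{sample}.bam",
        resources:
            gpu=1,              # this rule claims one GPU
        shell:
            "pbrun ... {input} ... {output}"   # placeholder command line

Run with "snakemake --resources gpu=1" and at most one job claiming the resource executes at any time.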

Why doesn't CUDA result in a speedup in C++ code?

Submitted by 笑着哭i on 2021-01-28 09:29:23
Question: I'm using VS2019 and have an NVIDIA GeForce GPU. I tried the code from this link: https://towardsdatascience.com/writing-lightning-fast-code-with-cuda-c18677dcdd5f The author of that post claims to get a speedup when using CUDA. However, for me the serial version takes around 7 milliseconds while the CUDA version takes around 28 milliseconds. Why is CUDA slower for this code? The code I used is below:

    __global__ void add(int n, float* x, float* y) {
        int index = blockIdx.x * blockDim.x +…
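
Two effects usually explain numbers like these: the first kernel launch pays one-off context-initialization and unified-memory migration costs, and launches are asynchronous, so timing on the host without synchronizing measures the launch, not the kernel. A sketch of a fairer measurement (the kernel is reconstructed along the lines of the NVIDIA sample the linked post builds on; sizes are assumptions):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void add(int n, float* x, float* y) {
        int index  = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)   // grid-stride loop
            y[i] = x[i] + y[i];
    }

    int main() {
        const int N = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        const int blockSize = 256;
        const int numBlocks = (N + blockSize - 1) / blockSize;

        add<<<numBlocks, blockSize>>>(N, x, y);  // warm-up: pays one-off init/migration cost
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        add<<<numBlocks, blockSize>>>(N, x, y);  // the launch actually being timed
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);              // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel: %.3f ms\n", ms);

        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Measured this way, the kernel alone is typically far faster than 28 ms; most of that time is one-off setup and page migration, not compute.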

How to reduce OpenCL enqueue time/any other ideas?

Submitted by 血红的双手。 on 2021-01-27 20:34:40
Question: I have an algorithm that I've been trying to accelerate using OpenCL on my NVIDIA GPU. It has to process a large amount of data (say 100k to millions of items), where for each datum a matrix (on the device) has to be updated first (using the datum and two vectors), and only after the whole matrix has been updated are the two vectors (also on the device) updated using the same datum. So my host code looks something like this:

    for (int i = 0; i < millions; i++) {
        clSetKernelArg(kernel…
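
The per-enqueue overhead here is likely the killer: each clSetKernelArg/clEnqueueNDRangeKernel round trip costs on the order of microseconds, paid millions of times, often alongside a blocking call per iteration. A hedged sketch (kernel and buffer names are hypothetical): set the buffer arguments once outside the loop, refresh only the changing index inside it, and synchronize once at the end; an in-order queue already guarantees the matrix kernel finishes before the vector kernel for the same datum.

    // Buffer arguments never change, so set them once (names hypothetical).
    clSetKernelArg(update_matrix, 0, sizeof(cl_mem), &matrix_buf);
    clSetKernelArg(update_matrix, 1, sizeof(cl_mem), &vectors_buf);
    clSetKernelArg(update_matrix, 2, sizeof(cl_mem), &data_buf);
    clSetKernelArg(update_vectors, 0, sizeof(cl_mem), &matrix_buf);
    clSetKernelArg(update_vectors, 1, sizeof(cl_mem), &vectors_buf);
    clSetKernelArg(update_vectors, 2, sizeof(cl_mem), &data_buf);

    size_t gws_m = matrix_size;   // hypothetical work sizes
    size_t gws_v = vector_size;
    for (cl_int i = 0; i < n; i++) {
        // Argument values are captured at enqueue time, so only the
        // datum index needs refreshing each iteration.
        clSetKernelArg(update_matrix, 3, sizeof(cl_int), &i);
        clSetKernelArg(update_vectors, 3, sizeof(cl_int), &i);
        clEnqueueNDRangeKernel(queue, update_matrix, 1, NULL, &gws_m, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, update_vectors, 1, NULL, &gws_v, NULL, 0, NULL, NULL);
    }
    clFinish(queue);  // one host-device synchronization for the whole run

If that is still too slow, the bigger win is usually moving the loop over data into the kernel itself, so one enqueue processes a whole batch, where the algorithm's dependency structure allows it.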

CUDA global memory load and store

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-27 19:33:23
Question: I am trying to hide global memory latency. Take the following code:

    for (int i = 0; i < N; i++) {
        x = global_memory[i];
        // ... do some computation on x ...
        global_memory[i] = x;
    }

I wanted to know whether a load or store from global memory is blocking, i.e., whether the next line does not run until the load or store has finished. For example, take the following code:

    x_next = global_memory[0];
    for (int i = 0; i < N; i++) {
        x = x_next;
        x_next = global_memory[i+1];
        // ... do some computation on x ...
        global_memory[i] =…
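
For what it's worth: on CUDA GPUs a global load does not block at the load instruction itself; the warp stalls at the first use of the loaded register (and stores are effectively fire-and-forget from the warp's point of view), so the prefetching pattern above is the standard software-pipelining idiom. A sketch of that loop with the last-iteration out-of-bounds read guarded (as written, it reads global_memory[N] when i == N-1):

    float x_next = global_memory[0];
    for (int i = 0; i < N; i++) {
        float x = x_next;
        if (i + 1 < N)
            x_next = global_memory[i + 1];  // issued early; warp stalls only at first use
        // ... do some computation on x ...
        global_memory[i] = x;               // store does not stall the warp
    }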

The behavior of the __CUDA_ARCH__ macro

Submitted by ぃ、小莉子 on 2021-01-27 14:07:10
Question: In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it will generate code for exactly the code path of the current device. However, if __CUDA_ARCH__ appears within device code, it will generate a different code path for each device architecture specified in the compilation options (/arch). Can anyone confirm this is correct?

Answer 1: __CUDA_ARCH__, when used in device code, will carry a number defined to it that reflects the code architecture currently being compiled…
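
To make that concrete, a small sketch (the function name and threshold are made up): nvcc compiles the file once for the host and once per device target, and __CUDA_ARCH__ is defined only during the device passes, each time to that target's value.

    __host__ __device__ float scale(float v) {
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
        return v * 2.0f;   // compiled into the sm_70-and-newer device code path
    #elif defined(__CUDA_ARCH__)
        return v + v;      // compiled into older device code paths
    #else
        return v + v;      // host compilation pass: the macro is undefined
    #endif
    }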