gpu

Run-time GPU or CPU execution?

Submitted by 懵懂的女人 on 2021-01-29 03:51:51
Question: I feel like there has to be a way to write code such that it can run on either the CPU or the GPU. That is, I want to write something that has (for example) a CPU FFT implementation that can be executed if there is no GPU, but that defaults to a GPU FFT when a GPU is present. I haven't been able to craft the right question to get the interwebs to offer up a solution. My application target has GPUs available, and we want to write certain functions to use them. However, our development VMs are a…
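
One common pattern, sketched below against the CUDA runtime API (the fft_cpu/fft_gpu names are placeholders, not any real library's API): probe once at startup for a usable device and dispatch accordingly, so the same binary also runs on GPU-less development VMs.

    #include <cuda_runtime.h>

    // Placeholder backends: any CPU FFT (e.g. FFTW) and any GPU FFT
    // (e.g. cuFFT) could sit behind these declarations.
    void fft_cpu(const float* in, float* out, int n);
    void fft_gpu(const float* in, float* out, int n);

    static bool gpu_available() {
        int count = 0;
        // Fails gracefully (non-success status, count 0) when no driver
        // or device is present, e.g. on a development VM.
        return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
    }

    void fft(const float* in, float* out, int n) {
        static const bool use_gpu = gpu_available();  // probe once, cache result
        if (use_gpu) fft_gpu(in, out, n);
        else         fft_cpu(in, out, n);
    }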

Tensorflow-GPU not using GPU with CUDA, cuDNN

Submitted by 不想你离开。 on 2021-01-28 11:15:11
Question: I want to use TensorFlow on the GPU, so I installed all the needed tools as below: CUDA 11.2, cuDNN 11.1, Anaconda 2020.11, Tensorflow-GPU 2.3.0. I tested that my CUDA/cuDNN installation works using the deviceQuery example, but TensorFlow did not use the GPU. I then found that a version-compatibility issue was possible, so I installed cudatoolkit and cudnn in a conda environment, checking version compatibility on the TensorFlow website, which gives: CUDA 10.2.89, cuDNN 7.6.5, Tensorflow-GPU 2.3.0. But after…
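
A quick sanity check before juggling more versions is to ask TensorFlow itself what it sees (standard TF 2.x calls):

    import tensorflow as tf

    # An empty GPU list usually means the CUDA/cuDNN libraries this
    # TensorFlow build was compiled against could not be loaded, which
    # points at a version mismatch rather than a TensorFlow bug.
    print(tf.__version__)
    print(tf.config.list_physical_devices('GPU'))

As far as I know, the tested configuration for Tensorflow-GPU 2.3.0 is CUDA 10.1 with cuDNN 7.6, so even CUDA 10.2 may fail to load.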

Run Snakemake rule one sample at a time

Submitted by 痞子三分冷 on 2021-01-28 10:36:27
Question: I'm creating a Snakemake workflow that will wrap some of the tools in the NVIDIA Clara Parabricks pipelines. Because these tools run on GPUs, they can typically handle only one sample at a time; otherwise the GPU runs out of memory. However, Snakemake shoves all the samples through Parabricks at once, seemingly unaware of the GPU memory limit. One solution would be to tell Snakemake to process one sample at a time, hence the question: how do I get Snakemake to process one…
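
The usual answer is a custom resource: declare that the rule consumes one GPU and cap the resource on the command line, so Snakemake serializes those jobs while everything else stays parallel. A minimal sketch (the rule name, file paths, and pbrun command line are placeholders):

    rule parabricks_step:
        input:
            "fastq/{sample}.fastq.gz",
        output:
            "bam/{sample}.bam",
        resources:
            gpu=1,              # this rule claims one GPU
        shell:
            "pbrun ... {input} ... {output}"   # placeholder command line

Run with "snakemake --resources gpu=1" and at most one job claiming the resource executes at any time.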

Why doesn't CUDA result in a speedup in C++ code?

Submitted by 笑着哭i on 2021-01-28 09:29:23
Question: I'm using VS2019 and have an NVIDIA GeForce GPU. I tried the code from this link: https://towardsdatascience.com/writing-lightning-fast-code-with-cuda-c18677dcdd5f The author of that post claims to get a speedup when using CUDA. However, for me the serial version takes around 7 milliseconds while the CUDA version takes around 28 milliseconds. Why is CUDA slower for this code? The code I used is below:

    __global__ void add(int n, float* x, float* y) {
        int index = blockIdx.x * blockDim.x +…
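
Two effects usually explain numbers like these: the first kernel launch pays one-off context-initialization and unified-memory migration costs, and launches are asynchronous, so timing on the host without synchronizing measures the launch, not the kernel. A sketch of a fairer measurement (the kernel is reconstructed along the lines of the NVIDIA sample the linked post builds on; sizes are assumptions):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void add(int n, float* x, float* y) {
        int index  = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)   // grid-stride loop
            y[i] = x[i] + y[i];
    }

    int main() {
        const int N = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        const int blockSize = 256;
        const int numBlocks = (N + blockSize - 1) / blockSize;

        add<<<numBlocks, blockSize>>>(N, x, y);  // warm-up: pays one-off init/migration cost
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        add<<<numBlocks, blockSize>>>(N, x, y);  // the launch actually being timed
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);              // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel: %.3f ms\n", ms);

        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Measured this way, the kernel alone is typically far faster than 28 ms; most of that time is one-off setup and page migration, not compute.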

How to reduce OpenCL enqueue time/any other ideas?

Submitted by 血红的双手。 on 2021-01-27 20:34:40
Question: I have an algorithm that I've been trying to accelerate using OpenCL on my NVIDIA GPU. It has to process a large amount of data (say 100k to millions of items), where for each datum a matrix (on the device) has to be updated first (using the datum and two vectors), and only after the whole matrix has been updated are the two vectors (also on the device) updated using the same datum. So my host code looks something like this:

    for (int i = 0; i < millions; i++) {
        clSetKernelArg(kernel…
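
The per-enqueue overhead here is likely the killer: each clSetKernelArg/clEnqueueNDRangeKernel round trip costs on the order of microseconds, paid millions of times, often alongside a blocking call per iteration. A hedged sketch (kernel and buffer names are hypothetical): set the buffer arguments once outside the loop, refresh only the changing index inside it, and synchronize once at the end; an in-order queue already guarantees the matrix kernel finishes before the vector kernel for the same datum.

    // Buffer arguments never change, so set them once (names hypothetical).
    clSetKernelArg(update_matrix, 0, sizeof(cl_mem), &matrix_buf);
    clSetKernelArg(update_matrix, 1, sizeof(cl_mem), &vectors_buf);
    clSetKernelArg(update_matrix, 2, sizeof(cl_mem), &data_buf);
    clSetKernelArg(update_vectors, 0, sizeof(cl_mem), &matrix_buf);
    clSetKernelArg(update_vectors, 1, sizeof(cl_mem), &vectors_buf);
    clSetKernelArg(update_vectors, 2, sizeof(cl_mem), &data_buf);

    size_t gws_m = matrix_size;   // hypothetical work sizes
    size_t gws_v = vector_size;
    for (cl_int i = 0; i < n; i++) {
        // Argument values are captured at enqueue time, so only the
        // datum index needs refreshing each iteration.
        clSetKernelArg(update_matrix, 3, sizeof(cl_int), &i);
        clSetKernelArg(update_vectors, 3, sizeof(cl_int), &i);
        clEnqueueNDRangeKernel(queue, update_matrix, 1, NULL, &gws_m, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, update_vectors, 1, NULL, &gws_v, NULL, 0, NULL, NULL);
    }
    clFinish(queue);  // one host-device synchronization for the whole run

If that is still too slow, the bigger win is usually moving the loop over data into the kernel itself, so one enqueue processes a whole batch, where the algorithm's dependency structure allows it.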

CUDA global memory load and store

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-27 19:33:23
Question: I am trying to hide global memory latency. Take the following code:

    for (int i = 0; i < N; i++) {
        x = global_memory[i];
        // ... do some computation on x ...
        global_memory[i] = x;
    }

I wanted to know whether a load or store from global memory is blocking, i.e., whether the next line does not run until the load or store has finished. For example, take the following code:

    x_next = global_memory[0];
    for (int i = 0; i < N; i++) {
        x = x_next;
        x_next = global_memory[i+1];
        // ... do some computation on x ...
        global_memory[i] =…
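
For what it's worth: on CUDA GPUs a global load does not block at the load instruction itself; the warp stalls at the first use of the loaded register (and stores are effectively fire-and-forget from the warp's point of view), so the prefetching pattern above is the standard software-pipelining idiom. A sketch of that loop with the last-iteration out-of-bounds read guarded (as written, it reads global_memory[N] when i == N-1):

    float x_next = global_memory[0];
    for (int i = 0; i < N; i++) {
        float x = x_next;
        if (i + 1 < N)
            x_next = global_memory[i + 1];  // issued early; warp stalls only at first use
        // ... do some computation on x ...
        global_memory[i] = x;               // store does not stall the warp
    }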

The behavior of the __CUDA_ARCH__ macro

Submitted by ぃ、小莉子 on 2021-01-27 14:07:10
Question: In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it will generate code for exactly the code path of the current device. However, if __CUDA_ARCH__ appears within device code, it will generate a different code path for each device architecture specified in the compilation options (/arch). Can anyone confirm this is correct?

Answer 1: __CUDA_ARCH__, when used in device code, will carry a number defined to it that reflects the code architecture currently being compiled…
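
To make that concrete, a small sketch (the function name and threshold are made up): nvcc compiles the file once for the host and once per device target, and __CUDA_ARCH__ is defined only during the device passes, each time to that target's value.

    __host__ __device__ float scale(float v) {
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
        return v * 2.0f;   // compiled into the sm_70-and-newer device code path
    #elif defined(__CUDA_ARCH__)
        return v + v;      // compiled into older device code paths
    #else
        return v + v;      // host compilation pass: the macro is undefined
    #endif
    }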