CUDA

cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Submitted by 谁说胖子不能爱 on 2020-07-04 13:21:08
Question: I get the following error when I run TensorFlow on the GPU:

    2018-09-15 18:56:51.011724: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
    Traceback (most recent call last):
      File "evaluate_sample.py", line 160, in <module>
        tf.app.run(main)
      File "/anaconda3/envs/tf/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "evaluate_sample.py"
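
This error means the NVIDIA display driver on the machine is older than the driver required by the CUDA runtime that this TensorFlow build links against; the fix is to upgrade the driver, or to install a TensorFlow build that targets an older CUDA toolkit. As a diagnostic, the two versions can be queried directly; a minimal sketch, separate from the asker's script:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int driver = 0, runtime = 0;
        cudaDriverGetVersion(&driver);    // highest CUDA version the installed driver supports
        cudaRuntimeGetVersion(&runtime);  // CUDA runtime this binary was built against
        printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
               driver / 1000, (driver % 1000) / 10,
               runtime / 1000, (runtime % 1000) / 10);
        return 0;
    }

If the first number is lower than the second, that is exactly the mismatch the error reports.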

CUB::BlockRadixSort: how to deal with the last tile which is not full?

Submitted by 痞子三分冷 on 2020-06-29 03:59:29
Question: There are 510 keys to sort. BLOCK_DIM_X = 128 and ITEMS_PER_THREAD = 4, so each tile covers 512 keys. We launch the kernel with 1 block. My kernel looks like this:

    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    int thread_data[4];
    BlockLoad(temp_storage.load).Load(in_data, thread_data);
    CTA_SYNC();
    BlockRadixSort(temp_storage.sort).Sort(thread_data);
    CTA_SYNC();
    BlockStore(temp_storage.store).Store(out_data, thread_data);
    CTA_SYNC();

The problem is that BlockRadixSort sorts 512 keys, not 510.
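
The standard CUB remedy for a partial tile is to use the guarded Load/Store overloads: fill the out-of-bounds slots with a sentinel that sorts last, and write back only the valid keys. A sketch under the question's parameters (the kernel name and union layout are my own choices):

    #include <cub/cub.cuh>
    #include <climits>

    __global__ void SortPartialTile(const int *in_data, int *out_data, int valid_items)
    {
        using BlockLoadT  = cub::BlockLoad<int, 128, 4>;
        using BlockSortT  = cub::BlockRadixSort<int, 128, 4>;
        using BlockStoreT = cub::BlockStore<int, 128, 4>;

        __shared__ union {
            typename BlockLoadT::TempStorage  load;
            typename BlockSortT::TempStorage  sort;
            typename BlockStoreT::TempStorage store;
        } temp_storage;

        int thread_data[4];

        // Guarded load: slots past valid_items (510 here) are filled with
        // INT_MAX, so the two padding keys sort to the very end of the tile.
        BlockLoadT(temp_storage.load).Load(in_data, thread_data, valid_items, INT_MAX);
        __syncthreads();

        BlockSortT(temp_storage.sort).Sort(thread_data);
        __syncthreads();

        // Guarded store: only the first valid_items keys are written back,
        // so the padding never reaches out_data.
        BlockStoreT(temp_storage.store).Store(out_data, thread_data, valid_items);
    }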

atomicCAS for bool implementation

Submitted by ♀尐吖头ヾ on 2020-06-28 23:55:32
Question: I'm trying to figure out whether there is a bug in the (now deleted) answer about implementing a CUDA-style atomicCAS for bools. The code from the answer (reformatted):

    static __inline__ __device__ bool atomicCAS(bool *address, bool compare, bool val)
    {
        unsigned long long addr = (unsigned long long)address;
        unsigned pos = addr & 3;              // byte position within the int
        int *int_addr = (int *)(addr - pos);  // int-aligned address
        int old = *int_addr, assumed, ival;
        do {
            assumed = old;
            if (val)
                ival =
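
The excerpt is cut off, but the technique it sketches is clear: emulate a byte-wide CAS with a 32-bit atomicCAS on the aligned word that contains the bool. A hedged reconstruction (the original answer is deleted, so everything past the quoted lines is my own completion):

    __device__ bool atomicCASBool(bool *address, bool compare, bool val)
    {
        unsigned long long addr = (unsigned long long)address;
        unsigned pos = addr & 3;                       // byte offset within the 32-bit word
        unsigned *int_addr = (unsigned *)(addr - pos); // word-aligned address
        unsigned shift = pos * 8;
        unsigned byte_mask = 0xFFu << shift;

        unsigned old = *int_addr, assumed;
        do {
            assumed = old;
            bool current = ((assumed >> shift) & 0xFFu) != 0;
            if (current != compare)
                return current;                        // no swap; report the value we saw
            // Splice the new byte into the word and attempt the hardware CAS.
            unsigned desired = (assumed & ~byte_mask) | ((val ? 1u : 0u) << shift);
            old = atomicCAS(int_addr, assumed, desired);
        } while (old != assumed);                      // a neighboring byte changed; retry
        return compare;                                // swap succeeded; old value equaled compare
    }

Two caveats with any such emulation: it assumes the bool is stored as byte value 0 or 1, and it can retry spuriously when other bytes in the same word are modified concurrently.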

How to enable CUDA 7.0+ per-thread default stream in Visual Studio 2013?

Submitted by 微笑、不失礼 on 2020-06-28 07:42:11
Question: I followed the method provided in "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" and tested it in VS2013 with CUDA 7.5. While the multi-stream example worked, the multi-threading one did not give the expected result. The code is as follows:

    #include <pthread.h>
    #include <cstdio>
    #include <cmath>

    #define CUDA_API_PER_THREAD_DEFAULT_STREAM
    #include "cuda.h"

    const int N = 1 << 20;

    __global__ void kernel(float *x, int n)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        for (int i = tid; i < n;
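
A likely explanation, based on the CUDA documentation for per-thread default streams (worth verifying against this setup): nvcc implicitly includes cuda_runtime.h at the top of every .cu file it compiles, so a #define of CUDA_API_PER_THREAD_DEFAULT_STREAM in the source arrives too late to take effect. The documented route for nvcc-compiled files is the compiler option instead, set on the command line or in the equivalent field of the VS project's CUDA C/C++ properties (the file name below is a placeholder):

    nvcc --default-stream per-thread -o pthread_test pthread_test.cu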

How to pass data bigger than the VRAM size into the GPU?

Submitted by 一世执手 on 2020-06-26 15:53:31
Question: I am trying to pass more data to my GPU than I have VRAM, which results in the following error:

    CudaAPIError: Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I created this code to reproduce the problem:

    from numba import cuda
    import numpy as np

    @cuda.jit()
    def addingNumbers(big_array, big_array2, save_array):
        i = cuda.grid(1)
        if i < big_array.shape[0]:
            for j in range(big_array.shape[1]):
                save_array[i][j] = big_array[i][j] * big_array2[i][j]

    big_array = np.random.random_sample(
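
The usual remedy is to stream the data through the GPU in chunks that fit in VRAM: allocate device buffers for a single chunk, then loop over the host arrays, copying each chunk in, running the kernel on it, and copying the result back out. A minimal CUDA C illustration of the pattern, mirroring the question's elementwise multiply (CHUNK_ROWS and COLS are placeholder sizes of my choosing):

    #include <cuda_runtime.h>

    #define CHUNK_ROWS 4096   /* rows per chunk; size so three buffers fit in VRAM */
    #define COLS       1024

    __global__ void multiplyChunk(const float *a, const float *b, float *out, int rows)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows)
            for (int j = 0; j < COLS; ++j)
                out[i * COLS + j] = a[i * COLS + j] * b[i * COLS + j];
    }

    void processInChunks(const float *h_a, const float *h_b, float *h_out, int total_rows)
    {
        size_t chunk_bytes = (size_t)CHUNK_ROWS * COLS * sizeof(float);
        float *d_a, *d_b, *d_out;
        cudaMalloc(&d_a, chunk_bytes);
        cudaMalloc(&d_b, chunk_bytes);
        cudaMalloc(&d_out, chunk_bytes);

        for (int row = 0; row < total_rows; row += CHUNK_ROWS) {
            int rows = total_rows - row < CHUNK_ROWS ? total_rows - row : CHUNK_ROWS;
            size_t bytes = (size_t)rows * COLS * sizeof(float);
            size_t off = (size_t)row * COLS;
            cudaMemcpy(d_a, h_a + off, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h_b + off, bytes, cudaMemcpyHostToDevice);
            multiplyChunk<<<(rows + 255) / 256, 256>>>(d_a, d_b, d_out, rows);
            cudaMemcpy(h_out + off, d_out, bytes, cudaMemcpyDeviceToHost);
        }
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    }

The same idea carries over to numba: slice the NumPy arrays, send each slice with cuda.to_device, and reuse one device allocation per chunk instead of transferring the whole array at once.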

Odd-Even Sort using CUDA Programming

Submitted by 梦想的初衷 on 2020-06-24 10:51:13
Question: I'm trying to implement an odd-even sort program in CUDA C. Whenever I include a 0 among the elements of the input array, the resulting array is not sorted properly; for other inputs it works. I don't understand what the problem with the code is. Here is my code:

    #include <stdio.h>
    #include <cuda.h>
    #define N 5

    __global__ void sort(int *c, int *count)
    {
        int l;
        if (*count % 2 == 0)
            l = *count / 2;
        else
            l = (*count / 2) + 1;
        for (int i = 0; i < l; i++) {
            if (threadIdx.x % 2 == 0)  // even
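
The excerpt stops before the compare-and-swap logic, so the exact bug can't be pinned down from what is quoted; common culprits in this pattern are a missing __syncthreads() between the even and odd phases, and comparing an out-of-range pair at the end of the array. For reference, a minimal single-block odd-even transposition sort that guards against both (my own sketch, not the asker's code):

    __global__ void oddEvenSort(int *c, int n)
    {
        int tid = threadIdx.x;
        // n alternating phases guarantee a sorted result within one block.
        for (int pass = 0; pass < n; ++pass) {
            int idx = 2 * tid + (pass % 2);  // even pass: (0,1),(2,3)...; odd pass: (1,2),(3,4)...
            if (idx + 1 < n && c[idx] > c[idx + 1]) {
                int tmp = c[idx];
                c[idx] = c[idx + 1];
                c[idx + 1] = tmp;
            }
            __syncthreads();                 // finish this phase before the next begins
        }
    }

Launched as oddEvenSort<<<1, (N + 1) / 2>>>(dev_c, N);, the idx + 1 < n guard keeps the last element from being compared against memory past the end of the array.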

Why does CUDA calculate each index in vector addition?

Submitted by 回眸只為那壹抹淺笑 on 2020-06-23 20:36:51
Question: In the following vector-addition example, why do we need the statement int tid = threadIdx.x;? I know it determines the index of the thread, but I am not sure why the index has to be assigned.

    __global__ void add(int *dev_a, int *dev_b, int *dev_c)
    {
        int tid = threadIdx.x;  // index of thread
        if (tid < N)
            dev_c[tid] = dev_a[tid] + dev_b[tid];
    }

    int main()
    {
        int a[N];
        int b[N];
        int c[N];
        for (int i = 0; i < N; i++) a[i] = i;
        for (int i = 0; i < N; i++) b[i] = i;
        int *dev_a;
        int *dev_b;
        int *dev_c;
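
For context (my gloss, not part of the excerpt): every thread launched for a kernel executes the same body, so threadIdx.x is the only thing that distinguishes one thread from another. Assigning it to tid is what maps each thread to the single element it adds; without it, there would be no way to tell thread 3 to work on element 3. With more than one block, the index generalizes to a global one:

    __global__ void add(const int *dev_a, const int *dev_b, int *dev_c, int n)
    {
        // Global index: each thread handles exactly one output element.
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < n)  // guard: the grid may contain more threads than elements
            dev_c[tid] = dev_a[tid] + dev_b[tid];
    }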
