CUDA

cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Submitted by 谁说胖子不能爱 on 2020-07-04 13:21:08
Question: I get the following error when I run TensorFlow on the GPU:

    2018-09-15 18:56:51.011724: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
    Traceback (most recent call last):
      File "evaluate_sample.py", line 160, in <module>
        tf.app.run(main)
      File "/anaconda3/envs/tf/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "evaluate_sample.py"
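
This error means the NVIDIA display driver on the machine is older than the driver required by the CUDA runtime that this TensorFlow build links against; the fix is to upgrade the driver, or to install a TensorFlow build that targets an older CUDA toolkit. As a diagnostic, the two versions can be queried directly; a minimal sketch, separate from the asker's script:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int driver = 0, runtime = 0;
        cudaDriverGetVersion(&driver);    // highest CUDA version the installed driver supports
        cudaRuntimeGetVersion(&runtime);  // CUDA runtime this binary was built against
        printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
               driver / 1000, (driver % 1000) / 10,
               runtime / 1000, (runtime % 1000) / 10);
        return 0;
    }

If the first number is lower than the second, that is exactly the mismatch the error reports.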

CUB::BlockRadixSort: how to deal with the last tile which is not full?

Submitted by 痞子三分冷 on 2020-06-29 03:59:29
Question: There are 510 keys to sort. BLOCK_DIM_X = 128 and ITEMS_PER_THREAD = 4, so each tile covers 512 keys. We launch the kernel with 1 block. My kernel looks like this:

    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    int thread_data[4];
    BlockLoad(temp_storage.load).Load(in_data, thread_data);
    CTA_SYNC();
    BlockRadixSort(temp_storage.sort).Sort(thread_data);
    CTA_SYNC();
    BlockStore(temp_storage.store).Store(out_data, thread_data);
    CTA_SYNC();

The problem is that BlockRadixSort sorts 512 keys, not 510.
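
The standard CUB remedy for a partial tile is to use the guarded Load/Store overloads: fill the out-of-bounds slots with a sentinel that sorts last, and write back only the valid keys. A sketch under the question's parameters (the kernel name and union layout are my own choices):

    #include <cub/cub.cuh>
    #include <climits>

    __global__ void SortPartialTile(const int *in_data, int *out_data, int valid_items)
    {
        using BlockLoadT  = cub::BlockLoad<int, 128, 4>;
        using BlockSortT  = cub::BlockRadixSort<int, 128, 4>;
        using BlockStoreT = cub::BlockStore<int, 128, 4>;

        __shared__ union {
            typename BlockLoadT::TempStorage  load;
            typename BlockSortT::TempStorage  sort;
            typename BlockStoreT::TempStorage store;
        } temp_storage;

        int thread_data[4];

        // Guarded load: slots past valid_items (510 here) are filled with
        // INT_MAX, so the two padding keys sort to the very end of the tile.
        BlockLoadT(temp_storage.load).Load(in_data, thread_data, valid_items, INT_MAX);
        __syncthreads();

        BlockSortT(temp_storage.sort).Sort(thread_data);
        __syncthreads();

        // Guarded store: only the first valid_items keys are written back,
        // so the padding never reaches out_data.
        BlockStoreT(temp_storage.store).Store(out_data, thread_data, valid_items);
    }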

atomicCAS for bool implementation

Submitted by ♀尐吖头ヾ on 2020-06-28 23:55:32
Question: I'm trying to figure out whether there is a bug in the (now deleted) answer about implementing a CUDA-style atomicCAS for bools. The code from the answer (reformatted):

    static __inline__ __device__ bool atomicCAS(bool *address, bool compare, bool val)
    {
        unsigned long long addr = (unsigned long long)address;
        unsigned pos = addr & 3;              // byte position within the int
        int *int_addr = (int *)(addr - pos);  // int-aligned address
        int old = *int_addr, assumed, ival;
        do {
            assumed = old;
            if (val)
                ival =
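
The excerpt is cut off, but the technique it sketches is clear: emulate a byte-wide CAS with a 32-bit atomicCAS on the aligned word that contains the bool. A hedged reconstruction (the original answer is deleted, so everything past the quoted lines is my own completion):

    __device__ bool atomicCASBool(bool *address, bool compare, bool val)
    {
        unsigned long long addr = (unsigned long long)address;
        unsigned pos = addr & 3;                       // byte offset within the 32-bit word
        unsigned *int_addr = (unsigned *)(addr - pos); // word-aligned address
        unsigned shift = pos * 8;
        unsigned byte_mask = 0xFFu << shift;

        unsigned old = *int_addr, assumed;
        do {
            assumed = old;
            bool current = ((assumed >> shift) & 0xFFu) != 0;
            if (current != compare)
                return current;                        // no swap; report the value we saw
            // Splice the new byte into the word and attempt the hardware CAS.
            unsigned desired = (assumed & ~byte_mask) | ((val ? 1u : 0u) << shift);
            old = atomicCAS(int_addr, assumed, desired);
        } while (old != assumed);                      // a neighboring byte changed; retry
        return compare;                                // swap succeeded; old value equaled compare
    }

Two caveats with any such emulation: it assumes the bool is stored as byte value 0 or 1, and it can retry spuriously when other bytes in the same word are modified concurrently.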

How to enable CUDA 7.0+ per-thread default stream in Visual Studio 2013?

Submitted by 微笑、不失礼 on 2020-06-28 07:42:11
Question: I followed the method provided in "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" and tested it in VS2013 with CUDA 7.5. While the multi-stream example worked, the multi-threading one did not give the expected result. The code is as follows:

    #include <pthread.h>
    #include <cstdio>
    #include <cmath>

    #define CUDA_API_PER_THREAD_DEFAULT_STREAM
    #include "cuda.h"

    const int N = 1 << 20;

    __global__ void kernel(float *x, int n)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        for (int i = tid; i < n;
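
A likely explanation, based on the CUDA documentation for per-thread default streams (worth verifying against this setup): nvcc implicitly includes cuda_runtime.h at the top of every .cu file it compiles, so a #define of CUDA_API_PER_THREAD_DEFAULT_STREAM in the source arrives too late to take effect. The documented route for nvcc-compiled files is the compiler option instead, set on the command line or in the equivalent field of the VS project's CUDA C/C++ properties (the file name below is a placeholder):

    nvcc --default-stream per-thread -o pthread_test pthread_test.cu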

How to pass data bigger than the VRAM size into the GPU?

Submitted by 一世执手 on 2020-06-26 15:53:31
Question: I am trying to pass more data to my GPU than I have VRAM, which results in the following error:

    CudaAPIError: Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I created this code to reproduce the problem:

    from numba import cuda
    import numpy as np

    @cuda.jit()
    def addingNumbers(big_array, big_array2, save_array):
        i = cuda.grid(1)
        if i < big_array.shape[0]:
            for j in range(big_array.shape[1]):
                save_array[i][j] = big_array[i][j] * big_array2[i][j]

    big_array = np.random.random_sample(
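
The usual remedy is to stream the data through the GPU in chunks that fit in VRAM: allocate device buffers for a single chunk, then loop over the host arrays, copying each chunk in, running the kernel on it, and copying the result back out. A minimal CUDA C illustration of the pattern, mirroring the question's elementwise multiply (CHUNK_ROWS and COLS are placeholder sizes of my choosing):

    #include <cuda_runtime.h>

    #define CHUNK_ROWS 4096   /* rows per chunk; size so three buffers fit in VRAM */
    #define COLS       1024

    __global__ void multiplyChunk(const float *a, const float *b, float *out, int rows)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows)
            for (int j = 0; j < COLS; ++j)
                out[i * COLS + j] = a[i * COLS + j] * b[i * COLS + j];
    }

    void processInChunks(const float *h_a, const float *h_b, float *h_out, int total_rows)
    {
        size_t chunk_bytes = (size_t)CHUNK_ROWS * COLS * sizeof(float);
        float *d_a, *d_b, *d_out;
        cudaMalloc(&d_a, chunk_bytes);
        cudaMalloc(&d_b, chunk_bytes);
        cudaMalloc(&d_out, chunk_bytes);

        for (int row = 0; row < total_rows; row += CHUNK_ROWS) {
            int rows = total_rows - row < CHUNK_ROWS ? total_rows - row : CHUNK_ROWS;
            size_t bytes = (size_t)rows * COLS * sizeof(float);
            size_t off = (size_t)row * COLS;
            cudaMemcpy(d_a, h_a + off, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(d_b, h_b + off, bytes, cudaMemcpyHostToDevice);
            multiplyChunk<<<(rows + 255) / 256, 256>>>(d_a, d_b, d_out, rows);
            cudaMemcpy(h_out + off, d_out, bytes, cudaMemcpyDeviceToHost);
        }
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    }

The same idea carries over to numba: slice the NumPy arrays, send each slice with cuda.to_device, and reuse one device allocation per chunk instead of transferring the whole array at once.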

Odd-Even Sort using CUDA Programming

Submitted by 梦想的初衷 on 2020-06-24 10:51:13
Question: I'm trying to implement an odd-even sort program in CUDA C. Whenever I include a 0 among the elements of the input array, the resulting array is not sorted properly; for other inputs it works. I don't understand what the problem with the code is. Here is my code:

    #include <stdio.h>
    #include <cuda.h>
    #define N 5

    __global__ void sort(int *c, int *count)
    {
        int l;
        if (*count % 2 == 0)
            l = *count / 2;
        else
            l = (*count / 2) + 1;
        for (int i = 0; i < l; i++) {
            if (threadIdx.x % 2 == 0)  // even
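
The excerpt stops before the compare-and-swap logic, so the exact bug can't be pinned down from what is quoted; common culprits in this pattern are a missing __syncthreads() between the even and odd phases, and comparing an out-of-range pair at the end of the array. For reference, a minimal single-block odd-even transposition sort that guards against both (my own sketch, not the asker's code):

    __global__ void oddEvenSort(int *c, int n)
    {
        int tid = threadIdx.x;
        // n alternating phases guarantee a sorted result within one block.
        for (int pass = 0; pass < n; ++pass) {
            int idx = 2 * tid + (pass % 2);  // even pass: (0,1),(2,3)...; odd pass: (1,2),(3,4)...
            if (idx + 1 < n && c[idx] > c[idx + 1]) {
                int tmp = c[idx];
                c[idx] = c[idx + 1];
                c[idx + 1] = tmp;
            }
            __syncthreads();                 // finish this phase before the next begins
        }
    }

Launched as oddEvenSort<<<1, (N + 1) / 2>>>(dev_c, N);, the idx + 1 < n guard keeps the last element from being compared against memory past the end of the array.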

Why does CUDA calculate each index in vector addition?

Submitted by 回眸只為那壹抹淺笑 on 2020-06-23 20:36:51
Question: In the following vector-addition example, why do we need the statement int tid = threadIdx.x;? I know it determines the index of the thread, but I am not sure why the index has to be assigned.

    __global__ void add(int *dev_a, int *dev_b, int *dev_c)
    {
        int tid = threadIdx.x;  // index of thread
        if (tid < N)
            dev_c[tid] = dev_a[tid] + dev_b[tid];
    }

    int main()
    {
        int a[N];
        int b[N];
        int c[N];
        for (int i = 0; i < N; i++) a[i] = i;
        for (int i = 0; i < N; i++) b[i] = i;
        int *dev_a;
        int *dev_b;
        int *dev_c;
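
For context (my gloss, not part of the excerpt): every thread launched for a kernel executes the same body, so threadIdx.x is the only thing that distinguishes one thread from another. Assigning it to tid is what maps each thread to the single element it adds; without it, there would be no way to tell thread 3 to work on element 3. With more than one block, the index generalizes to a global one:

    __global__ void add(const int *dev_a, const int *dev_b, int *dev_c, int n)
    {
        // Global index: each thread handles exactly one output element.
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < n)  // guard: the grid may contain more threads than elements
            dev_c[tid] = dev_a[tid] + dev_b[tid];
    }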
