cuda | 易学教程

CUDA complains about nvcc being an “unsupported toolchain”

Question: I've written a 1D convolution program in CUDA, but the executable does not run: CUDA complains that "the provided PTX was compiled with an unsupported toolchain" (the error is thrown by the first CUDA library call). I compiled the program with nvcc, using exactly this command: nvcc program.cu -o program and I ran the resulting executable with: ./program . Googling returns next to nothing. Any help?

Answer 1: This issue has been solved. The problem

Why is there a warp-level synchronization primitive in CUDA?

Question: I have two questions regarding __syncwarp() in CUDA. First, if I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized? If so, what exactly does __syncwarp() do, and why is it necessary? Second, say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not

Compiling using nvcc gives “No such file or directory”

Question: I'm trying to compile CUDA code using nvcc on Ubuntu. However, when I do, I get this output:

    > make
    /usr/local/cuda/bin/nvcc -m64 --ptxas-options="-v" -gencode arch=compute_11,code=sm_11 -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -o main main.cu
    gcc: No such file or directory
    make: *** [main] Error 1

Even when I try to compile a file with only a main function in it, it still doesn't work:

CUDA-Kernel supposed to be dynamic crashes depending upon block size

Question: I want to do a sparse matrix, dense vector multiplication. Let's assume the only storage format for compressing the entries in the matrix is compressed row storage (CRS). My kernel looks like the following:

    __global__ void krnlSpMVmul1(
        float *data_mat,
        int num_nonzeroes,
        unsigned int *row_ptr,
        float *data_vec,
        float *data_result)
    {
        extern __shared__ float local_result[];
        local_result[threadIdx.x] = 0;
        float vector_elem = data_vec[blockIdx.x];
        unsigned int start_index = row_ptr[blockIdx.x];
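For comparison with the kernel excerpt above, here is a minimal host-side C++ sketch of a CRS sparse matrix, dense vector product. It is not the asker's code: the function name spmv_crs and the col_idx array are assumptions for illustration (the excerpted kernel signature shows no column-index array), and the layout follows the usual CRS convention in which row_ptr[r] to row_ptr[r + 1] delimits the nonzeros of row r.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for y = A * x with A stored in CRS.
// values:  nonzero entries, stored row by row
// col_idx: column index of each nonzero (assumed; not in the excerpt)
// row_ptr: row r owns the nonzeros in [row_ptr[r], row_ptr[r + 1])
std::vector<float> spmv_crs(const std::vector<float>& values,
                            const std::vector<unsigned int>& col_idx,
                            const std::vector<unsigned int>& row_ptr,
                            const std::vector<float>& x)
{
    // Basic consistency checks on the CRS arrays.
    assert(!row_ptr.empty());
    assert(values.size() == col_idx.size());
    assert(row_ptr.back() == values.size());

    const std::size_t num_rows = row_ptr.size() - 1;
    std::vector<float> y(num_rows, 0.0f);

    for (std::size_t row = 0; row < num_rows; ++row) {
        float sum = 0.0f;
        // Accumulate the dot product of this row's nonzeros with x.
        for (unsigned int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
            assert(col_idx[k] < x.size());
            sum += values[k] * x[col_idx[k]];
        }
        y[row] = sum;
    }
    return y;
}
```

A reference like this can be used to check a GPU kernel's output for small inputs. For the 2x3 matrix with rows (1, 0, 2) and (0, 3, 0), the arrays would be values = {1, 2, 3}, col_idx = {0, 2, 1}, and row_ptr = {0, 2, 3}.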

Installing and compiling Caffe on Ubuntu 16.04, with notes on some problems encountered

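Preparation: a Python virtual environment is recommended.

[Creating a virtual environment with Anaconda]
Create: conda create -n caffeEnv python=3.6 (caffeEnv is the environment name)
Activate: source activate caffeEnv
Deactivate: deactivate

[Creating a virtual environment with Python virtualenv]
Create:
    pip install virtualenv
    sudo apt-get install virtualenv
    virtualenv caffeEnv -p /usr/bin/python3 (caffeEnv is the environment name; -p selects the Python version)
Activate: cd caffeEnv && source ./bin/activate
Deactivate: deactivate

Environment requirements: deep learning acceleration libraries and OpenCV
    cuda8.0+cudnn5.1+opencv3.4.0
    cuda9.1+cudnn7.0+opencv3.4.0
(I tried both 8.0+5.1 and 9.1+7.0, and both work.)

Installation tutorials: two other blog posts cover installing CUDA and cuDNN, and installing OpenCV.

Caffe dependencies:
    sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler
    sudo apt-get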

pyCuda, issues sending multiple single variable arguments

Question: I have a pycuda program here that reads in an image from the command line and saves a version back with the colors inverted:

    import pycuda.autoinit
    import pycuda.driver as device
    from pycuda.compiler import SourceModule as cpp
    import numpy as np
    import sys
    import cv2

    modify_image = cpp("""
    __global__ void modify_image(int pixelcount, unsigned char* inputimage, unsigned char* outputimage)
    {
        int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id >= pixelcount)
            return;
        outputimage[id] = 255 -

Is it possible to use CUDA in order to compute the frequency of elements inside a sorted array efficiently?

Question: I'm very new to CUDA. I've read a few chapters from books and a lot of tutorials online, and I have written my own implementations of vector addition and multiplication. I would like to go a little further, so let's say we want to implement a function that takes a sorted array of integers as input. Our goal is to find the frequency of each integer in the array. Sequentially, we could scan the array once to produce the output, with a time complexity of O(n). Since
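The sequential O(n) scan mentioned in the question can be sketched in host-side C++ as follows. This is only a baseline illustration, not a CUDA solution; the function name count_sorted and the (value, count) pair output format are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical baseline: returns (value, count) pairs for a sorted input
// using a single left-to-right pass, so the cost is O(n).
std::vector<std::pair<int, std::size_t>> count_sorted(const std::vector<int>& sorted)
{
    std::vector<std::pair<int, std::size_t>> freq;
    for (std::size_t i = 0; i < sorted.size(); ++i) {
        // The input is assumed to be sorted in non-decreasing order.
        assert(i == 0 || sorted[i - 1] <= sorted[i]);
        if (freq.empty() || freq.back().first != sorted[i]) {
            // A new value starts a new run with a count of one.
            freq.emplace_back(sorted[i], std::size_t{1});
        } else {
            // Same value as the previous element: extend the current run.
            ++freq.back().second;
        }
    }
    return freq;
}
```

Because the input is sorted, equal values are adjacent, so comparing each element with the last recorded value is enough to detect where one run ends and the next begins.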

error when copying dynamically allocated data in device to host?

Question: I recently ran into a problem when copying dynamically allocated data from the device to host memory. The data is allocated with malloc, and I copy it from device to host in a host function. Here is the code:

    #include <cuda.h>
    #include <stdio.h>

    #define N 100

    __device__ int* d_array;

    __global__ void allocDeviceMemory()
    {
        d_array = new int[N];
        for (int i = 0; i < N; i++)
            d_array[i] = 123;
    }

    int main()
    {
        allocDeviceMemory<<<1, 1>>>();
        cudaDeviceSynchronize();
        int* d_a = NULL;
        cudaMemcpyFromSymbol(