gpu

thrust::copy doesn't work for device_vectors [duplicate]

☆樱花仙子☆ submitted on 2019-12-12 05:38:57
Question: This question already has an answer here: cuda thrust::remove_if throws "thrust::system::system_error" for device_vector? (1 answer). Closed 3 years ago. I copied this code from the Thrust documentation:

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>

    int main()
    {
        thrust::device_vector<int> vec0(100);
        thrust::device_vector<int> vec1(100);
        thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
        return 0;
    }

When I run this in Debug mode (VS2012), my …
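The duplicate linked above suggests the usual failure mode: Thrust surfaces the problem as a thrust::system::system_error, often because the Debug configuration builds device code for a compute capability the installed GPU cannot run. A minimal sketch for catching and printing that error, using the same example program (the try/catch wrapper is my addition, not part of the original question):

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/system_error.h>
    #include <cstdio>

    int main()
    {
        try {
            thrust::device_vector<int> vec0(100);
            thrust::device_vector<int> vec1(100);
            thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
        } catch (thrust::system_error &e) {
            // Often reports "invalid device function" when the build
            // targets an architecture the GPU cannot execute.
            std::printf("Thrust error: %s\n", e.what());
            return 1;
        }
        return 0;
    }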

device function pointers as struct members

谁说胖子不能爱 submitted on 2019-12-12 05:16:13
Question: I have this (working) CPU code:

    #define NF 3

    int ND;

    typedef double (*POT)(double x, double y);

    typedef struct {
        POT pot[NF];
    } DATAMPOT;

    DATAMPOT *datampot;

    double func0(double x, double y);
    double func1(double x, double y);
    double func2(double x, double y);

    int main(void)
    {
        int i;
        ND = 5;
        datampot = (DATAMPOT *)malloc(ND * sizeof(DATAMPOT));
        for (i = 0; i < ND; i++) {
            datampot[i].pot[0] = func0;
            datampot[i].pot[1] = func1;
            datampot[i].pot[2] = func2;
        }
        return 0;
    }

Now I try a GPU version like this:

    #define NF 3
    …
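The GPU version of this pattern usually trips over one rule: host code cannot take the address of a __device__ function. A minimal sketch of one common workaround, filling the function-pointer table from inside a kernel (the struct and names mirror the question; the setup kernel and dummy function bodies are assumptions for illustration):

    #include <cuda_runtime.h>

    #define NF 3

    typedef double (*POT)(double x, double y);

    typedef struct {
        POT pot[NF];
    } DATAMPOT;

    __device__ double func0(double x, double y) { return x + y; }
    __device__ double func1(double x, double y) { return x - y; }
    __device__ double func2(double x, double y) { return x * y; }

    // Device code may take the address of a __device__ function;
    // host code may not, so the table is filled on the device.
    __global__ void init_pointers(DATAMPOT *d, int nd)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nd) {
            d[i].pot[0] = func0;
            d[i].pot[1] = func1;
            d[i].pot[2] = func2;
        }
    }

    int main(void)
    {
        int nd = 5;
        DATAMPOT *d_datampot;
        cudaMalloc(&d_datampot, nd * sizeof(DATAMPOT));
        init_pointers<<<1, nd>>>(d_datampot, nd);
        cudaDeviceSynchronize();
        cudaFree(d_datampot);
        return 0;
    }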

How can I make IDCT run faster on my GPU?

陌路散爱 submitted on 2019-12-12 04:57:09
Question: I am trying to optimize the IDCT from this code for the GPU. The GPU on my system is an NVIDIA Tesla K20c. The IDCT function as written in the original code looks like this:

    void IDCT(int32_t *input, uint8_t *output)
    {
        int32_t Y[64];
        int32_t k, l;
        for (k = 0; k < 8; k++) {
            for (l = 0; l < 8; l++)
                Y(k, l) = SCALE(input[(k << 3) + l], S_BITS);
            idct_1d(&Y(k, 0));
        }
        for (l = 0; l < 8; l++) {
            int32_t Yc[8];
            for (k = 0; k < 8; k++)
                Yc[k] = Y(k, l);
            idct_1d(Yc);
            for (k = 0; k < 8; k++) {
                int32_t r …
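Although the question is cut off, a common starting point for mapping this transform to CUDA is one 8x8 block per thread block and one row/column per thread, staged in shared memory. A rough sketch under those assumptions: SCALE, S_BITS, and a __device__ idct_1d are taken from the question's code base and assumed available on the device, and the final descale-and-clamp to uint8_t is omitted because that part of the code is truncated:

    __global__ void idct_kernel(const int32_t *input, int32_t *output,
                                int num_blocks)
    {
        __shared__ int32_t Y[8][8];
        int blk = blockIdx.x;      // one 8x8 block per thread block
        int t = threadIdx.x;       // 8 threads, one row/column each
        if (blk >= num_blocks) return;
        const int32_t *in = input + blk * 64;
        int32_t *out = output + blk * 64;

        // Row pass: each thread scales and transforms one row.
        for (int l = 0; l < 8; l++)
            Y[t][l] = SCALE(in[(t << 3) + l], S_BITS);
        idct_1d(&Y[t][0]);
        __syncthreads();

        // Column pass: each thread gathers and transforms one column.
        int32_t Yc[8];
        for (int k = 0; k < 8; k++)
            Yc[k] = Y[k][t];
        idct_1d(Yc);
        for (int k = 0; k < 8; k++)
            out[(k << 3) + t] = Yc[k];
    }

With only 8 threads per block this under-occupies the GPU, so in practice one would batch several 8x8 blocks per thread block; the sketch keeps the mapping as simple as possible.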

CUDA kernel with reduction - logic errors for dot product of 2 matrices

二次信任 submitted on 2019-12-12 04:49:57
Question: I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices, but I am getting the right answer only for matrices of size 2; for any other size I get it wrong. This is only a test, so I am keeping the matrix size very small, only about 100, so a single block fits it all. Any help would be greatly appreciated. Thanks! Here is the regular code:

    float *ha = new float[n]; // …
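The question's kernel is cut off, but a single-block dot-product reduction of the usual shape looks like the sketch below (per-thread partial sums into shared memory, then a tree reduction with __syncthreads() between halving steps). Getting the right answer only for size 2 is a classic symptom of a missing synchronization or of a reduction loop that assumes the element count equals the thread count:

    __global__ void dot_product(const float *a, const float *b,
                                float *result, int n)
    {
        __shared__ float cache[256];   // one slot per thread; launch with 256 threads
        int tid = threadIdx.x;
        float sum = 0.0f;
        for (int i = tid; i < n; i += blockDim.x)
            sum += a[i] * b[i];        // each thread accumulates a strided slice
        cache[tid] = sum;
        __syncthreads();

        // Tree reduction: halve the active threads each step
        // (assumes blockDim.x is a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                cache[tid] += cache[tid + s];
            __syncthreads();           // required before the next halving step
        }
        if (tid == 0)
            *result = cache[0];
    }

A hypothetical launch for the single-block case in the question would be dot_product<<<1, 256>>>(da, db, dres, n);.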

Training Multi-GPU on Tensorflow: a simpler way?

谁都会走 submitted on 2019-12-12 04:29:42
Question: I have been using the training method proposed in the cifar10_multi_gpu_train example for (local) multi-GPU training, i.e., creating several towers and then averaging the gradients. However, I was wondering: what happens if I just take the losses coming from the different GPUs, sum them up, and then apply gradient descent to that new loss? Would that work? Probably this is a silly question, and there must be a limitation somewhere. So I would be happy if you could comment …
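For intuition, the two schemes differ only by a constant factor because differentiation is linear (assuming all N towers share the same variables theta):

    grad_theta( sum_i L_i ) = sum_i grad_theta( L_i )
                            = N * grad_theta( (1/N) * sum_i L_i )

So summing the tower losses gives the averaged-gradient update scaled by N, a factor that can be absorbed into the learning rate; any practical difference concerns where TensorFlow places the backward-pass ops, not the mathematics.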

CUDA to solve many “small/moderate” linear systems

北城余情 submitted on 2019-12-12 04:24:33
Question: Some background info on the problem I am trying to speed up using CUDA: I have a large number of small/moderate same-sized linear systems I need to solve independently. Each linear system is square, real, dense, invertible, and non-symmetric. These are actually matrix systems, so each system looks like AX = B, where A, X, and B are (n x n) matrices. In this previous question, CUBLAS batch and matrix sizes, I learned that cuBLAS batched operations give the best performance for matrices of size …
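The standard cuBLAS route for many same-sized dense systems is batched LU: cublasSgetrfBatched followed by cublasSgetrsBatched. A skeletal sketch of the call sequence, with error checking elided and the device-side arrays of per-system matrix pointers (d_Aptrs, d_Bptrs) assumed to be set up already:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Solve A_i X_i = B_i for 'batch' independent (n x n) systems.
    // d_Aptrs and d_Bptrs are DEVICE arrays of device pointers, one per
    // system; each A_i is overwritten by its LU factors and each B_i by X_i.
    void solve_batched(float **d_Aptrs, float **d_Bptrs, int n, int batch)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        int *d_pivots, *d_infoLU, h_infoSolve = 0;
        cudaMalloc(&d_pivots, n * batch * sizeof(int));
        cudaMalloc(&d_infoLU, batch * sizeof(int));

        // In-place batched LU factorization of every A_i.
        cublasSgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_infoLU, batch);

        // Batched back-substitution; nrhs = n because X and B are (n x n).
        // Note the info argument here is a HOST pointer.
        cublasSgetrsBatched(handle, CUBLAS_OP_N, n, n,
                            (const float **)d_Aptrs, n, d_pivots,
                            d_Bptrs, n, &h_infoSolve, batch);

        cudaFree(d_pivots);
        cudaFree(d_infoLU);
        cublasDestroy(handle);
    }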

NVIDIA CUDA GPU computing questions

天大地大妈咪最大 submitted on 2019-12-12 03:54:00
Question: I installed tensorflow-gpu on Win10. I am trying a Keras training example to test GPU computing. It loaded all the CUDA libraries successfully but shows the following:

    Train on 60000 samples, validate on 10000 samples
    Epoch 1/100
    I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties:
    name: GeForce GTX 970M
    major: 5 minor: 2
    memoryClockRate (GHz) 1.038
    pciBusID 0000:01:00.0
    Total memory: 3.00GiB
    Free …

CUDA 7.5 install on Mac missing nvrtc

佐手、 submitted on 2019-12-12 03:27:04
Question: According to the documentation, when I install the CUDA 7.5 Toolkit on my Mac (OS X 10.11) I should get the nvrtc files with it. I do not. Where do I pick up the nvrtc header files and libraries? Were they supposed to be in the bundle and left out? Were they deprecated or replaced with something else?

Answer 1: So the trick is: 1) Install Xcode (from the App Store) FIRST. After the App Store is done installing it, you have to go into your Application menu and actually run it and accept the license. …
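Once the files are in place, a quick sanity check that both nvrtc.h and libnvrtc are usable is a tiny version probe; the build line (something like nvcc check.c -lnvrtc) and install paths are assumptions that vary by setup:

    #include <nvrtc.h>
    #include <stdio.h>

    int main(void)
    {
        int major = 0, minor = 0;
        // nvrtcVersion reports the NVRTC library version; if this
        // compiles, links, and runs, the header and library were found.
        nvrtcResult res = nvrtcVersion(&major, &minor);
        if (res != NVRTC_SUCCESS) {
            printf("nvrtcVersion failed: %s\n", nvrtcGetErrorString(res));
            return 1;
        }
        printf("NVRTC version %d.%d\n", major, minor);
        return 0;
    }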

How to include and use the OpenCV 3.1.0 library in a CUDA file (.cu)?

纵饮孤独 submitted on 2019-12-12 03:18:45
Question: I tried to implement my own median-filter kernel, like this pseudocode:

    // main.cpp
    #include "opencv2/opencv.hpp"
    cv::Mat inputMat = cv::imread()
    cudaMedianCaller(inputMat, kernelMat);

    // medianFilter.h
    #include "opencv2/opencv.hpp"
    void cudaMedianCaller(const cv::Mat& inputMat, cv::Mat& kernelMat);

    // medianFilter.cu
    void cudaMedianCaller(const cv::Mat& inputMat, cv::Mat& kernelMat)
    {
        kernelMedianFilter<<< , >>>(d_inputMat, d_kernelMat);
    }

    __global__ void kernelMedianFilter(uchar3* d …
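The usual arrangement is to keep the kernel plus a plain C++ launcher in the .cu file (compiled by nvcc) and call that launcher from main.cpp; OpenCV types stay in host code, and only cv::Mat's raw pixel pointer crosses to the device. A rough sketch of the .cu side under those assumptions: the kernel body is a placeholder copy, since a real median filter's neighborhood selection is beyond the fragment shown, and the input is assumed to be a continuous CV_8UC3 image:

    // medianFilter.cu -- compiled with nvcc
    #include <opencv2/opencv.hpp>
    #include <cuda_runtime.h>

    // Placeholder kernel: a real median filter would gather a pixel
    // neighborhood and select the middle value; this just copies.
    __global__ void kernelMedianFilter(const uchar3 *in, uchar3 *out,
                                       int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)
            out[y * w + x] = in[y * w + x];
    }

    void cudaMedianCaller(const cv::Mat &input, cv::Mat &output)
    {
        int w = input.cols, h = input.rows;
        size_t bytes = (size_t)w * h * sizeof(uchar3);
        uchar3 *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, input.ptr<uchar3>(0), bytes, cudaMemcpyHostToDevice);

        dim3 block(16, 16);
        dim3 grid((w + 15) / 16, (h + 15) / 16);
        kernelMedianFilter<<<grid, block>>>(d_in, d_out, w, h);

        output.create(h, w, CV_8UC3);
        cudaMemcpy(output.ptr<uchar3>(0), d_out, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }

main.cpp then only needs the declaration from medianFilter.h and can be compiled with g++ while nvcc compiles the .cu file; the two objects are linked together against the OpenCV and CUDA runtime libraries.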

Information on current GPU Architectures

[亡魂溺海] submitted on 2019-12-12 02:27:48
Question: I have decided that my bachelor's thesis will be about general-purpose GPU computing and which problems are better suited for it than others. I am also trying to find out whether there are any major differences between the current GPU architectures that may affect this. I am currently looking for scientific papers and/or information directly from the manufacturers about the current GPU architectures, but I can't seem to find anything that looks detailed enough. Therefore, I am hoping that …