gpu

Writing output files from CUDA devices

主宰稳场 submitted on 2019-12-31 03:51:11
Question: I am a newbie in CUDA programming, in the process of rewriting a C code as a parallelized CUDA code. Is there a way to write output data files directly from the device, without copying arrays from device to host? I assume that if cuPrintf exists, there must be a way to write a cuFprintf? Sorry if the answer has already been given in a previous topic; I can't seem to find it. Thanks! Answer 1: The short answer is no, there is not. cuPrintf and the built-in printf support in Fermi …
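Since device code cannot write files, the practical pattern is to copy results back and write them from the host. Below is a minimal host-side sketch under that assumption; `write_output` is a hypothetical helper name, and the CUDA copy is shown only as a comment:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper: writes one value per line to a text file.
// In a real CUDA program, host_buf would first be filled by something like
//   cudaMemcpy(host_buf.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
// after the kernel finishes.
bool write_output(const char *path, const std::vector<float> &host_buf) {
    FILE *f = std::fopen(path, "w");
    if (!f) return false;
    for (float v : host_buf) std::fprintf(f, "%g\n", v);
    return std::fclose(f) == 0;
}
```

The device-to-host copy is the unavoidable step; the file I/O itself then needs nothing CUDA-specific.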

CUDA Stream compaction: understanding the concept

徘徊边缘 submitted on 2019-12-30 19:27:47
Question: I am using CUDA/Thrust/CUDPP. As I understand it, in stream compaction certain items in an array are marked as invalid and then "removed". What does "removal" really mean here? Suppose the original array A has length 6, and 2 elements are invalid (by whatever condition we may provide). Does the system create a new array of size 4 in GPU memory to store the valid elements to get the final result, or does it physically remove the invalid elements from memory and shrink the original …
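The semantics in question can be mimicked on the host: compaction produces a new, densely packed array of the survivors and leaves the input untouched, which is how Thrust's `copy_if` behaves on the device as well. A sketch, with the validity predicate being just an assumed example:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Host sketch of stream compaction: valid elements (here, non-negative
// ones; the predicate is an assumption for illustration) are packed
// contiguously into a new output array. The input is neither shrunk
// nor modified.
std::vector<int> compact(const std::vector<int> &in) {
    std::vector<int> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out),
                 [](int x) { return x >= 0; });
    return out;
}
```

So a 6-element input with 2 invalid entries yields a new 4-element result; the original stays as allocated unless you explicitly overwrite or free it.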

vector step addition slower on cuda

梦想与她 submitted on 2019-12-30 09:53:13
Question: I am trying to run a vector step addition function in CUDA C++ code, but even for large float arrays of size 5,000,000 it runs slower than my CPU version. Below is the relevant CUDA and CPU code: #define THREADS_PER_BLOCK 1024 typedef float real; __global__ void vectorStepAddKernel2(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep) { int i = blockDim.x * blockIdx.x + threadIdx.x; if (i < size) { x[i*xstep] = alpha * y[i*ystep] + …
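For reference, the CPU loop being compared against likely computes the same strided update as the kernel, as sketched below. With large strides every access lands on a different cache line, so the operation is memory-bound on both processors, and the GPU additionally loses memory coalescing, which is one plausible reason the kernel trails the CPU version:

```cpp
// Host version of the kernel's update x[i*xstep] = alpha*y[i*ystep] + beta*z[i*zstep].
// (The excerpt truncates the kernel body; the beta*z term is assumed from
// the parameter list.)
void vectorStepAddCpu(float *x, const float *y, const float *z,
                      float alpha, float beta, int size,
                      int xstep, int ystep, int zstep) {
    for (int i = 0; i < size; ++i)
        x[i * xstep] = alpha * y[i * ystep] + beta * z[i * zstep];
}
```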

Is it possible to program GPU for Android

痞子三分冷 submitted on 2019-12-30 05:53:08
Question: I am programming on Android and I wonder whether we can use GPGPU on Android yet. I once heard that RenderScript may be able to execute on the GPU in the future, but is it possible to program the GPU directly today? And if so, where can I find tutorials or sample programs? Thank you for your help and suggestions. So far I know that the OpenGL ES library is accelerated using the GPU, but I want to use the GPU for …

What is actually a Queue family in Vulkan?

大城市里の小女人 submitted on 2019-12-30 03:13:08
Question: I am currently learning Vulkan. Right now I am taking apart each command and inspecting the structures to try to understand what they mean. At the moment I am analyzing queue families, for which I have the following code: vector<vk::QueueFamilyProperties> queue_families = device.getQueueFamilyProperties(); for (auto &q_family : queue_families) { cout << "Queue number: " + to_string(q_family.queueCount) << endl; cout << "Queue flags: " + to_string(q_family.queueFlags) << endl; } This produces …
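To make the printed `queueFlags` number readable, the bitmask can be decoded by hand. The bit values below match the Vulkan specification (VK_QUEUE_GRAPHICS_BIT = 0x1, COMPUTE = 0x2, TRANSFER = 0x4, SPARSE_BINDING = 0x8); they are hard-coded here only to keep the sketch free of Vulkan headers:

```cpp
#include <string>

// Decode a VkQueueFlags bitmask into capability names.
std::string describe_queue_flags(unsigned flags) {
    std::string s;
    if (flags & 0x1) s += "GRAPHICS ";
    if (flags & 0x2) s += "COMPUTE ";
    if (flags & 0x4) s += "TRANSFER ";
    if (flags & 0x8) s += "SPARSE_BINDING ";
    return s;
}
```

A printed flags value of 7, for example, marks a family whose queues can do graphics, compute, and transfer work.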

Matrix-vector multiplication in CUDA: benchmarking & performance

烈酒焚心 submitted on 2019-12-29 04:00:23
Question: I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific, and I updated the code). I implemented a kernel for matrix-vector multiplication in CUDA C, following the CUDA C Programming Guide and using shared memory. Let me first present some benchmarking results, which I obtained on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2), along with a comparison against cuBLAS. Here I guess cuBLAS does some magic, since it seems that its execution time is not affected …
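When benchmarking a custom matrix-vector kernel against cuBLAS, it helps to first validate correctness against a trivial host reference such as the sketch below (row-major storage for A is an assumption):

```cpp
// Host reference for y = A*x, with A stored row-major, m rows and n columns.
// Useful as a known-good baseline before timing a CUDA kernel.
void matvec_ref(const float *A, const float *x, float *y, int m, int n) {
    for (int r = 0; r < m; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < n; ++c)
            acc += A[r * n + c] * x[c];
        y[r] = acc;
    }
}
```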

Tensorflow Deep MNIST: Resource exhausted: OOM when allocating tensor with shape[10000,32,28,28]

与世无争的帅哥 submitted on 2019-12-29 03:21:26
Question: This is the sample MNIST code I am running: from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets('MNIST_data', one_hot=True) import tensorflow as tf sess = tf.InteractiveSession() x = tf.placeholder(tf.float32, shape=[None, 784]) y_ = tf.placeholder(tf.float32, shape=[None, 10]) W = tf.Variable(tf.zeros([784,10])) b = tf.Variable(tf.zeros([10])) y = tf.nn.softmax(tf.matmul(x,W) + b) def weight_variable(shape): initial = tf.truncated_normal(shape, stddev …
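The shape in the error message largely explains the OOM: an intermediate activation tensor for the entire 10,000-image test set is materialized in one feed. A quick footprint check (float32, 4 bytes per element) shows the size involved; the usual remedy is to evaluate accuracy in mini-batches rather than one `session.run` over the full test set:

```cpp
// Bytes needed for a float32 tensor of shape [n, c, h, w].
long long tensor_bytes(long long n, long long c, long long h, long long w) {
    return n * c * h * w * 4;
}
// tensor_bytes(10000, 32, 28, 28) comes to just over 1 GB for this single
// intermediate tensor, before counting any other activations or weights.
```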

Android emulator black screen with “GPU emulation - yes”

别来无恙 submitted on 2019-12-29 01:51:32
Question: I'm trying to use the new Android AVD feature "GPU emulation - yes". It's needed to use GLES 2.0 on the emulator. However, when I turn it on, the emulator screen goes blank. Here is the output when I call it from the command line. Answer 1: For those having the same problem with an NVIDIA card with Optimus: just install ironhide! sudo apt-add-repository ppa:mj-casalogic/ironhide sudo apt-get update sudo apt-get upgrade sudo apt-get install ironhide Then follow the …

cuBLAS argmin: segfault if outputting to device memory?

谁说胖子不能爱 submitted on 2019-12-29 01:40:09
Question: In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result) The cuBLAS programmer guide provides this information about the cublasIsamin() parameters. If I use host (CPU) memory for result, then cublasIsamin works properly. Here's an example: void argmin_experiment_hostOutput() { float h_A[4] = {1, 2, 3, 4}; int N = 4; float *d_A = 0; CHECK_CUDART …
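For reference, a host sketch of what `cublasIsamin` computes: the 1-based index (BLAS/Fortran convention) of the element with the smallest absolute value. As for the segfault itself, passing a device pointer for `result` while the handle is still in the default CUBLAS_POINTER_MODE_HOST is the usual culprit; switching the mode with `cublasSetPointerMode` is the usual fix.

```cpp
#include <cmath>

// Host equivalent of cublasIsamin: the 1-based index of the minimum |x[i]|
// over n elements with stride incx.
int isamin_host(const float *x, int n, int incx) {
    int best = 1;  // 1-based, matching the BLAS convention
    for (int i = 1; i < n; ++i)
        if (std::fabs(x[i * incx]) < std::fabs(x[(best - 1) * incx]))
            best = i + 1;
    return best;
}
```

Note the 1-based result: for the array {1, 2, 3, 4} in the question, the expected output is 1, not 0.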