gpu

How can I write the memory pointer in CUDA [duplicate]

Posted by 我的未来我决定 on 2019-12-13 07:41:43
Question: This question already has an answer here: Summing the rows of a matrix (stored in either row-major or column-major order) in CUDA (1 answer). Closed 2 years ago. I declared two GPU memory pointers, allocated the GPU memory, transferred the data, and launched the kernel in main: // declare GPU memory pointers char * gpuIn; char * gpuOut; // allocate GPU memory cudaMalloc(&gpuIn, ARRAY_BYTES); cudaMalloc(&gpuOut, ARRAY_BYTES); // transfer the array to the GPU cudaMemcpy(gpuIn, currIn, ARRAY …
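For reference, the full shape of the pattern the excerpt describes (allocate, copy in, launch, copy back, free) looks roughly like the sketch below. The kernel, the sizes, and everything other than gpuIn, gpuOut, currIn and ARRAY_BYTES are illustrative placeholders, not taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: copies each input byte to the output array.
__global__ void copyKernel(const char *in, char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int ARRAY_SIZE  = 1 << 20;          // assumed size
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(char);

    char *currIn  = new char[ARRAY_SIZE]();   // host input (zero-initialized)
    char *currOut = new char[ARRAY_SIZE]();   // host output

    // declare GPU memory pointers
    char *gpuIn  = nullptr;
    char *gpuOut = nullptr;

    // allocate GPU memory
    cudaMalloc(&gpuIn,  ARRAY_BYTES);
    cudaMalloc(&gpuOut, ARRAY_BYTES);

    // transfer the input array to the GPU
    cudaMemcpy(gpuIn, currIn, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel
    copyKernel<<<(ARRAY_SIZE + 255) / 256, 256>>>(gpuIn, gpuOut, ARRAY_SIZE);

    // copy the result back to the host
    cudaMemcpy(currOut, gpuOut, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    printf("first output byte: %d\n", currOut[0]);

    cudaFree(gpuIn);
    cudaFree(gpuOut);
    delete[] currIn;
    delete[] currOut;
    return 0;
}
```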

Why do we not have access to device memory on the host side?

Posted by 血红的双手。 on 2019-12-13 06:21:41
Question: I asked a question, "Memory allocated using cudaMalloc() is accessable by host or not?", and though things are much clearer to me now, I am still wondering why it is not possible to access the device pointer on the host. My understanding is that the CUDA driver takes care of memory allocation inside GPU DRAM. So this information (what the first address of the allocated memory on the device is) could be conveyed to the OS running on the host. Then it should be possible to access this device pointer, i.e. the …
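The short version is that the address returned by cudaMalloc() lives in the GPU's own address space and is not mapped into the host process's page tables, so dereferencing it on the host is undefined; data moves across with cudaMemcpy, or the allocation can be made managed (CUDA 6+ on a supported GPU) if one pointer usable on both sides is what is really wanted. A minimal sketch, not taken from the question:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int *d_value = nullptr;
    cudaMalloc(&d_value, sizeof(int));   // d_value holds a *device* address

    // *d_value = 42;                    // invalid on the host: the pointer refers to
                                         // GPU memory that the CPU cannot dereference

    int h_value = 42;
    cudaMemcpy(d_value, &h_value, sizeof(int), cudaMemcpyHostToDevice);  // the valid way

    // Managed (unified) memory is the exception: the driver migrates the pages,
    // so the same pointer works on both host and device (CUDA 6+, supported GPUs).
    int *m_value = nullptr;
    cudaMallocManaged(&m_value, sizeof(int));
    *m_value = 42;                       // legal host access to a managed allocation
    printf("managed value = %d\n", *m_value);

    cudaFree(d_value);
    cudaFree(m_value);
    return 0;
}
```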

OpenCL: Minimal configuration to work with AMD GPU

Posted by  ̄綄美尐妖づ on 2019-12-13 06:12:41
Question: Suppose we have an AMD GPU (for example a Radeon HD 7970) and a minimal Linux system without X, etc. What should be installed, what should be launched, and how should it be launched to get a proper OpenCL environment? Ideally it should be a headless environment. Requirements for the environment: the GPU is visible to OpenCL programs (clinfo, for example); it is possible to monitor the temperature and set the fan speed (for example using aticonfig). P.S. Simply installing an X server and Catalyst and running X :0 won't work …
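Separately from the driver installation itself, the "GPU visible to OpenCL programs" requirement can be verified with a few lines against the standard OpenCL C API once a vendor ICD is in place. A minimal clinfo-style sketch (build with something like g++ check_cl.cpp -lOpenCL; nothing here is AMD-specific):

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS) {
        printf("no OpenCL platforms found (is the ICD installed?)\n");
        return 1;
    }
    printf("platforms: %u\n", num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU,
                           8, devices, &num_devices) != CL_SUCCESS)
            continue;                      // no GPU devices on this platform
        for (cl_uint d = 0; d < num_devices; ++d) {
            char name[256] = {0};
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("  GPU device: %s\n", name);
        }
    }
    return 0;
}
```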

Why is cv::gpu::GaussianBlur slower than cv::GaussianBlur?

Posted by 喜你入骨 on 2019-12-13 05:27:20
Question: I'm not a pro in C++, OpenCV, or CUDA, and I don't understand why cv::gpu::warpPerspective(g_mask, g_frame, warp_matrix, g_frame.size(), cv::INTER_LINEAR, cv::BORDER_CONSTANT, cv::Scalar(255,255,255)); cv::gpu::GaussianBlur(g_frame, g_frame, cv::Size(blur_radius, blur_radius), 0); g_frame.download(mask); is slower than cv::gpu::warpPerspective(g_mask, g_frame, warp_matrix, g_frame.size(), cv::INTER_LINEAR, cv::BORDER_CONSTANT, cv::Scalar(255,255,255)); g_frame.download(mask); cv::GaussianBlur …
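A frequent explanation for measurements like this is that the first call into the gpu module pays for CUDA context creation and filter initialization, and that the device-to-host download is charged to the GPU path while the CPU path works on data already in system memory. A rough timing sketch against the old OpenCV 2.4 cv::gpu API that the question uses, with a warm-up call before the timed one (image size and kernel size are illustrative):

```cpp
#include <cstdio>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>   // old OpenCV 2.4.x GPU module

int main() {
    cv::Mat frame(1080, 1920, CV_8UC1, cv::Scalar(128));  // illustrative input
    cv::gpu::GpuMat g_frame, g_blurred;

    g_frame.upload(frame);
    cv::gpu::GaussianBlur(g_frame, g_blurred, cv::Size(5, 5), 0);  // warm-up: pays init cost

    int64 t0 = cv::getTickCount();
    cv::gpu::GaussianBlur(g_frame, g_blurred, cv::Size(5, 5), 0);  // timed call
    cv::Mat result;
    g_blurred.download(result);                                    // transfer cost included
    double ms = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();

    printf("warmed-up GPU blur + download: %.2f ms\n", ms);
    return 0;
}
```

For a single small frame the transfer can easily dominate, which is why keeping intermediate results in GpuMat and batching more work on the device is usually what makes the GPU path pay off.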

OpenGL: How to render 2 simple VAOs on Intel HD Graphics 4000 GPU?

Posted by 做~自己de王妃 on 2019-12-13 04:54:24
Question: Summary: my original question and observations are followed by updated, working OpenGL code for the Intel HD Graphics 4000 GPU. Original question: two cubes are shown on an Nvidia NVS 4200M GPU, but only one cube is shown on an Intel HD Graphics 4000 GPU. Using the OpenGL 3.2 forward profile and OpenTK to render two simple cubes on screen, it shows only the first cube, centered at (0,0,0), on the Intel HD Graphics 4000 with the latest GPU driver (7/2/2014, ver. 10.18.0010.3621). It should show two cubes. We're using a Vertex Array …

GPU vs CPU render mode in Adobe AIR

Posted by 夙愿已清 on 2019-12-13 04:34:33
Question: I asked the following question: BitmapData lock and unlock not working on Android. Now, having encountered that issue and read about render modes, I'm very confused about how such a simple script fails in GPU mode but is very fast in CPU mode. So the question is: how do GPU mode and CPU mode work in Adobe AIR, and why does most content work better in GPU mode, but not that script? Note: the base bitmap size should be bigger than 1400x1400. Answer 1: There are some limitations in GPU render mode. Adobe recommends …

Why is my PCL CUDA code running on the CPU instead of the GPU?

Posted by 人走茶凉 on 2019-12-13 03:44:01
Question: I have code where I use the pcl/gpu namespace: pcl::gpu::Octree::PointCloud clusterCloud; clusterCloud.upload(cloud_filtered->points); pcl::gpu::Octree::Ptr octree_device (new pcl::gpu::Octree); octree_device->setCloud(clusterCloud); octree_device->build(); /*tree->setCloud (clusterCloud);*/ // Create the cluster extractor object for the planar model and set all the parameters std::vector<pcl::PointIndices> cluster_indices; pcl::gpu::EuclideanClusterExtraction ec; ec.setClusterTolerance (0 …
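Before looking at PCL itself, it is worth ruling out the trivial case where no CUDA device is visible to the process at runtime (driver issues, CUDA_VISIBLE_DEVICES, and so on). A small CUDA runtime check, independent of PCL and not taken from the question:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("no usable CUDA device: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```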

Keras multi_gpu_model causes system to crash

Posted by 主宰稳场 on 2019-12-13 03:33:32
Question: I am trying to train a rather large LSTM on a large dataset and have 4 GPUs to distribute the load. If I try to train on just one of them (any of them; I've tried each) it works correctly, but after adding the multi_gpu_model code it crashes my entire system when I try to run it. Here is my multi-GPU code: batch_size = 8 model = Sequential() model.add(Masking(mask_value=0., input_shape=(len(inputData[0]), len(inputData[0][0])) )) model.add(LSTM(256, return_sequences=True)) model.add …

“Peer access” failed when using pycuda and tensorflow together

Posted by て烟熏妆下的殇ゞ on 2019-12-13 03:01:48
Question: I have some Python 3 code like this: import numpy as np import pycuda.driver as cuda from pycuda.compiler import SourceModule, compile import tensorflow as tf # create device and context cudadevice=cuda.Device(gpuid1) cudacontext=cudadevice.make_context() config = tf.ConfigProto() config.gpu_options.visible_device_list="{}".format(gpuid2) sess = tf.Session(config=config) # compile from a .cu file cuda_mod = SourceModule(cudaCode, include_dirs = [dir_path], no_extern_c = True, options = ['-O0 …
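"Peer access" here is the wording CUDA uses for direct GPU-to-GPU memory access, which frameworks often try to enable for every pair of visible devices. The sketch below is only a generic runtime-API check of whether two device IDs can be peers and what error enabling access returns; it is not a fix for the pycuda/TensorFlow interaction, and the device IDs are illustrative:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) {
        printf("only %d device(s) visible, peer access does not apply\n", count);
        return 0;
    }

    const int dev0 = 0, dev1 = 1;        // illustrative device ids
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, dev0, dev1);
    cudaDeviceCanAccessPeer(&can10, dev1, dev0);
    printf("peer access %d->%d: %s, %d->%d: %s\n",
           dev0, dev1, can01 ? "yes" : "no",
           dev1, dev0, can10 ? "yes" : "no");

    if (can01) {
        cudaSetDevice(dev0);
        cudaError_t err = cudaDeviceEnablePeerAccess(dev1, 0);
        // cudaErrorPeerAccessAlreadyEnabled simply means some other component
        // in the same process (another framework, for instance) enabled it first.
        printf("enable %d->%d: %s\n", dev0, dev1, cudaGetErrorString(err));
    }
    return 0;
}
```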

Location of cudaEventRecord and overlapping ops from different streams

Posted by 有些话、适合烂在心里 on 2019-12-13 02:45:10
Question: I have two tasks. Each of them performs a copy to device (D), a kernel run (R), and a copy to host (H). I am overlapping the copy to device of task 2 (D2) with the kernel run of task 1 (R1). In addition, I am overlapping the kernel run of task 2 (R2) with the copy to host of task 1 (H1). I also record the start and stop times of the D, R, and H ops of each task using cudaEventRecord. I have a GeForce GT 555M, CUDA 4.1, and Fedora 16. I have three scenarios. Scenario 1: I use one stream for each task. I place the start/stop …
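For reference, the kind of instrumentation being described, two streams with events recorded around the D, R, and H steps of each task, looks roughly like the sketch below (the kernel and sizes are illustrative; pinned host memory is what allows the async copies to overlap at all). Where exactly the cudaEventRecord calls are placed matters, because on some devices the events themselves can serialize operations that would otherwise run concurrently, which appears to be what the scenarios are probing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel standing in for the "run kernel" (R) step.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 22;
    const size_t BYTES = N * sizeof(float);

    float *h[2], *d[2];
    cudaStream_t s[2];
    cudaEvent_t dStart[2], dStop[2], rStop[2], hStop[2];

    for (int t = 0; t < 2; ++t) {
        cudaMallocHost(&h[t], BYTES);        // pinned host memory, needed for real overlap
        cudaMalloc(&d[t], BYTES);
        cudaStreamCreate(&s[t]);
        cudaEventCreate(&dStart[t]);
        cudaEventCreate(&dStop[t]);
        cudaEventCreate(&rStop[t]);
        cudaEventCreate(&hStop[t]);
    }

    // One stream per task; D2 can overlap R1, and R2 can overlap H1.
    for (int t = 0; t < 2; ++t) {
        cudaEventRecord(dStart[t], s[t]);
        cudaMemcpyAsync(d[t], h[t], BYTES, cudaMemcpyHostToDevice, s[t]);  // D
        cudaEventRecord(dStop[t], s[t]);
        work<<<(N + 255) / 256, 256, 0, s[t]>>>(d[t], N);                  // R
        cudaEventRecord(rStop[t], s[t]);
        cudaMemcpyAsync(h[t], d[t], BYTES, cudaMemcpyDeviceToHost, s[t]);  // H
        cudaEventRecord(hStop[t], s[t]);
    }
    cudaDeviceSynchronize();

    for (int t = 0; t < 2; ++t) {
        float dMs = 0, rMs = 0, hMs = 0;
        cudaEventElapsedTime(&dMs, dStart[t], dStop[t]);
        cudaEventElapsedTime(&rMs, dStop[t], rStop[t]);
        cudaEventElapsedTime(&hMs, rStop[t], hStop[t]);
        printf("task %d: D %.2f ms, R %.2f ms, H %.2f ms\n", t + 1, dMs, rMs, hMs);
    }
    return 0;
}
```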