pycuda

pyCuda, issues sending multiple single variable arguments

Submitted by 给你一囗甜甜゛ on 2021-02-16 21:14:06
Question: I have a pycuda program here that reads in an image from the command line and saves a version back with the colors inverted:

    import pycuda.autoinit
    import pycuda.driver as device
    from pycuda.compiler import SourceModule as cpp
    import numpy as np
    import sys
    import cv2

    modify_image = cpp("""
    __global__ void modify_image(int pixelcount, unsigned char* inputimage, unsigned char* outputimage)
    {
        int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id >= pixelcount)
            return;
        outputimage[id] = 255 - inputimage[id];
    }
    """)
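
The failure mode behind the title is most likely scalar argument handling (a hedged note; the post is truncated before the launch code): PyCUDA cannot infer the width of a bare Python int, so scalar kernel arguments such as pixelcount must be passed as sized numpy types. A minimal launch sketch, reusing the kernel and imports quoted above; this is my own construction, not the poster's code:

    img = cv2.imread(sys.argv[1])                   # uint8 BGR image
    flat = img.flatten()
    out = np.empty_like(flat)
    func = modify_image.get_function("modify_image")
    blocks = (flat.size + 255) // 256
    # Scalars must be sized numpy types (np.int32 here); PyCUDA cannot
    # determine the byte width of a plain Python int.
    func(np.int32(flat.size), device.In(flat), device.Out(out),
         block=(256, 1, 1), grid=(blocks, 1))
    cv2.imwrite("inverted.png", out.reshape(img.shape))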

How to use Python to run pycuda in multiple processes

Submitted by 爱⌒轻易说出口 on 2021-02-11 12:36:34
Question: I have a pycuda program that runs fine in a single process. Can Python's multiprocessing run this code in multiple subprocesses? When I tried to implement a simple multi-process version with Python's Process, it failed.

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy
    from pycuda.compiler import SourceModule
    from multiprocessing import Pool, Manager, Process

    def ffunc(i, return_dict, a, b,
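
A minimal sketch of the usual workaround (my own construction, not the poster's code): a CUDA context created in the parent, here by pycuda.autoinit at import time, cannot be reused in forked children, so each subprocess must initialize CUDA itself, ideally with the spawn start method:

    import multiprocessing as mp
    import numpy as np

    def worker(i, return_dict):
        # Import (and therefore create the CUDA context) inside the child,
        # never before the fork/spawn.
        import pycuda.autoinit  # noqa: F401
        import pycuda.gpuarray as gpuarray
        a = gpuarray.to_gpu(np.full(4, i, dtype=np.float32))
        return_dict[i] = (a * 2).get()

    if __name__ == "__main__":
        mp.set_start_method("spawn")  # avoid inheriting CUDA state via fork
        manager = mp.Manager()
        return_dict = manager.dict()
        procs = [mp.Process(target=worker, args=(i, return_dict)) for i in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(return_dict))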

Getting started with shared memory on PyCUDA

Submitted by 大城市里の小女人 on 2021-02-08 10:35:59
Question: I'm trying to understand shared memory by playing with the following code:

    import pycuda.driver as drv
    import pycuda.tools
    import pycuda.autoinit
    import numpy
    from pycuda.compiler import SourceModule

    src='''
    __global__ void reduce0(float *g_idata, float *g_odata) {
        extern __shared__ float sdata[];
        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();
        // do reduction in shared mem
        for (unsigned int s=1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }
        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
    '''
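
One detail worth adding, since it is the part people usually miss with this example: the extern __shared__ array gets its size from the shared= argument of the PyCUDA launch. A minimal self-contained sketch (my own, with an assumed trivial kernel rather than reduce0):

    import numpy as np
    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *data) {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        sdata[tid] = data[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        data[blockIdx.x * blockDim.x + tid] = 2.0f * sdata[tid];
    }
    """)

    scale = mod.get_function("scale")
    n = 128
    a = np.arange(n, dtype=np.float32)
    # shared= gives the byte size of the dynamic shared memory array.
    scale(drv.InOut(a), block=(n, 1, 1), grid=(1, 1),
          shared=n * a.dtype.itemsize)
    print(a[:4])  # [0. 2. 4. 6.]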

PyTorch Cuda with anaconda not available

Submitted by 安稳与你 on 2021-01-29 10:37:00
Question: I'm using anaconda to manage my environment, and for a project I have to use my GPU for network training. I use PyTorch for this project and I'm trying to get CUDA working. I installed cudatoolkit, numba, and cudnn; still, when I try the command torch.cuda.is_available() I get "False" as output. This is my environment:

    # Name            Version     Build        Channel
    blas              1.0         mkl
    bzip2             1.0.6       h470a237_2   conda-forge
    ca-certificates   2018.03.07  0
    cairo             1.14.12     he6fea26_5   conda-forge
    certifi           2018.8.24   py35_1
    cffi              1.11.5
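
A hedged diagnostic sketch, not from the question: cudatoolkit, numba, and cudnn in the environment do not help if the torch package itself was built without CUDA, which is the most common cause of this symptom. Checking the installed build directly:

    import torch

    print(torch.__version__)          # which torch build is actually installed
    print(torch.version.cuda)         # None => a CPU-only build, reinstall needed
    print(torch.cuda.is_available())  # needs a CUDA build *and* a working driver

If torch.version.cuda is None, the usual remedy is reinstalling from the pytorch channel with a toolkit pinned, e.g. conda install pytorch cudatoolkit=10.2 -c pytorch (the version pin here is illustrative).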

PyCUDA: Pow within device code tries to use std::pow, fails

Submitted by ℡╲_俬逩灬. on 2020-11-29 05:50:10
Question: The title more or less says it all. Calling a host function ("std::pow<int, int>") from a __device__/__global__ function ("_calc_psd") is not allowed. From my understanding, this should be using the CUDA pow function instead, but it isn't.

Answer 1: The error is exactly as the compiler reported. You can't use host functions in device code, and that includes the whole host C++ standard library. CUDA includes its own standard library, described in the programming guide, but you should use either pow or powf.
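
A minimal sketch of the fix (the kernel here is my own stand-in, not the asker's _calc_psd): call the CUDA math library's powf for float, or pow for double, inside device code:

    import numpy as np
    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void square(float *x, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            x[i] = powf(x[i], 2.0f);  // CUDA device function, not std::pow
    }
    """)

    square = mod.get_function("square")
    x = np.arange(8, dtype=np.float32)
    square(drv.InOut(x), np.int32(x.size), block=(8, 1, 1), grid=(1, 1))
    print(x)  # [ 0. 1. 4. 9. 16. 25. 36. 49.]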

TensorRT multiple Threads

Submitted by 跟風遠走 on 2020-08-10 19:30:08
Question: I am trying to use TensorRT via the Python API, in multiple threads where the CUDA context is shared by all the threads (everything works fine in a single thread). I am using Docker with the tensorrt:20.06-py3 image, an ONNX model, and an Nvidia 1070 GPU. The multi-threaded approach should be allowed, as mentioned in TensorRT Best Practices. I created the context in the main thread:

    cuda.init()
    device = cuda.Device(0)
    ctx = device.make_context()

I tried two methods,
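
A hedged sketch of the pattern that usually works (run_inference is a hypothetical placeholder, not the poster's code): a single context made in the main thread can be shared, but each worker thread must push it before use and pop it afterwards:

    import threading
    import pycuda.driver as cuda

    cuda.init()
    ctx = cuda.Device(0).make_context()  # created (and made current) in main thread
    ctx.pop()                            # unbind it so worker threads can push it

    def worker():
        ctx.push()                       # bind the shared context to this thread
        try:
            pass  # run_inference(...)   # hypothetical TensorRT execution call
        finally:
            ctx.pop()                    # always unbind before the thread exits

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    ctx.detach()                         # release the context at shutdown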

get "LogicError: explicit_context_dependent failed: invalid device context - no currently active context?" when running TensorRT in ROS

Submitted by ⅰ亾dé卋堺 on 2020-04-16 05:45:20
Question: I have inference code in TensorRT (with Python). I want to run this code in ROS, but I get the error below when trying to allocate a buffer:

    LogicError: explicit_context_dependent failed: invalid device context - no currently active context?

The code works well outside the ROS package. A ROS node publishes an image, and the given code gets the image to do inference. The inference code is shown below:

    #!/usr/bin/env python
    # Revision $Id$
    import rospy
    from std_msgs.msg import String
    from cv_bridge
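
The usual cause (a hedged note, not from the truncated post): pycuda.autoinit creates the context in the main thread, while ROS delivers subscriber callbacks on a different thread where no context is current, so buffer allocation fails there. Making a context current in the thread that actually touches CUDA fixes it; allocate_buffers, load_engine, and do_inference below are hypothetical placeholders:

    import pycuda.driver as cuda

    cuda.init()

    def inference_worker():
        ctx = cuda.Device(0).make_context()  # this thread now has an active context
        try:
            pass  # engine = load_engine(...); bufs = allocate_buffers(engine); do_inference(...)
        finally:
            ctx.pop()  # release the context when the thread is done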

PyCUDA/CUDA: Causes of non-deterministic launch failures?

Submitted by 跟風遠走 on 2020-02-03 08:50:32
Question: Anyone following CUDA will probably have seen a few of my queries regarding a project I'm involved in, but for those who haven't, I'll summarize. (Sorry in advance for the long question.) There are three kernels: one generates a data set based on some input variables (it deals with bit combinations, so it can grow exponentially), another solves these generated linear systems, and a reduction kernel gets the final result out. These three kernels are run over and over again as part of an optimisation
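
A small debugging sketch, not from the question: kernel launches are asynchronous, so a bad launch often reports its error at a later, unrelated API call, which looks non-deterministic. Synchronizing after every launch pins the failure to the kernel that caused it; running the program under cuda-memcheck then usually identifies the out-of-bounds access behind such intermittent failures:

    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as drv

    def checked_launch(kernel, *args, **kwargs):
        """Launch a kernel, then synchronize so launch errors surface here."""
        kernel(*args, **kwargs)
        drv.Context.synchronize()  # a failed launch raises here, not later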

Rank of each element in a matrix row using CUDA

Submitted by 本秂侑毒 on 2020-01-26 04:17:05
Question: Is there any way to find the rank of each element in a matrix row separately using CUDA, or any functions for this provided by NVIDIA?

Answer 1: I don't know of a built-in ranking or argsort function in CUDA or any of the libraries I am familiar with. You could certainly build such a function out of lower-level operations, using thrust for example. Here is a (non-optimized) outline of a possible solution approach using thrust:

    $ cat t84.cu
    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
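
Since this page's other snippets are PyCUDA, here is an alternative non-optimized sketch in PyCUDA rather than thrust (my own construction, not from the answer): one block per row, one thread per element, with the rank computed as the count of strictly smaller elements in the row, O(n^2) per row:

    import numpy as np
    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void row_ranks(const float *m, int *ranks, int cols) {
        int row = blockIdx.x;
        int col = threadIdx.x;
        const float *r = m + row * cols;
        int rank = 0;
        for (int j = 0; j < cols; ++j)
            if (r[j] < r[col]) ++rank;  // strict comparison: ties share a rank
        ranks[row * cols + col] = rank;
    }
    """)

    row_ranks = mod.get_function("row_ranks")
    m = np.array([[3., 1., 2.], [9., 7., 8.]], dtype=np.float32)
    ranks = np.empty(m.shape, dtype=np.int32)
    row_ranks(drv.In(m), drv.Out(ranks), np.int32(m.shape[1]),
              block=(m.shape[1], 1, 1), grid=(m.shape[0], 1))
    print(ranks)  # [[2 0 1], [2 0 1]]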