gpgpu

CUDA PTX register declaration and usage

本秂侑毒 submitted on 2019-12-12 01:46:34
Question: I am trying to reduce the number of registers used in my kernel, so I decided to try inline PTX. This kernel:

#define Feedback(a, b, c, d, e) d^e^(a&c)^(a&e)^(b&c)^(b&e)^(c&d)^(d&e)^(a&d&e)^(a&c&e)^(a&b&d)^(a&b&c)

__global__ void Test(unsigned long a, unsigned long b, unsigned long c, unsigned long d, unsigned long e, unsigned long f, unsigned long j, unsigned long h, unsigned long* res)
{
    res[0] = Feedback( a, b, c, d, e );
    res[1] = Feedback( b, c, d, e, f );
    res[2] = Feedback( c, d, e, f, j
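For context (not from the question itself), here is a minimal sketch of how inline PTX is typically embedded in a CUDA kernel; the helper name and the use of unsigned long long are illustrative assumptions, not the poster's code:

// Hypothetical sketch: computing one AND/XOR term of the feedback
// function with inline PTX, declaring a scratch register explicitly.
// The "l" constraint binds a 64-bit integer operand.
__device__ unsigned long long and_xor(unsigned long long a,
                                      unsigned long long c,
                                      unsigned long long d)
{
    unsigned long long r;
    asm("{\n\t"
        ".reg .b64 t;\n\t"        // scratch register local to this asm block
        "and.b64 t, %1, %2;\n\t"  // t = a & c
        "xor.b64 %0, t, %3;\n\t"  // r = (a & c) ^ d
        "}"
        : "=l"(r)
        : "l"(a), "l"(c), "l"(d));
    return r;
}

Note that final register allocation is decided by ptxas, so declaring registers in inline PTX rarely lowers the reported register count by itself; compiler controls such as -maxrregcount or __launch_bounds__ are the usual knobs.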

Using magma_dysevd in a MATLAB MEX file

感情迁移 submitted on 2019-12-11 23:15:19
Question: I am trying to use the MAGMA library from MATLAB. Basically, I wrote a MEX function that incorporates C code calling MAGMA routines and then compiled it into a mexa64 file, so that I could use it in MATLAB. The MEX function source C code (called eig_magma) is below:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda_runtime_api.h>
#include <cublas.h>
// includes, project
#include "flops.h"
#include "magma.h"
#include "magma_lapack.h"
#include "testings
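As background (not the poster's code), here is a minimal sketch of the usual MEX-gateway structure around MAGMA's symmetric eigensolver, assuming the standard magma_dsyevd CPU-interface signature (jobz, uplo, n, A, lda, w, work, lwork, iwork, liwork, info) and a square, real, double-precision input; all names and the workspace-query pattern are illustrative:

// Hypothetical sketch: MEX gateway calling MAGMA's dsyevd.
#include "mex.h"
#include "magma.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    magma_init();                                       // start MAGMA / CUDA context

    mxArray *Acopy  = mxDuplicateArray(prhs[0]);        // dsyevd overwrites A, so work on a copy
    magma_int_t n   = (magma_int_t)mxGetM(prhs[0]);     // matrix order (column-major, as MAGMA expects)
    double     *A   = mxGetPr(Acopy);

    plhs[0]         = mxCreateDoubleMatrix(n, 1, mxREAL);
    double     *w   = mxGetPr(plhs[0]);                 // eigenvalues out

    // Workspace query: lwork = liwork = -1 returns the optimal sizes.
    double      work_query;
    magma_int_t iwork_query, info;
    magma_dsyevd(MagmaNoVec, MagmaLower, n, A, n, w,
                 &work_query, -1, &iwork_query, -1, &info);

    magma_int_t  lwork  = (magma_int_t)work_query;
    magma_int_t  liwork = iwork_query;
    double      *work   = (double *)mxMalloc(lwork * sizeof(double));
    magma_int_t *iwork  = (magma_int_t *)mxMalloc(liwork * sizeof(magma_int_t));

    magma_dsyevd(MagmaNoVec, MagmaLower, n, A, n, w,
                 work, lwork, iwork, liwork, &info);    // actual solve

    mxFree(work);
    mxFree(iwork);
    magma_finalize();
}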

/usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available

孤街浪徒 submitted on 2019-12-11 18:35:35
Question: When I run computecpp_info:

$ /usr/local/computecpp/bin/computecpp_info
/usr/local/computecpp/bin/computecpp_info: /usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available (required by /usr/local/computecpp/bin/computecpp_info)
/usr/local/computecpp/bin/computecpp_info: /usr/local/cuda-8.0/lib64/libOpenCL.so.1: no version information available (required by /usr/local/computecpp/bin/computecpp_info)
***********************************************************************

Do the threads in a CUDA warp execute in parallel on a multiprocessor?

邮差的信 submitted on 2019-12-11 16:15:43
Question: A warp is 32 threads. Do the 32 threads execute in parallel on a multiprocessor? If the 32 threads are not executing in parallel, then there is no race condition within the warp. I got this doubt after going through some examples.

Answer 1: In the CUDA programming model, all the threads within a warp run in parallel. But the actual execution in hardware may not be parallel, because the number of cores within an SM (Streaming Multiprocessor) can be less than 32. For example, the GT200 architecture has 8 cores
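For context (not part of the original thread), a minimal sketch, assuming a block of exactly 32 threads, of why lockstep execution should not be relied on for correctness: since Volta's independent thread scheduling, lanes of a warp may not advance together, so dependent shared-memory accesses need an explicit __syncwarp():

// Hypothetical sketch: intra-warp tree reduction that looks race-free
// under the classic "lockstep warp" mental model but needs __syncwarp()
// to be correct on architectures with independent thread scheduling.
__global__ void warp_sum(const int *in, int *out)
{
    __shared__ int partial[32];
    int lane = threadIdx.x & 31;

    partial[lane] = in[blockIdx.x * 32 + lane];
    __syncwarp();                           // make every lane's write visible

    for (int offset = 16; offset > 0; offset >>= 1) {
        int v = (lane < offset) ? partial[lane + offset] : 0;
        __syncwarp();                       // all reads done before any write
        if (lane < offset) partial[lane] += v;
        __syncwarp();                       // all writes done before next reads
    }
    if (lane == 0) out[blockIdx.x] = partial[0];
}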

Why are NVIDIA Pascal GPUs slow at running CUDA kernels when using cudaMallocManaged?

安稳与你 submitted on 2019-12-11 15:53:25
Question: I was testing the new CUDA 8 along with the Pascal Titan X GPU, expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04. Here is the minimal code that can reproduce the result:

CUDASample.cuh
class CUDASample{
public:
    void AddOneToVector(std::vector<int> &in);
};

CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const
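For context (not the poster's code), a minimal sketch of the usual remedy on Pascal: managed memory migrates on demand via page faults, which can dominate a short kernel's runtime, so prefetching the range with cudaMemPrefetchAsync before the launch avoids the fault-driven migration. The kernel and sizes below are illustrative:

// Hypothetical sketch: managed allocation with explicit prefetching.
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

void run(int n)
{
    int *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;              // touched on the host first

    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, n * sizeof(int), device);   // migrate pages to the GPU up front

    add_one<<<(n + 255) / 256, 256>>>(data, n);

    cudaMemPrefetchAsync(data, n * sizeof(int), cudaCpuDeviceId); // migrate back before host reads
    cudaDeviceSynchronize();

    cudaFree(data);
}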

x64 allows fewer threads per block than Win32?

99封情书 submitted on 2019-12-11 13:26:36
Question: When executing some CUDA kernels, I noticed that for many of my own kernels the x64 build would fail, whereas the Win32 build would not. I am very confused because the CUDA source code is the same and the build is fine; it is just that when the x64 build executes, it says it requests too many resources to launch. But shouldn't x64 conceptually allow more resources than Win32? I normally like to use 1024 threads per block if possible. So to make the x64 code work, I have to downsize the block to
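As background (not from the thread), the usual diagnosis is that in a 64-bit build pointers occupy two 32-bit registers, so per-thread register use rises and a 1024-thread block can exceed the SM's register file ("too many resources requested for launch"). A hedged sketch of the standard mitigation, with an illustrative kernel:

// Hypothetical sketch: __launch_bounds__ asks the compiler to keep
// register use low enough that a 1024-thread block can launch;
// cudaFuncGetAttributes reports what ptxas actually allocated.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void __launch_bounds__(1024)     // compile for 1024 threads/block
my_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void report_register_use()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);
}

The alternative knob is the nvcc option --maxrregcount, which caps register use globally for the compilation unit rather than per kernel.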

PyOpenCL returns errors the first run, then only 'invalid program' errors; examples also not working

家住魔仙堡 submitted on 2019-12-11 11:57:23
Question: I am trying to run an OpenCL kernel on the GPU using the pyOpenCL bindings. I was trying to load the kernel into my program. I ran my program once and got an error; I ran it again without changing the code and got a different, 'invalid program' error. This keeps happening with my own programs using pyOpenCL and also with example programs. I am able to use OpenCL through the C++ bindings, on both the CPU and GPU, with no problems. So I think this is a problem specific to the pyOpenCL

Several arithmetic operations parallelized in C++Amp

亡梦爱人 submitted on 2019-12-11 08:58:49
Question: I am trying to parallelize a convolution filter using C++Amp. I would like the following function to start working (I don't know how to do it properly):

float* pixel_color[] = new float [16];
concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array);
concurrency::array_view<float, 1> pixel(16, pixel_color); // I don't know which data structure to use here

parallel_for_each(
    pixels.extent, [=](concurrency::index<2> idx) restrict(amp)
{
    int row=idx[0];
    int col

Using pointers in C++Amp

心不动则不痛 submitted on 2019-12-11 08:49:02
Question: I've got the following issue: I have code which does a very basic operation. I am passing a pointer to a concurrency::array_view because I wanted to store the values earlier, to avoid the bottleneck in the function which uses multithreading. The problem is that the following construction won't compile:

parallel_for_each((*pixels).extent, [=](concurrency::index<2> idx) restrict(amp)
{
    int row=idx[0];
    int col=idx[1];
    (*pixels)(row, col) = (*pixels)(row, col) * (*taps)(row, col); //this is the

Can I use GPUDirect v2 Peer-to-Peer communication between two Quadro K1100M or two GeForce GT 745M?

依然范特西╮ submitted on 2019-12-11 08:46:04
Question: Can I use GPUDirect v2 Peer-to-Peer communication on a single PCIe bus between two mobile nVidia Quadro K1100M GPUs, or between two mobile nVidia GeForce GT 745M GPUs?

Answer 1: In general, if you want to find out whether GPUDirect Peer-to-Peer is supported between two GPUs, you can run the simple P2P CUDA sample code, or in your own code you can test the availability with the cudaDeviceCanAccessPeer runtime API call. Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU
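For context (not part of the original answer), a minimal sketch of checking and enabling P2P access between two hypothetical devices 0 and 1 with the CUDA runtime API:

// Hypothetical sketch: query peer access in both directions, then
// enable it before issuing direct peer copies or loads.
#include <cstdio>
#include <cuda_runtime.h>

bool enable_p2p(int dev0, int dev1)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, dev0, dev1);   // can dev0 reach dev1?
    cudaDeviceCanAccessPeer(&can10, dev1, dev0);   // and the reverse?
    if (!can01 || !can10) {
        printf("P2P not supported between devices %d and %d\n", dev0, dev1);
        return false;                              // fall back to staging through host memory
    }
    cudaSetDevice(dev0);
    cudaDeviceEnablePeerAccess(dev1, 0);           // flags argument must be 0
    cudaSetDevice(dev1);
    cudaDeviceEnablePeerAccess(dev0, 0);
    return true;                                   // cudaMemcpyPeer / direct access now allowed
}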