cuda

Artificially downgrade CUDA compute capabilities to simulate other hardware

你说的曾经没有我的故事 submitted on 2021-01-29 03:41:03
Question: I am developing software that should run on several CUDA GPUs of varying memory size and compute capability. It has happened to me more than once that a customer reported a reproducible problem on their GPU that I couldn't reproduce on my machine. Maybe because I have 8 GB of GPU memory and they have 4 GB, maybe because I have compute capability 3.0 and they have 2.0, things like that. Thus the question: can I temporarily "downgrade" my GPU so that it would pretend to be a lesser model, with
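The question is cut off, but a common two-part workaround is to build for the customer's architecture (e.g. nvcc -arch=compute_30 -code=sm_30, assuming a toolkit that still supports it) so the generated code matches theirs, and to pre-allocate a block of device memory at startup so the remaining free memory matches the smaller card. A minimal sketch of the memory-capping idea, with an illustrative 4 GB target (the helper name is made up):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: grab enough device memory that only `targetBytes`
// remain free, roughly simulating a smaller card. The reservation must
// stay allocated for the lifetime of the program.
static void* g_reservation = nullptr;

cudaError_t simulateSmallerGpu(size_t targetBytes) {
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) return err;
    if (freeBytes <= targetBytes) return cudaSuccess;  // already "small" enough
    return cudaMalloc(&g_reservation, freeBytes - targetBytes);
}

int main() {
    // Pretend to be a card with roughly 4 GB free (illustrative value).
    if (simulateSmallerGpu(4ull << 30) != cudaSuccess) {
        std::fprintf(stderr, "could not reserve device memory\n");
        return 1;
    }
    // ... run the application's normal allocation/kernel path here ...
    return 0;
}

This only approximates a smaller card's memory; it cannot change the reported compute capability, warp scheduler, or cache sizes.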

thrust copy_if: incomplete type is not allowed

南笙酒味 submitted on 2021-01-29 03:10:50
Question: I'm trying to use thrust::copy_if to compact an array with a predicate checking for positive numbers.

header file file.h:

struct is_positive {
    __host__ __device__ bool operator()(const int x) { return (x >= 0); }
};

and file.cu:

#include "../headers/file.h"
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

void compact(int* d_inputArray, int* d_outputArray, const int size) {
    thrust::device_ptr<int> t_inputArray(d_inputArray);
    thrust::device_ptr<int> t
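The snippet is truncated, but "incomplete type is not allowed" with Thrust usually means a translation unit that pulls in Thrust's device headers is being compiled by the host C++ compiler rather than by nvcc; everything that includes those headers must live in a .cu file. A complete sketch of the function, with the ending of compact() guessed from the visible code:

// file.cu -- compiled with nvcc; the copy_if call is my guessed completion
#include "../headers/file.h"
#include <thrust/device_ptr.h>
#include <thrust/copy.h>

void compact(int* d_inputArray, int* d_outputArray, const int size) {
    thrust::device_ptr<int> t_inputArray(d_inputArray);
    thrust::device_ptr<int> t_outputArray(d_outputArray);
    // copy_if returns an iterator one past the last element written,
    // so the number of kept elements is (end - t_outputArray).
    thrust::copy_if(t_inputArray, t_inputArray + size,
                    t_outputArray, is_positive());
}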

CUDA nvcc building chain of libraries

心不动则不痛 submitted on 2021-01-29 00:16:39
Question: My goal: library2.so uses library1.so, and mycode.o links against library2.so (and maybe library1.so). The source code (one-line header files omitted):

library1.cu:

__device__ void func1_lib1(void) {}

library2.cu:

#include "library1.h"
__global__ void func1_lib2(void) { func1_lib1(); }
extern "C" void func2_lib2(void) { func1_lib2<<<1,1>>>(); }

mycode.c:

#include "library2.h"
int main(void) { func2_lib2(); }

I'm building the shared libraries according to with
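The build steps are cut off, but calling a __device__ function defined in another file requires relocatable device code (nvcc -dc), and the device-link step cannot resolve symbols across a shared-library boundary, so library1's device code has to be device-linked into the library that launches the kernel. A sketch of the commands, assuming Linux (paths and exact flags are illustrative and vary by toolkit version):

# Compile both files as relocatable, position-independent device code
nvcc -dc -Xcompiler -fPIC library1.cu -o library1.o
nvcc -dc -Xcompiler -fPIC library2.cu -o library2.o

# Device link: must see every object contributing device code,
# because device symbols cannot cross a .so boundary
nvcc -dlink -Xcompiler -fPIC library1.o library2.o -o dlink.o

# Build the shared libraries; library2.so carries the device-linked code
nvcc -shared library1.o -o library1.so
nvcc -shared library2.o library1.o dlink.o -o library2.so

# Compile and link the plain C program against library2.so and the CUDA runtime
gcc mycode.c -o myapp ./library2.so -L/usr/local/cuda/lib64 -lcudart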

From non-coalesced to coalesced memory access in CUDA

霸气de小男生 submitted on 2021-01-28 18:53:36
Question: I was wondering if there is any simple way to transform a non-coalesced memory access into a coalesced one. Take this array as an example: dW[[w0,w1,w2][w3,w4,w5][w6,w7][w8,w9]] Now, I know that if Thread 0 in block 0 accesses dW[0] and then Thread 1 in block 0 accesses dW[1], that's a coalesced access to global memory. The problem is that I have two operations. The first one is coalesced as described above. But the second one isn't, because Thread 1 in block 0 needs to do an operation
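The question is truncated, but the contrast it describes can be illustrated with two small kernels (not the poster's actual code). In the first, neighboring threads touch neighboring words, so a warp's loads combine into few memory transactions; in the second, neighboring threads are a stride apart and the loads do not combine:

__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // warp reads one contiguous segment
}

__global__ void stridedCopy(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];        // warp reads scattered words
}

The usual remedy is to stage data through shared memory: load a tile with the coalesced pattern, __syncthreads(), then do the irregular indexing against shared memory, where bank conflicts are far cheaper than uncoalesced global loads.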

Why doesn't CUDA result in a speedup in this C++ code?

笑着哭i submitted on 2021-01-28 09:29:23
Question: I'm using VS2019 and have an NVIDIA GeForce GPU. I tried the code from this link: https://towardsdatascience.com/writing-lightning-fast-code-with-cuda-c18677dcdd5f The author of that post claims to get a speedup when using CUDA. For me, however, the serial version takes around 7 milliseconds while the CUDA version takes around 28 milliseconds. Why is CUDA slower for this code? The code I used is below:

__global__ void add(int n, float* x, float* y) {
    int index = blockIdx.x * blockDim.x +
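The kernel is cut off above; introductory CUDA material of this kind uses the standard grid-stride-loop add, so the full kernel presumably looks like this:

__global__ void add(int n, float* x, float* y) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];             // elementwise add, one pass over memory
}

An elementwise add does almost no arithmetic per byte, so the GPU's advantage is small to begin with, and a single timed run also pays one-time costs: CUDA context creation on the first API call and, with unified memory, page migration on first touch. Warming up with an untimed launch and timing many iterations (e.g. with cudaEvent timers around the kernel only) usually changes the picture.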

Bit twiddle help: Expanding bits to follow a given bitmask

六月ゝ 毕业季﹏ submitted on 2021-01-28 04:32:28
Question: I'm interested in a fast method for "expanding bits," which can be defined as follows: Let B be a binary number with n bits, i.e. B ∈ {0,1}^n. Let P be the positions of all 1/true bits in B, i.e. (1 << P[i]) & B != 0, with |P| = k. For another given number A ∈ {0,1}^k, let Ap be the bit-expanded form of A given B, such that bit j of A is moved to bit position P[j], i.e. Ap |= A[j] << P[j]. The result of the "bit expansion" is Ap. A couple of examples: Given B = 00101110 and A = 0110, Ap should be 00001100. Given B =
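This operation is exactly the "parallel bit deposit" that BMI2-capable x86 CPUs provide as a single PDEP instruction; elsewhere (older CPUs, or CUDA device code) a loop over the mask's set bits does the same job. A sketch in C, using the question's first example as the test case and treating A's least-significant bit as A[0]:

#include <stdint.h>
#ifdef __BMI2__
#include <immintrin.h>
#endif

// Deposit the low bits of `a` into the positions of the 1 bits of `mask`.
// With mask = 0x2E (00101110) and a = 0x6 (0110) this returns 0x0C
// (00001100), matching the question's first example.
uint32_t expand_bits(uint32_t a, uint32_t mask) {
#ifdef __BMI2__
    return _pdep_u32(a, mask);            // single instruction on BMI2 x86
#else
    uint32_t result = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & -mask;   // isolate lowest set bit of mask
        if (a & bit)
            result |= lowest;             // place the next bit of a there
        mask &= mask - 1;                 // clear that mask bit
    }
    return result;
#endif
}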

Sending the same data to N GPUs

陌路散爱 submitted on 2021-01-28 02:05:58
Question: I have 4 GPUs hung off the same PCIe switch (a PLX PEX 8747) on a Haswell-based system. I want to send the same data to each GPU. Is it possible for the PCIe switch to replicate the data to N targets rather than perform N separate transfers? In effect, is it possible to broadcast data to N GPUs over the PCIe bus? I was also wondering how SLI / Crossfire handles such issues, since I can imagine large amounts of data being identical for each GPU in a given scene being rendered. I remember reading
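CUDA exposes no PCIe broadcast primitive, so the practical fallback is one asynchronous host-to-device copy per GPU, issued on separate streams from pinned host memory so the transfers can at least overlap. A sketch (the function and array names are made up):

#include <cuda_runtime.h>

// Copy the same pinned host buffer to every GPU, overlapping transfers.
// hostBuf should come from cudaHostAlloc so the copies are truly async.
void broadcastToGpus(const void* hostBuf, size_t bytes,
                     void* devBufs[], cudaStream_t streams[], int nGpus) {
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaMemcpyAsync(devBufs[i], hostBuf, bytes,
                        cudaMemcpyHostToDevice, streams[i]);
    }
    for (int i = 0; i < nGpus; ++i) {   // wait for all transfers to finish
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
}

Whether the copies actually run concurrently depends on the host's PCIe topology: four transfers funneled through one x16 upstream link will end up sharing its bandwidth.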

CMake 3.11 Linking CUBLAS

好久不见. submitted on 2021-01-27 21:03:54
Question: How do I correctly link to CUBLAS in CMake 3.11? In particular, I'm trying to create a CMakeLists file for this code. CMakeLists file so far:

cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
project(cmake_and_cuda LANGUAGES CXX CUDA)
add_executable(mmul_2 mmul_2.cu)

This gives multiple "undefined reference" errors to cublas and curand. Answer 1: Found the solution, which is to add this line at the end of the CMakeLists file: target_link_libraries(mmul_2 -lcublas -lcurand) Source: https://stackoverflow
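Putting the question and the answer together, the whole CMakeLists.txt would be the sketch below. (On CMake 3.17 and later, find_package(CUDAToolkit) with the imported targets CUDA::cublas and CUDA::curand is the cleaner way, but that module does not exist in 3.11, so raw linker flags are used here.)

cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
project(cmake_and_cuda LANGUAGES CXX CUDA)

add_executable(mmul_2 mmul_2.cu)

# Raw linker flags; with CUDA enabled as a language, the toolkit's
# library directory is on the link path, so -lcublas/-lcurand resolve.
target_link_libraries(mmul_2 -lcublas -lcurand)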