cuda

Artificially downgrade CUDA compute capabilities to simulate other hardware

你说的曾经没有我的故事 submitted on 2021-01-29 03:41:03
Question: I am developing software that should run on several CUDA GPUs of varying memory size and compute capability. It has happened to me more than once that a customer reported a reproducible problem on their GPU that I couldn't reproduce on my machine. Maybe because I have 8 GB of GPU memory and they have 4 GB, maybe because I have compute capability 3.0 and they have 2.0, things like that. Thus the question: can I temporarily "downgrade" my GPU so that it would pretend to be a lesser model, with
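The question is cut off, but a common two-part workaround is to build for the customer's architecture (e.g. nvcc -arch=compute_30 -code=sm_30, assuming a toolkit that still supports it) so the generated code matches theirs, and to pre-allocate a block of device memory at startup so the remaining free memory matches the smaller card. A minimal sketch of the memory-capping idea, with an illustrative 4 GB target (the helper name is made up):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: grab enough device memory that only `targetBytes`
// remain free, roughly simulating a smaller card. The reservation must
// stay allocated for the lifetime of the program.
static void* g_reservation = nullptr;

cudaError_t simulateSmallerGpu(size_t targetBytes) {
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) return err;
    if (freeBytes <= targetBytes) return cudaSuccess;  // already "small" enough
    return cudaMalloc(&g_reservation, freeBytes - targetBytes);
}

int main() {
    // Pretend to be a card with roughly 4 GB free (illustrative value).
    if (simulateSmallerGpu(4ull << 30) != cudaSuccess) {
        std::fprintf(stderr, "could not reserve device memory\n");
        return 1;
    }
    // ... run the application's normal allocation/kernel path here ...
    return 0;
}

This only approximates a smaller card's memory; it cannot change the reported compute capability, warp scheduler, or cache sizes.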

thrust copy_if: incomplete type is not allowed

南笙酒味 submitted on 2021-01-29 03:10:50
Question: I'm trying to use thrust::copy_if to compact an array with a predicate checking for positive numbers.

header file file.h:

struct is_positive {
    __host__ __device__ bool operator()(const int x) { return (x >= 0); }
};

and file.cu:

#include "../headers/file.h"
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

void compact(int* d_inputArray, int* d_outputArray, const int size) {
    thrust::device_ptr<int> t_inputArray(d_inputArray);
    thrust::device_ptr<int> t
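The snippet is truncated, but "incomplete type is not allowed" with Thrust usually means a translation unit that pulls in Thrust's device headers is being compiled by the host C++ compiler rather than by nvcc; everything that includes those headers must live in a .cu file. A complete sketch of the function, with the ending of compact() guessed from the visible code:

// file.cu -- compiled with nvcc; the copy_if call is my guessed completion
#include "../headers/file.h"
#include <thrust/device_ptr.h>
#include <thrust/copy.h>

void compact(int* d_inputArray, int* d_outputArray, const int size) {
    thrust::device_ptr<int> t_inputArray(d_inputArray);
    thrust::device_ptr<int> t_outputArray(d_outputArray);
    // copy_if returns an iterator one past the last element written,
    // so the number of kept elements is (end - t_outputArray).
    thrust::copy_if(t_inputArray, t_inputArray + size,
                    t_outputArray, is_positive());
}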

CUDA nvcc building chain of libraries

心不动则不痛 submitted on 2021-01-29 00:16:39
Question: My goal: library2.so uses library1.so, and mycode.o links against library2.so (and maybe library1.so). The source code (one-line header files omitted):

library1.cu:

__device__ void func1_lib1(void) {}

library2.cu:

#include "library1.h"
__global__ void func1_lib2(void) { func1_lib1(); }
extern "C" void func2_lib2(void) { func1_lib2<<<1,1>>>(); }

mycode.c:

#include "library2.h"
int main(void) { func2_lib2(); }

I'm building the shared libraries according to with
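The build steps are cut off, but calling a __device__ function defined in another file requires relocatable device code (nvcc -dc), and the device-link step cannot resolve symbols across a shared-library boundary, so library1's device code has to be device-linked into the library that launches the kernel. A sketch of the commands, assuming Linux (paths and exact flags are illustrative and vary by toolkit version):

# Compile both files as relocatable, position-independent device code
nvcc -dc -Xcompiler -fPIC library1.cu -o library1.o
nvcc -dc -Xcompiler -fPIC library2.cu -o library2.o

# Device link: must see every object contributing device code,
# because device symbols cannot cross a .so boundary
nvcc -dlink -Xcompiler -fPIC library1.o library2.o -o dlink.o

# Build the shared libraries; library2.so carries the device-linked code
nvcc -shared library1.o -o library1.so
nvcc -shared library2.o library1.o dlink.o -o library2.so

# Compile and link the plain C program against library2.so and the CUDA runtime
gcc mycode.c -o myapp ./library2.so -L/usr/local/cuda/lib64 -lcudart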

From non-coalesced to coalesced memory access in CUDA

霸气de小男生 submitted on 2021-01-28 18:53:36
Question: I was wondering if there is any simple way to transform a non-coalesced memory access into a coalesced one. Take this array as an example: dW[[w0,w1,w2][w3,w4,w5][w6,w7][w8,w9]] Now, I know that if Thread 0 in block 0 accesses dW[0] and then Thread 1 in block 0 accesses dW[1], that's a coalesced access to global memory. The problem is that I have two operations. The first one is coalesced as described above. But the second one isn't, because Thread 1 in block 0 needs to do an operation
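The question is truncated, but the contrast it describes can be illustrated with two small kernels (not the poster's actual code). In the first, neighboring threads touch neighboring words, so a warp's loads combine into few memory transactions; in the second, neighboring threads are a stride apart and the loads do not combine:

__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // warp reads one contiguous segment
}

__global__ void stridedCopy(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];        // warp reads scattered words
}

The usual remedy is to stage data through shared memory: load a tile with the coalesced pattern, __syncthreads(), then do the irregular indexing against shared memory, where bank conflicts are far cheaper than uncoalesced global loads.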

Why doesn't CUDA result in a speedup in this C++ code?

笑着哭i submitted on 2021-01-28 09:29:23
Question: I'm using VS2019 and have an NVIDIA GeForce GPU. I tried the code from this link: https://towardsdatascience.com/writing-lightning-fast-code-with-cuda-c18677dcdd5f The author of that post claims to get a speedup when using CUDA. For me, however, the serial version takes around 7 milliseconds while the CUDA version takes around 28 milliseconds. Why is CUDA slower for this code? The code I used is below:

__global__ void add(int n, float* x, float* y) {
    int index = blockIdx.x * blockDim.x +
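The kernel is cut off above; introductory CUDA material of this kind uses the standard grid-stride-loop add, so the full kernel presumably looks like this:

__global__ void add(int n, float* x, float* y) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];             // elementwise add, one pass over memory
}

An elementwise add does almost no arithmetic per byte, so the GPU's advantage is small to begin with, and a single timed run also pays one-time costs: CUDA context creation on the first API call and, with unified memory, page migration on first touch. Warming up with an untimed launch and timing many iterations (e.g. with cudaEvent timers around the kernel only) usually changes the picture.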

Bit twiddle help: Expanding bits to follow a given bitmask

六月ゝ 毕业季﹏ submitted on 2021-01-28 04:32:28
Question: I'm interested in a fast method for "expanding bits," which can be defined as follows: Let B be a binary number with n bits, i.e. B ∈ {0,1}^n. Let P be the positions of all 1/true bits in B, i.e. (1 << P[i]) & B != 0, with |P| = k. For another given number A ∈ {0,1}^k, let Ap be the bit-expanded form of A given B, such that bit j of A is moved to bit position P[j], i.e. Ap |= A[j] << P[j]. The result of the "bit expansion" is Ap. A couple of examples: Given B = 00101110 and A = 0110, Ap should be 00001100. Given B =
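This operation is exactly the "parallel bit deposit" that BMI2-capable x86 CPUs provide as a single PDEP instruction; elsewhere (older CPUs, or CUDA device code) a loop over the mask's set bits does the same job. A sketch in C, using the question's first example as the test case and treating A's least-significant bit as A[0]:

#include <stdint.h>
#ifdef __BMI2__
#include <immintrin.h>
#endif

// Deposit the low bits of `a` into the positions of the 1 bits of `mask`.
// With mask = 0x2E (00101110) and a = 0x6 (0110) this returns 0x0C
// (00001100), matching the question's first example.
uint32_t expand_bits(uint32_t a, uint32_t mask) {
#ifdef __BMI2__
    return _pdep_u32(a, mask);            // single instruction on BMI2 x86
#else
    uint32_t result = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & -mask;   // isolate lowest set bit of mask
        if (a & bit)
            result |= lowest;             // place the next bit of a there
        mask &= mask - 1;                 // clear that mask bit
    }
    return result;
#endif
}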

Sending the same data to N GPUs

陌路散爱 submitted on 2021-01-28 02:05:58
Question: I have 4 GPUs hung off the same PCIe switch (a PLX PEX 8747) on a Haswell-based system. I want to send the same data to each GPU. Is it possible for the PCIe switch to replicate the data to N targets rather than perform N separate transfers? In effect, is it possible to broadcast data to N GPUs over the PCIe bus? I was also wondering how SLI / Crossfire handles such issues, since I can imagine large amounts of data being identical for each GPU in a given scene being rendered. I remember reading
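CUDA exposes no PCIe broadcast primitive, so the practical fallback is one asynchronous host-to-device copy per GPU, issued on separate streams from pinned host memory so the transfers can at least overlap. A sketch (the function and array names are made up):

#include <cuda_runtime.h>

// Copy the same pinned host buffer to every GPU, overlapping transfers.
// hostBuf should come from cudaHostAlloc so the copies are truly async.
void broadcastToGpus(const void* hostBuf, size_t bytes,
                     void* devBufs[], cudaStream_t streams[], int nGpus) {
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaMemcpyAsync(devBufs[i], hostBuf, bytes,
                        cudaMemcpyHostToDevice, streams[i]);
    }
    for (int i = 0; i < nGpus; ++i) {   // wait for all transfers to finish
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
}

Whether the copies actually run concurrently depends on the host's PCIe topology: four transfers funneled through one x16 upstream link will end up sharing its bandwidth.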

CMake 3.11 Linking CUBLAS

好久不见. submitted on 2021-01-27 21:03:54
Question: How do I correctly link to CUBLAS in CMake 3.11? In particular, I'm trying to create a CMakeLists file for this code. CMakeLists file so far:

cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
project(cmake_and_cuda LANGUAGES CXX CUDA)
add_executable(mmul_2 mmul_2.cu)

This gives multiple "undefined reference" errors to cublas and curand. Answer 1: Found the solution, which is to add this line at the end of the CMakeLists file: target_link_libraries(mmul_2 -lcublas -lcurand) Source: https://stackoverflow
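Putting the question and the answer together, the whole CMakeLists.txt would be the sketch below. (On CMake 3.17 and later, find_package(CUDAToolkit) with the imported targets CUDA::cublas and CUDA::curand is the cleaner way, but that module does not exist in 3.11, so raw linker flags are used here.)

cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
project(cmake_and_cuda LANGUAGES CXX CUDA)

add_executable(mmul_2 mmul_2.cu)

# Raw linker flags; with CUDA enabled as a language, the toolkit's
# library directory is on the link path, so -lcublas/-lcurand resolve.
target_link_libraries(mmul_2 -lcublas -lcurand)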