cuda | 易学教程

shuffle intrinsics with non-default mask providing data from inactive threads to active threads

Question: I'm using CUDA 9 on a Pascal architecture, trying to implement a reasonable block reduction using warp shuffle intrinsics plus a shared memory intermediate step. Examples I've seen on the web: "Using CUDA Warp Level Primitives" and "Faster Parallel Reductions -- Kepler". The first of those links illustrates the shuffle intrinsics with _sync, and how to use __ballot_sync(), but only goes as far as a single warp reduction. The second of those links is a Kepler-era article that doesn't use the newer

Array of structs of arrays CUDA C

Question: I'm fairly new to CUDA and I've been looking around for a way to create an array of structs of arrays. I found a couple of solutions, but none gives me a clear idea. Here Harrism explained pass-by-value for a struct, which works fine, but when I try to add this approach to it I get an illegal memory access. What I'm trying to achieve is an array of structs, each struct with a pointer to a dynamically allocated array populated on the host, and my kernel being able to read values of the array from the desired

Don't understand why column addition faster than row in CUDA

Question: I started with CUDA and wrote two kernels as an experiment. They both accept 3 pointers to arrays of n*n elements (matrix emulation) and n. __global__ void th_single_row_add(float* a, float* b, float* c, int n) { int idx = blockDim.x * blockIdx.x * n + threadIdx.x * n; for (int i = 0; i < n; i ++) { if (idx + i >= n*n) return; c[idx + i] = a[idx + i] + b[idx + i]; } } __global__ void th_single_col_add(float* a, float* b, float* c, int n) { int idx = blockDim.x * blockIdx.x + threadIdx.x; for (int i = 0;
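
A likely explanation for the timing difference in the title is memory coalescing: neighbouring threads in a warp are served most efficiently when they touch neighbouring addresses in the same loop iteration. The host-side sketch below counts how many 128-byte segments one warp of 32 threads would touch per iteration under each indexing scheme. The function names are hypothetical, the 128-byte granularity is an illustrative choice, and the column kernel's access pattern (`idx + i * n`) is an assumption because the excerpt is cut off before its loop body.

```cpp
#include <cstddef>
#include <set>

// Element touched by global thread `tid` at loop iteration `i` in the
// row-per-thread kernel from the question: idx = tid * n, access idx + i.
inline std::size_t row_kernel_index(std::size_t tid, std::size_t i, std::size_t n) {
    return tid * n + i;
}

// ASSUMPTION: the column kernel is truncated in the excerpt. This models the
// usual column-per-thread pattern: idx = tid, access idx + i * n.
inline std::size_t col_kernel_index(std::size_t tid, std::size_t i, std::size_t n) {
    return tid + i * n;
}

// Number of distinct 128-byte segments that the 32 threads of the first warp
// touch at iteration `i`, for float elements. Fewer segments means the
// accesses of the warp can be combined into fewer memory transactions.
template <typename IndexFn>
std::size_t segments_per_warp(IndexFn index, std::size_t i, std::size_t n) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        segments.insert(index(lane, i, n) * sizeof(float) / 128);
    }
    return segments.size();
}
```

For n = 1024, the row-per-thread indexing spreads one warp's accesses over 32 separate segments per iteration, while the assumed column-per-thread indexing keeps them in a single segment.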

No GPU activities in profiling with nvprof

Question: I ran nvprof.exe on a function that initializes data, calls three kernels, and frees the data. Everything was profiled as it should be and I got a result like this: ==7956== Profiling application: .\a.exe ==7956== Profiling result: GPU activities: 52.34% 25.375us 1 25.375us 25.375us 25.375us th_single_row_add(float*, float*, float*) 43.57% 21.120us 1 21.120us 21.120us 21.120us th_single_col_add(float*, float*, float*) 4.09% 1.9840us 1 1.9840us 1.9840us 1.9840us th_single_elem_add(float*, float*, float*) API

Change cuda::GpuMat values through custom kernel

Question: I am using a kernel to "loop" over a live camera stream to highlight specific color regions. These cannot always be reconstructed with some cv::threshold calls, therefore I am using a kernel. The current kernel is as follows: __global__ void customkernel(unsigned char* input, unsigned char* output, int width, int height, int colorWidthStep, int outputWidthStep) { const int xIndex = blockIdx.x * blockDim.x + threadIdx.x; const int yIndex = blockIdx.y * blockDim.y + threadIdx.y; if ((xIndex <

How to enable C++17 code generation in VS2019 CUDA project

Question: I am moving some code from VS2017 on one PC to another PC with VS2019. Everything is fine except that I cannot use std::filesystem. In my former code, I was using C++14 and had std::experimental::filesystem. In the new code, I want to move to C++17, so I changed to std::filesystem (as shown in my code below). The weird thing is that IntelliSense (not sure it is the right name of the thing) shows no error. It even displays filesystem when I type std::f... But the code won't build and give the

Calculate the angle between two triangles in CUDA

Question: I wanted to calculate the angle between two triangles in 3D space. The two triangles will always share exactly two points, e.g. Triangle 1: Point1 (x1, y1, z1), Point2 (x2, y2, z2), Point3 (x3, y3, z3). Triangle 2: Point1 (x1, y1, z1), Point2 (x2, y2, z2), Point4 (x4, y4, z4). Is there a way to calculate the angle between them efficiently in CUDA? Answer 1: For each plane, you need to construct its normal vector (perpendicular to all lines in that plane). The simple way to do that is to take the
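
The answer excerpt is cut off, so the following is only a sketch of one common way to finish the idea, not necessarily what the answer goes on to say. It assumes each normal is built as the cross product of the shared edge with one other edge of the triangle, and that the angle is taken between the two normals with `atan2`. `Vec3`, `HD`, and the function names are hypothetical helpers introduced for illustration; the `HD` macro lets the same functions compile as plain C++ or as `__host__ __device__` functions under nvcc.

```cpp
#include <cmath>

// Expands to __host__ __device__ under nvcc and to nothing under a plain
// C++ compiler, so the helpers below can be called from a kernel as well.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

struct Vec3 { float x, y, z; };  // hypothetical helper type

HD inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
HD inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
HD inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Angle in radians, in [0, pi], between triangles (p1, p2, p3) and
// (p1, p2, p4), which share the edge p1-p2. Both normals are built from the
// same edge vector, so the result is 0 when the triangles are folded onto
// each other and pi when they lie flat on opposite sides of the shared edge.
HD inline float triangle_angle(Vec3 p1, Vec3 p2, Vec3 p3, Vec3 p4) {
    Vec3 e = sub(p2, p1);             // shared edge
    Vec3 n1 = cross(e, sub(p3, p1));  // normal of triangle 1 (not normalized)
    Vec3 n2 = cross(e, sub(p4, p1));  // normal of triangle 2 (not normalized)
    Vec3 c = cross(n1, n2);
    // atan2(|n1 x n2|, n1 . n2) needs no normalization of n1 and n2 and is
    // better conditioned than acos near 0 and pi.
    return std::atan2(std::sqrt(dot(c, c)), dot(n1, n2));
}
```

In a kernel, each thread could call `triangle_angle` on its own pair of triangles; no shared memory or synchronization is needed because the computation is independent per pair.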

Run-time GPU or CPU execution?

Question: I feel like there has to be a way to write code such that it can run on either the CPU or the GPU. That is, I want to write something that has, for example, a CPU FFT implementation that can be executed if there is no GPU, but defaults to a GPU FFT when the GPU is present. I haven't been able to craft the right question to get the interwebs to offer up a solution. My application target has GPUs available. We want to write certain functions to use the GPUs. However, our development VMs are a
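
One way to approach the run-time choice described in this question is to probe for a usable device once and route calls through a small dispatcher. The sketch below is a minimal illustration under stated assumptions, not an answer taken from the thread: `Backend`, `gpu_present`, `select_backend`, and `dispatch` are hypothetical names, and the only CUDA runtime call used is `cudaGetDeviceCount`. When the file is compiled without nvcc, the probe simply reports that no GPU is available.

```cpp
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

enum class Backend { Cpu, Gpu };  // hypothetical tag for the chosen path

// True if the CUDA runtime reports at least one device. In a build without
// nvcc this always returns false, so the CPU path is taken.
inline bool gpu_present() {
#ifdef __CUDACC__
    int count = 0;
    return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
#else
    return false;
#endif
}

inline Backend select_backend() {
    return gpu_present() ? Backend::Gpu : Backend::Cpu;
}

// Calls gpu_impl when a device is present and cpu_impl otherwise.
// Both callables must return the same type.
template <typename CpuFn, typename GpuFn>
auto dispatch(CpuFn cpu_impl, GpuFn gpu_impl) -> decltype(cpu_impl()) {
    return select_backend() == Backend::Gpu ? gpu_impl() : cpu_impl();
}
```

A caller would wrap its two FFT implementations in callables and pass both to `dispatch`; caching the result of `select_backend()` avoids repeating the device query on every call.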

Artificially downgrade CUDA compute capabilities to simulate other hardware

Question: I am developing software that should run on several CUDA GPUs with varying amounts of memory and compute capability. It has happened to me more than once that customers reported a reproducible problem on their GPU that I couldn't reproduce on my machine. Maybe because I have 8 GB of GPU memory and they have 4 GB, maybe because of compute capability 3.0 rather than 2.0, things like that. Thus the question: can I temporarily "downgrade" my GPU so that it would pretend to be a lesser model, with
