cuda-streams

The behavior of stream 0 (default) and other streams

Submitted by 旧巷老猫 on 2021-02-08 09:15:42
Question: In CUDA, how is stream 0 related to other streams? Does stream 0 (the default stream) execute concurrently with other streams in a context or not? Consider the following example:

cudaMemcpy(Dst, Src, sizeof(float) * datasize, cudaMemcpyHostToDevice); // stream 0
cudaStream_t stream1;
/* ...creating stream1... */
somekernel<<<blocks, threads, 0, stream1>>>(Dst); // stream 1

In the above code, can the compiler ensure that somekernel always launches after cudaMemcpy finishes, or will somekernel execute concurrently with the copy?
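
A sketch of the usual answer, assuming the legacy default-stream model: stream 0 is synchronizing, so work in another blocking stream does not start until earlier default-stream work completes, and plain cudaMemcpy additionally blocks the host until the copy is done. Concurrency with the default stream requires a stream created with cudaStreamNonBlocking (or compiling with --default-stream per-thread). In the snippet below, Dst, Src, datasize, blocks, threads, and somekernel are the question's names.

#include <cuda_runtime.h>

__global__ void somekernel(float *p) { /* uses p */ }

void example(float *Dst, float *Src, size_t datasize, int blocks, int threads) {
    cudaStream_t blocking, nonBlocking;
    cudaStreamCreate(&blocking);                                    // synchronizes with stream 0
    cudaStreamCreateWithFlags(&nonBlocking, cudaStreamNonBlocking); // does not synchronize with stream 0

    // Blocking host call: returns only after the copy has completed,
    // so every launch below is issued after the data is on the device.
    cudaMemcpy(Dst, Src, sizeof(float) * datasize, cudaMemcpyHostToDevice);

    // Ordered after any in-flight default-stream work (legacy model):
    somekernel<<<blocks, threads, 0, blocking>>>(Dst);

    // May overlap with later default-stream work, because of the
    // cudaStreamNonBlocking flag:
    somekernel<<<blocks, threads, 0, nonBlocking>>>(Dst);

    cudaDeviceSynchronize();
}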

Is it possible to manually set the SMs used for one CUDA stream?

Submitted by 强颜欢笑 on 2021-02-05 10:51:14
Question: By default, a kernel will use all available SMs of the device (given enough blocks). However, I now have two streams, one compute-intensive and one memory-intensive, and I want to limit the maximum number of SMs used by each stream (after setting the limit, the kernel in a stream would use at most that many SMs, e.g. 20 SMs for the compute-intensive stream and 4 SMs for the memory-intensive one). Is it possible to do so, and if so, which API should I use?

Answer 1: In short, no, there is no way to do what you are asking.
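
There is no CUDA API that pins a stream's kernels to a subset of SMs. The nearest supported knob is stream priority, sketched below on the assumption that a scheduling hint (rather than a hard SM partition) is acceptable; the kernels, grid sizes, and buffer size are hypothetical placeholders.

#include <cuda_runtime.h>

__global__ void computeKernel(float *d) { /* compute-heavy work */ }
__global__ void memoryKernel(float *d)  { /* memory-heavy work */ }

int main() {
    // Query the valid priority range; "greatest" is the numerically
    // smallest value and marks the highest priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t computeStream, memoryStream;
    cudaStreamCreateWithPriority(&computeStream, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&memoryStream,  cudaStreamNonBlocking, least);

    float *d;
    cudaMalloc(&d, 1 << 20);
    computeKernel<<<80, 256, 0, computeStream>>>(d);  // preferred by the block scheduler
    memoryKernel<<<80, 256, 0, memoryStream>>>(d);    // scheduled as capacity allows
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}

Note this is a hint to the block scheduler, not a partition: a high-priority kernel's blocks are preferred when SMs free up, but nothing caps how many SMs either kernel may occupy.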

Reading updated memory from other CUDA stream

Submitted by 笑着哭i on 2019-12-13 05:06:19
Question: I am trying to set a flag in one kernel function and read it in another. Basically, I'm trying to do the following:

#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

#define FLAGCLEAR 0
#define FLAGSET 1

using namespace std;

__global__ void set_flag(int *flag) {
    *flag = FLAGSET;
    // Wait for flag to reset.
    while (*flag == FLAGSET);
}

__global__ void read_flag(int *flag) {
    // Wait for the flag to set.
    while (*flag != FLAGSET);
    // Clear it for next time.
    *flag = FLAGCLEAR;
}

int …
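
A sketch of the usual repair, with the caveat that inter-kernel spinning like this sits outside what the CUDA programming model formally guarantees: qualify the flag pointer as volatile so each poll re-reads memory rather than a cached register value, and launch the two kernels into two distinct non-default streams with grids small enough that both are resident at once; otherwise one kernel spins forever and the other never starts.

#include <cuda_runtime.h>

#define FLAGCLEAR 0
#define FLAGSET 1

// volatile forces every poll to touch device memory.
__global__ void set_flag(volatile int *flag)  { *flag = FLAGSET;  while (*flag == FLAGSET); }
__global__ void read_flag(volatile int *flag) { while (*flag != FLAGSET); *flag = FLAGCLEAR; }

int main() {
    int *flag;
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));  // start cleared (FLAGCLEAR)

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    read_flag<<<1, 1, 0, s1>>>(flag);  // spins until the flag is set
    set_flag<<<1, 1, 0, s2>>>(flag);   // sets it, then spins until cleared
    cudaDeviceSynchronize();

    cudaFree(flag);
    return 0;
}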

CUDA streams not overlapping

Submitted by ぃ、小莉子 on 2019-12-12 08:25:45
Question: I have something very similar to this code:

int k, no_streams = 4;
cudaStream_t stream[no_streams];
for (k = 0; k < no_streams; k++)
    cudaStreamCreate(&stream[k]);

cudaMalloc(&g_in, size1 * no_streams);
cudaMalloc(&g_out, size2 * no_streams);

for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(g_in + k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);
for (k = 0; k < no_streams; k++)
    mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in + k*size1/sizeof(float), g_out + k*size2/sizeof(float));
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(h_ptr_out[k], g_out + k*size2/sizeof(float), …
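
The most common reason such a pipeline fails to overlap (a guess consistent with the code shown, which never reveals how h_ptr_in/h_ptr_out are allocated): cudaMemcpyAsync only overlaps copies with kernels when the host buffers are page-locked; with pageable memory the copies silently degrade to synchronous behavior. A minimal sketch of the allocation change, with placeholder sizes:

// Pinned (page-locked) host buffers are required for copy/kernel overlap.
const size_t size1 = 1 << 20, size2 = 1 << 20;  // placeholder sizes
float *h_ptr_in[4], *h_ptr_out[4];
for (int k = 0; k < 4; k++) {
    cudaMallocHost(&h_ptr_in[k],  size1);  // instead of malloc/new
    cudaMallocHost(&h_ptr_out[k], size2);
}

// ... issue the three loops from the question unchanged ...

for (int k = 0; k < 4; k++) {
    cudaFreeHost(h_ptr_in[k]);
    cudaFreeHost(h_ptr_out[k]);
}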

Get rid of busy waiting during asynchronous cuda stream executions

Submitted by 陌路散爱 on 2019-12-10 20:23:00
Question: I am looking for a way to get rid of the busy waiting in the host thread in the following code (do not copy this code, it only shows the idea of my problem, and it has many basic bugs):

cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    while (true) {
        if (cudaStreamQuery(streams[sid]) == cudaSuccess) { // BUSY WAITING !!!!
            cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
            …
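
A sketch of the standard remedy: because calls queued on one stream already run in order and the async APIs return immediately, the host never needs to poll at all; it can enqueue every chunk round-robin and block once at the end. The names (S_N, DATA_SIZE, DATA_STEP, streams, d_data, h_data) follow the question; the per-chunk kernel is left as a comment.

// No polling: each stream serializes its own work, so just enqueue.
// Assumes d_data is large enough to hold all chunks.
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data + d, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    // enqueue the kernel that processes this chunk on streams[sid] here
    sid = (sid + 1) % S_N;
}
for (int i = 0; i < S_N; i++)
    cudaStreamSynchronize(streams[i]);  // a single blocking wait per stream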

Stream scheduling order

Submitted by 北战南征 on 2019-12-10 12:02:19
Question: The way I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?

allOfData_A = data_A1 + data_A2
allOfData_B = data_B1 + data_B2
allOfData_C = data_C1 + data_C2

Data_C is the output of the kernel operating on both Data_A and Data_B (like C = A + B). The hardware supports one device-overlap (concurrent) operation.

Process One:
MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B1 stream1 H->D
MemcpyAsync data_B2 stream2 …
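
For concreteness, a sketch of Process One as actual API calls, assuming a device with a single copy engine (which serializes all H->D transfers in issue order); the device pointers, size, grid, block, and addKernel are hypothetical placeholders:

// The four copies execute back-to-back on the one copy engine, in
// issue order. stream1's kernel therefore cannot start before copy 3
// (data_B1) completes, because its inputs are copies 1 and 3.
cudaMemcpyAsync(d_A1, data_A1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_A2, data_A2, size, cudaMemcpyHostToDevice, stream2);
cudaMemcpyAsync(d_B1, data_B1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_B2, data_B2, size, cudaMemcpyHostToDevice, stream2);
addKernel<<<grid, block, 0, stream1>>>(d_C1, d_A1, d_B1);  // C = A + B
addKernel<<<grid, block, 0, stream2>>>(d_C2, d_A2, d_B2);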

How to reduce CUDA synchronize latency / delay

Submitted by 不想你离开。 on 2019-12-07 02:47:05
Question: This question is related to using CUDA streams to run many kernels. In CUDA there are several synchronization commands: cudaStreamSynchronize, cudaDeviceSynchronize, cudaThreadSynchronize, and also cudaStreamQuery to check whether streams are empty. When using the profiler I noticed that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency, apart from, of course, using as few synchronization commands as possible. Also, are there any figures for judging the most efficient synchronization method? That is, consider 3 streams used in an…
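
One commonly suggested knob, offered here as a sketch rather than a guaranteed cure: the cost of host-side synchronization depends on how the driver waits. Spinning gives the lowest latency but occupies a CPU core; yielding or blocking trades latency for CPU time. The flags below are real runtime flags; the right choice is workload-dependent, and `stream` stands for one of the streams in question.

// Must be called before the CUDA context is created (i.e., before
// other runtime calls touch the device).
cudaSetDeviceFlags(cudaDeviceScheduleSpin);            // lowest latency, burns a CPU core
// cudaSetDeviceFlags(cudaDeviceScheduleYield);        // middle ground
// cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); // lowest CPU use, highest latency

// Per-wait alternative: a blocking-sync event, so only this wait
// sleeps instead of spinning.
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventBlockingSync | cudaEventDisableTiming);
cudaEventRecord(done, stream);
cudaEventSynchronize(done);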
