Stream scheduling order

Submitted by 北战南征 on 2019-12-10 12:02:19

Question


The way I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?

allOfData_A = data_A1 + data_A2
allOfData_B = data_B1 + data_B2
allOfData_C = data_C1 + data_C2

data_C is the output of a kernel operation on both data_A and data_B (like C = A + B).
The hardware supports one DeviceOverlap (concurrent) operation.

Process One:

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B1 stream1 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream1
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H

Process Two: (Same operation, different order)

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_B1 stream1 H->D
sameKernel stream1
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H
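For concreteness, the two issue orders above might be sketched in CUDA roughly as follows. The kernel name `addKernel`, the half-size chunking, and all pointer names are illustrative placeholders, not taken from the original post:

```cuda
// Hypothetical sketch of the two issue orders. addKernel, the chunk size,
// and all pointer/variable names are illustrative.
#include <cuda_runtime.h>

__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // C = A + B
}

// Process One: breadth-first -- alternate streams at every stage.
void processOne(const float *hA, const float *hB, float *hC,
                float *dA, float *dB, float *dC,
                int n, cudaStream_t s1, cudaStream_t s2) {
    int h = n / 2;                        // elements per half
    size_t bytes = h * sizeof(float);
    int blocks = (h + 255) / 256;

    cudaMemcpyAsync(dA,     hA,     bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(dA + h, hA + h, bytes, cudaMemcpyHostToDevice, s2);
    cudaMemcpyAsync(dB,     hB,     bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(dB + h, hB + h, bytes, cudaMemcpyHostToDevice, s2);
    addKernel<<<blocks, 256, 0, s1>>>(dA,     dB,     dC,     h);
    addKernel<<<blocks, 256, 0, s2>>>(dA + h, dB + h, dC + h, h);
    cudaMemcpyAsync(hC,     dC,     bytes, cudaMemcpyDeviceToHost, s1);
    cudaMemcpyAsync(hC + h, dC + h, bytes, cudaMemcpyDeviceToHost, s2);
}

// Process Two: depth-first -- issue all of stream1's pipeline, then stream2's.
void processTwo(const float *hA, const float *hB, float *hC,
                float *dA, float *dB, float *dC,
                int n, cudaStream_t s1, cudaStream_t s2) {
    int h = n / 2;
    size_t bytes = h * sizeof(float);
    int blocks = (h + 255) / 256;

    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s1);
    addKernel<<<blocks, 256, 0, s1>>>(dA, dB, dC, h);
    cudaMemcpyAsync(dA + h, hA + h, bytes, cudaMemcpyHostToDevice, s2);
    cudaMemcpyAsync(dB + h, hB + h, bytes, cudaMemcpyHostToDevice, s2);
    addKernel<<<blocks, 256, 0, s2>>>(dA + h, dB + h, dC + h, h);
    cudaMemcpyAsync(hC,     dC,     bytes, cudaMemcpyDeviceToHost, s1);
    cudaMemcpyAsync(hC + h, dC + h, bytes, cudaMemcpyDeviceToHost, s2);
}
```

Note that for `cudaMemcpyAsync` to actually overlap with computation, the host buffers must be pinned (allocated with `cudaMallocHost` or `cudaHostAlloc`), not ordinary pageable memory.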

Answer 1:


Using CUDA streams allows the programmer to express work dependencies by putting dependent operations in the same stream. Work in different streams is independent and can be executed concurrently.

On GPUs without HyperQ (compute capability 1.0 to 3.0) you can get false dependencies because the work for a DMA engine or for computation gets put into a single hardware pipe. Compute capability 3.5 brings HyperQ which allows for multiple hardware pipes and there you shouldn't get the false dependencies. The simpleHyperQ example illustrates this, and the documentation shows diagrams to explain what is going on more clearly.

Putting it simply, on devices without HyperQ you would need to do a breadth-first launch of your work to get maximum concurrency, whereas for devices with HyperQ you can do a depth-first launch. Avoiding the false dependencies is pretty easy, but not having to worry about it is easier!
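A breadth-first launch for the general multi-chunk case might look like the following sketch, where each stage is issued for every stream before moving on to the next stage. The names `nStreams`, `chunk`, `stream[]`, and `addKernel` are illustrative assumptions, not from the original answer:

```cuda
// Illustrative breadth-first issue order over nStreams chunks. Issuing each
// stage across all streams before the next stage avoids false dependencies
// in the single hardware pipe of pre-HyperQ devices.
for (int i = 0; i < nStreams; ++i)
    cudaMemcpyAsync(dA + i * chunk, hA + i * chunk, bytes,
                    cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < nStreams; ++i)
    cudaMemcpyAsync(dB + i * chunk, hB + i * chunk, bytes,
                    cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < nStreams; ++i)
    addKernel<<<grid, 256, 0, stream[i]>>>(dA + i * chunk, dB + i * chunk,
                                           dC + i * chunk, chunk);
for (int i = 0; i < nStreams; ++i)
    cudaMemcpyAsync(hC + i * chunk, dC + i * chunk, bytes,
                    cudaMemcpyDeviceToHost, stream[i]);
```

On a HyperQ device, a single depth-first loop (one iteration issuing the whole copy-kernel-copy pipeline per stream) achieves the same concurrency without this reordering.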



Source: https://stackoverflow.com/questions/14837622/stream-scheduling-order
