Is there any guarantee that all threads in a wavefront (OpenCL) are always synchronized?

Submitted by 雨燕双飞 on 2019-12-04 19:52:40

First, you can query some values:

CL_DEVICE_WAVEFRONT_WIDTH_AMD
CL_DEVICE_SIMD_WIDTH_AMD
CL_DEVICE_WARP_SIZE_NV
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

but only from the host side, as far as I know.
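For reference, a minimal host-side sketch of how such a query could look (assuming an already-created device and kernel; the vendor tokens come from the cl_amd_device_attribute_query / cl_nv_device_attribute_query extensions declared in CL/cl_ext.h and only succeed on the matching vendor's driver):

    #include <stdio.h>
    #include <CL/cl.h>
    #include <CL/cl_ext.h>   /* vendor-specific query tokens */

    /* Hypothetical helper: 'device' and 'kernel' are assumed to be
       valid, already-created OpenCL objects. */
    static void print_simd_hints(cl_device_id device, cl_kernel kernel)
    {
        size_t preferred_multiple = 0;
        /* Portable query: the preferred work-group size multiple,
           which usually equals the wavefront/warp width. */
        clGetKernelWorkGroupInfo(kernel, device,
            CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
            sizeof(preferred_multiple), &preferred_multiple, NULL);
        printf("preferred work-group size multiple: %zu\n",
               preferred_multiple);

    #ifdef CL_DEVICE_WAVEFRONT_WIDTH_AMD
        cl_uint wavefront = 0;   /* AMD-only; fails on other vendors */
        if (clGetDeviceInfo(device, CL_DEVICE_WAVEFRONT_WIDTH_AMD,
                sizeof(wavefront), &wavefront, NULL) == CL_SUCCESS)
            printf("AMD wavefront width: %u\n", wavefront);
    #endif

    #ifdef CL_DEVICE_WARP_SIZE_NV
        cl_uint warp = 0;        /* NVIDIA-only */
        if (clGetDeviceInfo(device, CL_DEVICE_WARP_SIZE_NV,
                sizeof(warp), &warp, NULL) == CL_SUCCESS)
            printf("NV warp size: %u\n", warp);
    #endif
    }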

Let's assume these queries returned 64, and that your question is about the threads' implicit synchronization.

What if someone chooses a local range of 4?

Since OpenCL abstracts the hardware clockwork away from the developer, you can't know the actual SIMD or wavefront width from within a kernel at runtime.

For example, an AMD NCU has 64 shaders, but it contains a 16-wide SIMD, an 8-wide SIMD, a 4-wide SIMD, a 2-wide SIMD and even two scalar units inside the same compute unit.

4 local threads could be spread across the two scalar units and one 2-wide unit, or any other combination of SIMDs. Kernel code can't know this, and even if it could somehow find out, you can't know which SIMD combination will be used for the next kernel execution (or even the next workgroup) at runtime, on a random compute unit (64 shaders).

Or a GCN CU, which has 4×16-wide SIMDs in it, could allocate one thread per SIMD, making all 4 threads totally independent. If they all reside in the same SIMD, you're lucky, but there is no way to know that "before" kernel execution. Even once you know, the next kernel could be different, since there is no guarantee of choosing the same SIMD allocation (background kernels, 3D visualization software, even the OS could be putting bubbles in the pipelines).

There is no way to command, hint, or query that N threads run in the same SIMD or the same warp before kernel execution. Then, inside the kernel, there is no built-in that returns a thread's wavefront index the way get_global_id(0) returns its global index. Then, after the kernel, you can't rely on array results, since the next kernel execution may not use the same SIMDs for exactly the same items. Even some items from other wavefronts could be swapped with an item from the current wavefront, purely as an optimization by the driver or the hardware (NVIDIA has had a load balancer lately and could be doing this; AMD's NCU may use something similar in the future).
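To make the missing built-in concrete: the best a kernel can do is derive a guessed lane index from get_local_id(0) and a wavefront width queried on the host. A sketch (the assumed_wave parameter is my illustration; nothing guarantees it matches how the hardware actually packed the threads):

    /* OpenCL C: there is no get_wavefront_id(); you can only guess. */
    __kernel void lane_guess(__global uint *out, const uint assumed_wave)
    {
        uint lid  = get_local_id(0);
        uint lane = lid % assumed_wave;  /* guessed lane within a wavefront */
        uint wave = lid / assumed_wave;  /* guessed wavefront index in group */
        out[get_global_id(0)] = wave * 1000u + lane;
    }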

Even if you guess the right combination of threads on SIMDs for your hardware and driver, it could be totally different on another computer.


If you are asking from a performance point of view, you could try ensuring:

  • zero branches in the kernel code
  • zero kernels running in the background
  • the GPU is not being used for monitor output
  • the GPU is not being used by some visualization software

just to make it 99% probable that there are no bubbles in the pipelines, so all threads retire an instruction at the same cycle (or at least synchronize at the latest-retiring one).

Or, add a fence after every instruction to synchronize on global or local memory, which is very slow. Fences give work-item-level synchronization, barriers give work-group-level synchronization; there are no wavefront-level synchronization commands.
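A small OpenCL C sketch of the two levels that do exist (the kernel name and arguments are mine, for illustration):

    __kernel void sync_levels(__global int *g, __local int *l)
    {
        uint lid = get_local_id(0);
        l[lid] = g[get_global_id(0)];
        /* work-item level: orders this work-item's own memory accesses */
        mem_fence(CLK_LOCAL_MEM_FENCE);
        /* work-group level: all work-items in the group meet here */
        barrier(CLK_LOCAL_MEM_FENCE);
        /* nothing in between: no wavefront_barrier() exists */
        g[get_global_id(0)] = l[(lid + 1u) % get_local_size(0)];
    }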

Then, the threads that run within the same SIMD will behave synchronized, but you can't know which threads those are, nor on which SIMDs they run.

For the 4-thread example, using float16 for all calculations may let the driver use the 16-wide SIMDs of an AMD GCN CU for the computation, but then they are not threads anymore, only variables. Still, this should give better synchronization on data than threads do.
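As a sketch of that idea (a hypothetical vectorized kernel, assuming the data is already packed 16-wide):

    /* 16 "lanes" live inside one work-item as vector components,
       so they advance in lockstep by definition. */
    __kernel void vec_add16(__global const float16 *a,
                            __global const float16 *b,
                            __global float16 *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];   /* one instruction covers all 16 lanes */
    }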

There are more complex situations such as:

  • 4 threads in the same SIMD, where one thread's calculation produces a NaN value and performs an extra normalization on it (taking maybe 1-2 cycles). The other 3 would have to wait for its completion, unless the hardware handles such data-related slowdowns independently per lane.

  • 4 threads in the same wavefront are in a loop and one of them gets stuck forever. Do the other 3 wait for the 4th one forever, or does the driver detect this and move it to another free SIMD? Or does the 4th one wait for the other 3 at the same time, because they are not moving either?

  • 4 threads doing atomic operations, one by one.

  • AMD's HD 5000 series GPUs have a SIMD width of 4 (or 5), but a wavefront size of 64.

Wavefronts guarantee lockstep. That's why, on older compilers, you could omit the synchronizations if your local group contains only one wavefront. (You can no longer do this on newer compilers, which will interpret the dependency wrongly and give you wrong code. On the other hand, newer compilers will omit the synchronizations for you if your local group contains only one wavefront.)
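The classic pattern this refers to is a local reduction that drops barrier() once the remaining communicating threads fit in one wavefront. A sketch assuming a 64-wide wavefront and a local size of at least 64; treat it as a historical illustration, since, as said above, newer compilers may miscompile it:

    __kernel void reduce_sum(__global const float *in,
                             __global float *out,
                             __local float *scratch)
    {
        uint lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* barriers are still required while threads from
           different wavefronts communicate */
        for (uint s = get_local_size(0) / 2; s > 32; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* the last 64 participants share one 64-wide wavefront,
           so old compilers tolerated dropping the barriers here */
        if (lid < 32) {
            volatile __local float *v = scratch;
            v[lid] += v[lid + 32];
            v[lid] += v[lid + 16];
            v[lid] += v[lid + 8];
            v[lid] += v[lid + 4];
            v[lid] += v[lid + 2];
            v[lid] += v[lid + 1];
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }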

One stream processor is like one CPU core. It repeatedly runs one 16-wide vector instruction four times to serve the 64 so-called "threads" of a wavefront (16 lanes × 4 cycles = 64). So one wavefront is really more of a thread than what we call a thread on a GPU.
