What are the requirements for using `shfl` operations on AMD GPU using HIP C++?

Submitted by 不羁的心 on 2019-12-11 07:23:55

Question


There is AMD HIP C++, which is very similar to CUDA C++. AMD has also created Hipify to convert CUDA C++ into HIP C++ (portable C++ code), which can be executed on both nVidia and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP

  • There are requirements for using shfl operations on nVidia GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia

Requirement for nVidia

please make sure you have a device of compute capability 3.0 or higher in order to use warp shfl operations, and add the -gencode arch=compute_30,code=sm_30 nvcc flag in the Makefile while using this application.

  • It is also noted that HIP supports shfl for AMD's 64-lane wavefront (the warp equivalent): https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_faq.md#why-use-hip-rather-than-supporting-cuda-directly

In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.
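To make that portability point concrete, below is a minimal HIP C++ sketch of my own (not from the HIP samples) of a wavefront-wide reduction. By reading the built-in warpSize instead of hard-coding 32, the same kernel works on a 32-lane nVidia warp and a 64-lane AMD wavefront:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Wavefront-wide sum via a butterfly reduction built on __shfl_down.
// Using the built-in warpSize instead of a hard-coded 32 keeps the loop
// correct on both 32-lane nVidia and 64-lane AMD hardware.
__global__ void wave_reduce(const int* in, int* out) {
    int val = in[threadIdx.x];
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);  // cross-lane, register-to-register
    if (threadIdx.x == 0) *out = val;     // lane 0 holds the total
}

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    int n = prop.warpSize;                // 64 on GCN, 32 on nVidia

    std::vector<int> h_in(n, 1);
    int *d_in, *d_out;
    hipMalloc(&d_in, n * sizeof(int));
    hipMalloc(&d_out, sizeof(int));
    hipMemcpy(d_in, h_in.data(), n * sizeof(int), hipMemcpyHostToDevice);

    hipLaunchKernelGGL(wave_reduce, dim3(1), dim3(n), 0, 0, d_in, d_out);

    int sum = 0;
    hipMemcpy(&sum, d_out, sizeof(int), hipMemcpyDeviceToHost);
    printf("wavefront size %d, sum %d\n", n, sum);  // expect sum == n

    hipFree(d_in);
    hipFree(d_out);
    return 0;
}
```

Built with hipcc, the same source runs on either vendor; on nVidia the CC 3.0 requirement quoted above still applies.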

But which AMD GPUs support the shfl functions? Or does every AMD GPU support shfl, given that on AMD it could be implemented through local memory without a hardware register-to-register instruction?

nVidia GPUs require compute capability (CUDA CC) 3.0 or higher, but what are the requirements for using shfl operations on AMD GPUs with HIP C++?


Answer 1:


  1. Yes. GCN3 GPUs introduce new instructions such as ds_bpermute and ds_permute, which provide the functionality of __shfl() and more.

  2. The ds_bpermute and ds_permute instructions use only the routing hardware of local memory (LDS, 8.6 TB/s) without actually accessing LDS memory, which accelerates data exchange between threads: 8.6 TB/s < speed < 51.6 TB/s (see the sketch after the quote below): http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.
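As a hedged illustration of how this can be reached from HIP C++ source, assuming the clang/ROCm builtin __builtin_amdgcn_ds_bpermute (which compiles only for the AMD GCN backend), a 32-bit shuffle can be sketched like this:

```cpp
// Sketch, not an official HIP API: a __shfl-style read implemented directly
// on the GCN3 backward-permute instruction. Each lane reads the value held
// by src_lane; no LDS memory is written, only the LDS routing network is used.
__device__ int shfl_via_bpermute(int var, int src_lane) {
    // ds_bpermute_b32 addresses lanes in bytes, hence the multiply by 4.
    return __builtin_amdgcn_ds_bpermute(src_lane << 2, var);
}
```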

  3. There are also Data-Parallel Primitives (DPP), which are especially powerful because an instruction can read the registers of neighboring work-items directly. That is, DPP can access a neighboring thread (work-item) at full speed, ~51.6 TB/s (a sketch follows the example below):

http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

now, most of the vector instructions can do cross-lane reading at full throughput.

For example, the wave_shr instruction (wavefront shift right) can serve as a building block of a scan algorithm:
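Here is a minimal sketch of driving wave_shr from HIP C++, assuming the clang builtin __builtin_amdgcn_mov_dpp (AMD GCN backend only; the DPP control value 0x138 encodes wave_shr:1):

```cpp
// Sketch, GCN3-only: shift the values of the whole wavefront right by one
// lane. row_mask/bank_mask 0xF enable all rows and banks; bound_ctrl = true
// is assumed here to make the out-of-range lane (lane 0) read 0.
__device__ int wave_shift_right_1(int var) {
    return __builtin_amdgcn_mov_dpp(var, 0x138, 0xF, 0xF, true);
}

// First step of a Hillis-Steele inclusive scan: each lane adds the value of
// its left neighbor (lane 0 adds 0).
__device__ int scan_step_1(int var) {
    return var + wave_shift_right_1(var);
}
```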

More about GCN3: https://github.com/olvaffe/gpu-docs/raw/master/amd-open-gpu-docs/AMD_GCN3_Instruction_Set_Architecture.pdf

New Instructions

  • “SDWA” – Sub Dword Addressing allows access to bytes and words of VGPRs in VALU instructions.
  • “DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes.
  • DS_PERMUTE_RTN_B32, DS_BPERMUTE_RTN_B32.

...

DS_PERMUTE_B32 Forward permute. Does not write any LDS memory.
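To contrast the two flavors, here is a hedged sketch using the corresponding clang builtins (again assuming the GCN backend is targeted; both builtins take byte-addressed lane indices):

```cpp
// Sketch: the two permute flavors exposed as clang builtins on GCN3+.
// ds_permute  = forward  ("push"): each lane writes its value to dst_lane.
// ds_bpermute = backward ("pull"): each lane reads the value of src_lane.
// Neither touches LDS memory; only the LDS routing hardware is used.
__device__ int push_to_lane(int var, int dst_lane) {
    return __builtin_amdgcn_ds_permute(dst_lane << 2, var);
}
__device__ int pull_from_lane(int var, int src_lane) {
    return __builtin_amdgcn_ds_bpermute(src_lane << 2, var);
}
```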



Source: https://stackoverflow.com/questions/42468984/what-are-the-requirements-for-using-shfl-operations-on-amd-gpu-using-hip-c
