GLSL: about coherent qualifier

Question

I don't clearly understand how the coherent qualifier and atomic operations work together.

I perform some accumulating operation on the same SSBO location with this code:

uint prevValue, newValue;
uint readValue = ssbo[index];
do
{
    prevValue = readValue;
    newValue = F(readValue);
}
while((readValue = atomicCompSwap(ssbo[index], prevValue, newValue)) != prevValue);
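For readers who want to compile the loop, here is a minimal self-contained sketch. The buffer declaration, the work group size, the fixed index and the body of F are all assumptions made for illustration; none of them come from the original post.

```glsl
#version 430 core
layout(local_size_x = 64) in;

// Hypothetical buffer layout: the post does not show the declaration.
layout(std430, binding = 0) buffer Accum
{
    uint ssbo[];
};

// Hypothetical stand-in for F: a saturating increment.
uint F(uint v)
{
    return (v == 0xFFFFFFFFu) ? v : v + 1u;
}

void main()
{
    // Every invocation targets the same slot to force contention (assumption).
    const uint index = 0u;

    uint prevValue, newValue;
    uint readValue = ssbo[index];   // plain, non-atomic first read
    do
    {
        prevValue = readValue;
        newValue  = F(readValue);
        // atomicCompSwap returns the value it found in memory; the swap
        // happened only if that value equals prevValue.
    }
    while ((readValue = atomicCompSwap(ssbo[index], prevValue, newValue)) != prevValue);
}
```

Note that the only non-atomic access is the first read. If that read returned an outdated value, the first atomicCompSwap would fail and hand back the value it actually found, so the loop would simply run one more iteration.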

This code works fine for me, but do I still need to declare the SSBO (or image) with the coherent qualifier in this case?

And do I need coherent when I only call atomicAdd?

When exactly do I need the coherent qualifier? Is it only required for direct writes such as ssbo[index] = value;?

Answer 1:

TL;DR

I found evidence that supports both answers regarding coherent.

Current score:

Requiring coherent with atomics: 1.5
Omitting coherent with atomics: 4.75

Bottom line, still not sure despite the score. Inside a single workgroup, I'm mostly convinced coherent is not required in practice. I'm not so sure in these cases:

more than 1 workgroup in glDispatchCompute
multiple glDispatchCompute calls that all access the same memory location (atomically) without any glMemoryBarrier between them

However, is there a performance cost to declaring SSBOs (or individual struct members) coherent when you only access them through atomic operations? Based on what is below, I don't believe there is because coherent adds "visibility" instructions or instruction flags at the variable's read or write operations. If a variable is only accessed through atomic operations, the compiler should either:

ignore coherent when emitting the atomic instructions because it has no effect
use the appropriate mechanism to make sure the result of the atomic operation is visible outside the shader invocation, warp, workgroup or rendering command.

From the OpenGL wiki's "Memory Model" page:

Note that atomic counters are different functionally from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like.

+1 for requiring coherent

The code from Intel's article "OpenGL Performance Tips: Atomic Counter Buffers versus Shader Storage Buffer Objects"

// Fragment shader used for ACB gets output color from a texture
#version 430 core

uniform sampler2D texUnit;
layout(binding = 0) uniform atomic_uint acb[ s(nCounters) ];
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;

void main()
{
    for (int i=0; i<  s(nCounters) ; ++i) atomicCounterIncrement(acb[i]);
    fragColor = texture(texUnit, texcoord);
}

// Fragment shader used for SSBO gets output color from a texture
#version 430 core

uniform sampler2D texUnit;
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
layout(std430, binding = 0) buffer ssbo_data
{
    uint v[ s(nCounters) ];
};

void main()
{
    for (int i=0; i< s(nCounters) ; ++i) atomicAdd(v[i], 1);
    fragColor = texture(texUnit, texcoord);
}

Notice that ssbo_data in the second shader is not declared coherent.

The article also states:

The OpenGL foundation recommends using [atomic counter buffers] over SSBOs for various reasons; however improved performance is not one of them. This is because ACBs are internally implemented as SSBO atomic operations; therefore there are no real performance benefits from utilizing ACBs.

So atomic counters are actually the same thing as SSBOs apparently. (But what are those "various reasons" and where are those recommendations? Is Intel hinting at a conspiracy in favor of atomic counters...?)

+1 for omitting coherent

GLSL spec

The GLSL spec uses different wording when describing coherent and atomic operations (emphasis mine):

(4.10) When accessing memory using variables not declared as coherent, the memory accessed by a shader may be cached by the implementation to service future accesses to the same address. Memory stores may be cached in such a way that the values written might not be visible to other shader invocations accessing the same memory. The implementation may cache the values fetched by memory reads and return the same values to any shader invocation accessing the same memory, even if the underlying memory has been modified since the first memory read.

(8.11) Atomic memory functions perform atomic operations on an individual signed or unsigned integer stored in buffer-object or shared-variable storage. All of the atomic memory operations read a value from memory, compute a new value using one of the operations described below, write the new value to memory, and return the original value read. The contents of the memory being updated by the atomic operation are guaranteed not to be modified by any other assignment or atomic memory function in any shader invocation between the time the original value is read and the time the new value is written.

All the built-in functions in this section accept arguments with combinations of restrict, coherent, and volatile memory qualification, despite not having them listed in the prototypes. The atomic operation will operate as required by the calling argument’s memory qualification, not by the built-in function’s formal parameter memory qualification.

So on the one hand atomic operations are supposed to work directly with the storage's memory (does that imply bypassing possible caches?). On the other hand, it seems that memory qualifications (e.g. coherent) play a role in what the atomic operation does.

+0.5 for requiring coherent

The OpenGL 4.6 spec sheds more light on this issue in section 7.13.1 "Shader Memory Access Ordering"

The built-in atomic memory transaction and atomic counter functions may be used to read and write a given memory address atomically. While built-in atomic functions issued by multiple shader invocations are executed in undefined order relative to each other, these functions perform both a read and a write of a memory address and guarantee that no other memory transaction will write to the underlying memory between the read and write. Atomics allow shaders to use shared global addresses for mutual exclusion or as counters, among other uses.

The intent of atomic operations then clearly seems to be, well, atomic all the time and not depending on a coherent qualifier. Indeed, why would one want an atomic operation that isn't somehow combined between different shader invocations? Incrementing a locally cached value from multiple invocations and having all of them eventually write a completely independent value makes no sense.

+1 for omitting coherent
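For contrast, here is a sketch of the situation section 4.10 is describing: plain (non-atomic) stores that another invocation in the same work group reads back. This is the case where coherent, together with a memory barrier and barrier(), is the commonly documented pattern. The block names, bindings and the neighbour-sum logic are made up for illustration, and both buffers are assumed to hold at least one uint per invocation.

```glsl
#version 430 core
layout(local_size_x = 64) in;

// Hypothetical scratch buffer. It is accessed with plain loads and stores,
// so the coherent qualifier is what asks for visibility across invocations.
layout(std430, binding = 0) coherent buffer Scratch
{
    uint values[];
};

// Hypothetical output buffer, written once per invocation.
layout(std430, binding = 1) buffer Result
{
    uint sums[];
};

void main()
{
    uint base = gl_WorkGroupID.x * gl_WorkGroupSize.x;
    uint lid  = gl_LocalInvocationID.x;

    values[base + lid] = lid;   // plain store, not an atomic

    memoryBarrierBuffer();      // order the store before later buffer accesses
    barrier();                  // wait until every invocation in the work group got here

    // Read a value written by a different invocation of the same work group.
    uint neighbour = (lid + 1u) % gl_WorkGroupSize.x;
    sums[base + lid] = values[base + lid] + values[base + neighbour];
}
```

Nothing in this shader is atomic, which is why the caching language of 4.10 applies to it directly. The open question in this answer is only whether the same qualifier changes anything once every access goes through an atomic function.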

OpenGL spec issue #14

OpenGL 4.6: Do atomic counter buffers require the use of glMemoryBarrier calls to be able to access the counter?

We discussed this again in the OpenGL|ES meeting. Based on feedback from IHVs and their implementation of atomic counters we're planning to treat them like we treat other resources like image atomic, image load/store, buffer variables, etc. in that they require explicit synchronization from the application. The spec will be changed to add "atomic counters" to the places where the other resources are enumerated.

The described spec change occurred between OpenGL 4.5 and 4.6, but it relates to glMemoryBarrier, which plays no part inside a single glDispatchCompute call.

no effect

AMD instruction sets

AMD publishes Instruction Set Architecture (ISA) documents which, coupled with the Radeon GPU Analyzer, give insight into how GPUs actually implement this.

With this simple compute shader:

#version 460
layout(local_size_x = 512) in;

layout(binding=0) restrict buffer A
{
    uint count;
    float data[];
} non_coherent_buf;

layout(binding=1) coherent restrict buffer B
{
    uint count;
    float data[];
} coherent_buf;

void main()
{
    // Non-coherent qualified SSBO
    uint read_value1 = atomicAdd(non_coherent_buf.count, 1);

    // coherent qualified SSBO
    uint read_value2 = atomicAdd(coherent_buf.count, 1);
}

We can see that both atomic operations (on coherent and non-coherent buffers) result in the same instruction on all supported architectures of the Radeon GPU Analyzer:

buffer_atomic_add v0, v0, s[8:11], 0 // 000000000034: E1080000 80020000

Decoding this instruction shows that the GLC (Globally Coherent) flag is set to 0 which means, for atomic operations, "Previous data value is not returned. No L1 persistence across wavefronts". Making sure that the returned value is used changes the instructions to set the GLC flag to 1 which means "Previous data value is returned. No L1 persistence across wavefronts".

The documents dating from 2013 (Sea Islands, etc.) have an interesting description of the BUFFER_ATOMIC_<op> instructions:

Buffer object atomic operation. Always globally coherent.

So on AMD hardware at least, it appears coherent has no effect for atomic operations. Note that instructions are different for non-atomic reads and writes depending on the coherent qualifier.

+1 for omitting coherent
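To check the remark about non-atomic accesses yourself, a variant of the shader above with plain loads and stores can be fed to the Radeon GPU Analyzer. This is only a sketch: it reuses the two buffer declarations from above and swaps the atomics for ordinary reads and writes of data[]. The generated instructions depend on the target architecture and compiler version, so no expected disassembly is given here.

```glsl
#version 460
layout(local_size_x = 512) in;

layout(binding=0) restrict buffer A
{
    uint count;
    float data[];
} non_coherent_buf;

layout(binding=1) coherent restrict buffer B
{
    uint count;
    float data[];
} coherent_buf;

void main()
{
    uint i = gl_GlobalInvocationID.x;

    // Plain load and store on the non-coherent SSBO
    float a = non_coherent_buf.data[i];
    non_coherent_buf.data[i] = a + 1.0;

    // The same plain load and store on the coherent SSBO
    float b = coherent_buf.data[i];
    coherent_buf.data[i] = b + 1.0;
}
```

Comparing the two load/store pairs in the disassembly (for example their cache-related flags such as GLC) is a direct way to see where the coherent qualifier does make a difference on a given AMD target.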

NVIDIA

The Parallel Thread Execution ISA section of the CUDA Toolkit documentation has this information:

8.5. Scope

Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are three scopes:

Table 18. Scopes

.cta: The set of all threads executing in the same CTA as the current thread.

.gpu: The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.

.sys: The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.

Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.

Regarding CTA:

A cooperative thread array (CTA) is a set of concurrent threads that execute the same kernel program. A grid is a set of CTAs that execute independently.

So in GLSL terms, CTA == work group and grid == glDispatchCompute call.

The atom instruction description:

9.7.12.4. Parallel Synchronization and Communication Instructions: atom

Atomic reduction operations for thread-to-thread communication.

[...]

The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model.

[...]

If no scope is specified, the atomic operation is performed with .gpu scope.

So by default, all shader invocations of a glDispatchCompute would see the result of an atomic operation... unless the GLSL compiler generates something that uses the cta scope in which case it would only be visible inside the workgroup. This latter case however corresponds to shared GLSL variables so perhaps it's only used for those and not for SSBO operations. NVIDIA isn't very open about this process so I haven't found a way to tell for sure (perhaps with glGetProgramBinary). However, since the semantics of cta map to a work group and gpu to buffers (i.e. SSBO, images, etc), I declare:

+0.5 for omitting coherent
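The distinction between shared variables and buffer variables can be shown in GLSL itself. The sketch below counts per work group with an atomic on a shared variable, then folds the result into an SSBO with a single buffer atomic. The block name, the counter and the work group size are hypothetical, and whether an NVIDIA compiler really emits .cta scope for the first atomic and .gpu scope for the second is the same guess as in the paragraph above, not something verified.

```glsl
#version 430 core
layout(local_size_x = 256) in;

// Hypothetical SSBO holding one global counter; note: no coherent qualifier.
layout(std430, binding = 0) buffer Totals
{
    uint total;
};

// Visible only to invocations of the same work group.
shared uint localCount;

void main()
{
    if (gl_LocalInvocationIndex == 0u)
        localCount = 0u;
    memoryBarrierShared();
    barrier();                         // initialisation done before anyone adds

    atomicAdd(localCount, 1u);         // atomic on shared storage (work group only)

    memoryBarrierShared();
    barrier();                         // all increments done before the read below

    if (gl_LocalInvocationIndex == 0u)
        atomicAdd(total, localCount);  // one atomic on buffer storage per work group
}
```

Dispatching this with more than one work group and reading total back after a suitable glMemoryBarrier would be a simple test of the multi-work-group case that this answer leaves open.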

SPIR-V (with Vulkan target)

The same shader from the AMD section above compiles to this SPIR-V code using the glslang SPIR-V generator:

// Generated with glslangValidator.exe -H --target-env vulkan1.1 in.comp
in.comp
// Module Version 10000
// Generated by (magic number): 80008
// Id's are bound by 31

                              Capability Shader
               1:             ExtInstImport  "GLSL.std.450"
                              MemoryModel Logical GLSL450
                              EntryPoint GLCompute 4  "main"
                              ExecutionMode 4 LocalSize 512 1 1
                              Source GLSL 460
                              Name 4  "main"
                              Name 8  "read_value1"
                              Name 11  "A"
                              MemberName 11(A) 0  "count"
                              MemberName 11(A) 1  "data"
                              Name 13  "non_coherent_buf"
                              Name 21  "read_value2"
                              Name 23  "B"
                              MemberName 23(B) 0  "count"
                              MemberName 23(B) 1  "data"
                              Name 25  "coherent_buf"
                              Decorate 10 ArrayStride 4
                              MemberDecorate 11(A) 0 Restrict
                              MemberDecorate 11(A) 0 Offset 0
                              MemberDecorate 11(A) 1 Restrict
                              MemberDecorate 11(A) 1 Offset 4
                              Decorate 11(A) BufferBlock
                              Decorate 13(non_coherent_buf) DescriptorSet 0
                              Decorate 13(non_coherent_buf) Binding 0
                              Decorate 22 ArrayStride 4
                              MemberDecorate 23(B) 0 Coherent
                              MemberDecorate 23(B) 0 Restrict
                              MemberDecorate 23(B) 0 Offset 0
                              MemberDecorate 23(B) 1 Coherent
                              MemberDecorate 23(B) 1 Restrict
                              MemberDecorate 23(B) 1 Offset 4
                              Decorate 23(B) BufferBlock
                              Decorate 25(coherent_buf) DescriptorSet 0
                              Decorate 25(coherent_buf) Binding 1
                              Decorate 30 BuiltIn WorkgroupSize
               2:             TypeVoid
               3:             TypeFunction 2
               6:             TypeInt 32 0
               7:             TypePointer Function 6(int)
               9:             TypeFloat 32
              10:             TypeRuntimeArray 9(float)
           11(A):             TypeStruct 6(int) 10
              12:             TypePointer Uniform 11(A)
13(non_coherent_buf):     12(ptr) Variable Uniform
              14:             TypeInt 32 1
              15:     14(int) Constant 0
              16:             TypePointer Uniform 6(int)
              18:      6(int) Constant 1
              19:      6(int) Constant 0
              22:             TypeRuntimeArray 9(float)
           23(B):             TypeStruct 6(int) 22
              24:             TypePointer Uniform 23(B)
25(coherent_buf):     24(ptr) Variable Uniform
              28:             TypeVector 6(int) 3
              29:      6(int) Constant 512
              30:   28(ivec3) ConstantComposite 29 18 18
         4(main):           2 Function None 3
               5:             Label
  8(read_value1):      7(ptr) Variable Function
 21(read_value2):      7(ptr) Variable Function
              17:     16(ptr) AccessChain 13(non_coherent_buf) 15
              20:      6(int) AtomicIAdd 17 18 19 18
                              Store 8(read_value1) 20
              26:     16(ptr) AccessChain 25(coherent_buf) 15
              27:      6(int) AtomicIAdd 26 18 19 18
                              Store 21(read_value2) 27
                              Return
                              FunctionEnd

The only difference between non_coherent_buf and coherent_buf is the decoration of the latter with e.g. OpMemberDecorate %B 0 Coherent. Their usage afterwards is identical.

Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and changes the (abbreviated) output to:

                              MemoryModel Logical VulkanKHR
// removal of MemberDecorate 23(B) 0 Coherent and MemberDecorate 23(B) 1 Coherent

which means... I don't quite know because I'm not versed in Vulkan's intricacies. I did find this informative section of the "Memory Model" appendix in the Vulkan 1.2 spec:

While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations. GLSL to SPIR-V compilers should map all (non-atomic) operations on a coherent variable to Make{Pointer,Texel}{Available}{Visible} flags in this model.

Atomic operations implicitly have availability/visibility operations, and the scope of those operations is taken from the atomic operation’s scope.

+1 for omitting coherent (maybe?)
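Under the Vulkan memory model the scope can also be spelled out in GLSL instead of being implied. The sketch below assumes the GL_KHR_memory_scope_semantics extension, which provides atomicAdd overloads taking scope, storage semantics and memory semantics arguments. The choice of gl_ScopeDevice with acquire/release semantics is an assumption made for illustration, not what glslang generates for the shader above.

```glsl
#version 460
#extension GL_KHR_memory_scope_semantics : require
#pragma use_vulkan_memory_model
layout(local_size_x = 512) in;

// No coherent qualifier: visibility is requested per operation instead.
layout(binding=0) restrict buffer A
{
    uint count;
    float data[];
} buf;

void main()
{
    // Scope and semantics are explicit arguments of the atomic itself.
    // gl_ScopeDevice and gl_SemanticsAcquireRelease are illustrative choices.
    uint read_value = atomicAdd(buf.count, 1u,
                                gl_ScopeDevice,
                                gl_StorageSemanticsBuffer,
                                gl_SemanticsAcquireRelease);

    // Use the result so the atomic's return value is not optimised away.
    buf.data[read_value] = 1.0;
}
```

Written this way, the scope of the atomic is visible in the source, which matches the spec text quoted above: the availability and visibility of an atomic come from the operation's own scope rather than from a decoration on the variable.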

Empirical evidence

I have written a particle system compute shader that uses an SSBO-backed variable as an operand to atomicAdd() and it works. Use of coherent was not necessary even with a work group size of 512. However, there was never more than 1 work group. This was tested mainly on an NVIDIA GTX 1080 so, as seen above, atomic operations on NVIDIA seem to always be at least visible inside the work group.

+0.25 for omitting coherent

Source: https://stackoverflow.com/questions/56340333/glsl-about-coherent-qualifier

Tags

opengl

synchronization

glsl

atomic