There is a popular spin-lock mutex version which is spreaded across the Internet and which one might encounter in the Anthony Williams book(C++ Concurrency in Action). Here
I think what you're missing is that test_and_set is atomic, period. There is no memory ordering setting that makes this operation not atomic. If all we needed was an atomic test and set, we could specify any memory ordering.
However, in this case, we need more than just an atomic "test and set" operation. We need to ensure that memory operations we performed after we confirmed that the lock was ours to take aren't re-ordered to before we observed the mutex to be unlocked. (Because those operations won't be atomic operations.)
Consider:
What is the one thing that must not happen? It's that the reads and writes in step 6 and 7 somehow get re-ordered to before step 5, stepping on another thread accessing shared data under protection of the mutex.
The test_and_set operation is already atomic, so steps 4 and 5 are inherently safe. And steps 1 and 2 can't modify protected data (because they occur before we even try to lock) so there's no harm in re-ordering them around our lock operation.
But steps 6 and 7 -- those must not be re-ordered to prior to us observing that the lock was unlocked so that we could atomically lock it. That would be a catastrophe.
The definition of memory_order_acquire: "A load operation with this memory order performs the acquire operation on the affected memory location: no memory accesses in the current thread can be reordered before this load."
Exactly what we need.