The main reason for using atomics over mutexes, is that mutexes are expensive but with the default memory model for atomics
being memory_order_seq_cst
A lock tends to require more operations than a simple atomic operation does. In the simplest cases, memory_order_seq_cst will be about twice as fast as locking because locking tends to require, at minimum two atomic operations in its implementation (one to lock, one to unlock). In many cases, it takes even more than that. However, once you start leveraging the memory orders, it can be much faster because you are willing to accept less synchronization.
Also, you'll often see "locking algorithms are always as fast as lock free algorithms." This is somewhat true. The basic idea is that if the fastest algorithm happens to be lock free, then the fastest algorithm without the lock-free guarentee is ALSO the same algorithm! However, if the fastest algortihm requires locks, then those demanding lockfree guarantees have to go find a slower algorithm.
In general, you will see lockfree algorithms in a few low level algorithms, where the performance of leveraging specialized opcodes helps. In almost all other code, locking is more than satisfactory performance, and much easier to read.