For something simple like a counter if multiple threads will be increasing the number. I read that mutex locks can decrease efficiency since the threads have to wait. So, to
A minimal (standards compliant) mutex implementation requires 2 basic ingredients:
There is no way you can make it any simpler than this because of the 'synchronizes-with' relationship the C++ standard requires.
A minimal (correct) implementation might look like this:
class mutex {
    std::atomic<bool> flag{false};
public:
    void lock()
    {
        while (flag.exchange(true, std::memory_order_relaxed));
        std::atomic_thread_fence(std::memory_order_acquire);
    }
    void unlock()
    {
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(false, std::memory_order_relaxed);
    }
};
Due to its simplicity (it cannot suspend the thread of execution), it is likely that, under low contention, this implementation outperforms a std::mutex.
But even then, it is easy to see that each integer increment, protected by this mutex, requires the following operations:
atomic store to release the mutexatomic compare-and-swap (read-modify-write) to acquire the mutex (possibly multiple times)If you compare that with a standalone std::atomic<int> that is incremented with a single (unconditional) read-modify-write (eg. fetch_add),
it is reasonable to expect that an atomic operation (using the same ordering model)  will outperform the case whereby a mutex is used.