Just curious: which CPU architectures support compare-and-swap atomic primitives?
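A few people asked in the comments whether the "lock" prefix is needed on x86/x64 for cmpxchg. On multicore machines the answer is yes: without "lock", another core can modify the memory location between the instruction's read and its write. On a single-core machine the instruction is completely atomic even without "lock".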
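For anyone who would rather not write the instruction by hand, here is a minimal sketch using C11 atomics. The helper names cas_int and fetch_increment are made up for illustration. On x86-64, GCC and Clang typically compile these calls to a "lock cmpxchg", so the prefix question is handled by the compiler; on other architectures you get whatever primitive that target provides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: atomically store `desired` into *target only if
   *target still equals `expected`. Returns true if the swap happened. */
static bool cas_int(atomic_int *target, int expected, int desired)
{
    /* On failure the call writes the current value into `expected`;
       here that is only a local copy, so the caller never sees it. */
    return atomic_compare_exchange_strong(target, &expected, desired);
}

/* Hypothetical helper: lock-free increment built from a CAS retry loop.
   Returns the value the counter held before the increment. */
static int fetch_increment(atomic_int *counter)
{
    int old = atomic_load(counter);
    /* The weak variant may fail spuriously, so it must sit in a loop.
       Each failure refreshes `old` with the current value. */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        /* retry with the refreshed `old` */
    }
    return old;
}
```

With a counter holding 5, cas_int(&v, 5, 7) succeeds and leaves 7, while a second cas_int(&v, 5, 9) fails and leaves the value untouched. The retry loop in fetch_increment is the usual way to build larger atomic operations out of a bare compare and swap.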
It's been a while since I studied this in depth, but I seem to remember that the instruction is technically restartable: the CPU can abort it mid-flight (as long as it hasn't had any side effects yet) to avoid delaying interrupt handling for too long.