As part of their execution pipeline, modern x86 processors "compile" x86 instructions into a lower-level set of operations. Intel calls these uOps and AMD calls them rOps, but what it boils down to is that certain kinds of single x86 instructions are executed by the actual functional units in the CPU as several steps.
That means, for example, that:
INC EAX
gets executed as a single "mini-op", something like uOp.inc eax
(let me call it that; the real ones are not exposed).
For other operands things look different. Take a memory operand, like:
INC DWORD PTR [ EAX ]
Here the low-level decomposition would look more like:
uOp.load tmp_reg, [ EAX ]
uOp.inc tmp_reg
uOp.store [ EAX ], tmp_reg
and is therefore not executed atomically: another core can touch the same memory location between the load and the store. If, on the other hand, you add the prefix and write LOCK INC [ EAX ], that tells the "compile" stage of the pipeline to decompose the instruction differently, so that the atomicity requirement is met.
The reason for this is, as others have mentioned, speed: why make something atomic, and necessarily slower, if that is not always required?
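To make the lost update concrete, here is a minimal sketch in C that plays the three steps by hand for two cores. It is only a model, not how the hardware is implemented: the names core_t, uop_load, uop_inc, uop_store and the two inc_twice_* functions are made up for illustration, and the "locked" variant simply runs each core's three steps back to back to stand in for what the LOCK prefix guarantees.

```c
/* Hypothetical model of one core: tmp_reg stands in for the internal
   temporary register used by the decomposed INC [mem]. */
typedef struct { int tmp_reg; } core_t;

static void uop_load(core_t *c, const int *mem)  { c->tmp_reg = *mem; }
static void uop_inc(core_t *c)                   { c->tmp_reg += 1; }
static void uop_store(const core_t *c, int *mem) { *mem = c->tmp_reg; }

/* Unlocked case: the steps of two cores interleave, so both cores load
   the old value before either one stores. One increment is lost. */
static int inc_twice_interleaved(int start) {
    int mem = start;
    core_t a, b;
    uop_load(&a, &mem);
    uop_load(&b, &mem);
    uop_inc(&a);
    uop_inc(&b);
    uop_store(&a, &mem);
    uop_store(&b, &mem);
    return mem;
}

/* Locked case (modelled): each core's load/inc/store runs as one
   indivisible unit, so the second core sees the first core's result. */
static int inc_twice_locked(int start) {
    int mem = start;
    core_t a, b;
    uop_load(&a, &mem); uop_inc(&a); uop_store(&a, &mem);
    uop_load(&b, &mem); uop_inc(&b); uop_store(&b, &mem);
    return mem;
}
```

Starting from 0, inc_twice_interleaved(0) returns 1 while inc_twice_locked(0) returns 2. In real C code the equivalent opt-in is atomic_fetch_add from <stdatomic.h>, which compilers targeting x86 typically implement with a LOCK-prefixed instruction.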