Are some implementations better than others for specific applications? Is there anything to earn by rolling out your own?
A bit of assembly to demonstrate locking atomically:
; BL is the mutex id
; shared_val, a memory address
CMP [shared_val],BL ; Perhaps it is locked to us anyway
JZ .OutLoop2
.Loop1:
CMP [shared_val],0xFF ; Free
JZ .OutLoop1 ; Yes
pause ; equal to rep nop.
JMP .Loop1 ; Else, retry
.OutLoop1:
; Lock is free, grab it
MOV AL,0xFF
LOCK CMPXCHG [shared_val],BL
JNZ .Loop1 ; Write failed
.OutLoop2: ; Lock Acquired