Every modern OS today provides some atomic operations:
Windows has the Interlocked* API
I recently implemented such a thing and was confronted with the same difficulties as you are. My solution was basically the following:
cmpxchg with __asm__ for the other architectures (ARM is a bit more complicated than that). Just do that for one possible size, e.g. sizeof(int).
Wrap it in inline functions.
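
To make the cmpxchg idea concrete, here is a rough sketch of what such an inline function could look like for sizeof(int) on x86 with GCC-style inline assembly. The names my_cas_int and my_fetch_add_int are made up for illustration and are not from my actual implementation; the fallback to GCC's __sync_val_compare_and_swap builtin on non-x86 targets is likewise just an assumption to keep the sketch compilable there.

```c
/* Hypothetical helper: atomically replace *ptr with 'desired' if it
 * currently holds 'expected'. Returns the value seen in *ptr before
 * the operation, so the swap succeeded iff the result == expected. */
static inline int my_cas_int(volatile int *ptr, int expected, int desired)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int prev;
    /* cmpxchg compares EAX with *ptr; on a match it stores 'desired',
     * otherwise it loads *ptr into EAX. 'lock' makes it atomic on SMP. */
    __asm__ __volatile__(
        "lock cmpxchgl %2, %1"
        : "=a"(prev), "+m"(*ptr)
        : "r"(desired), "0"(expected)
        : "memory", "cc");
    return prev;
#else
    /* Assumed fallback for other targets: GCC's legacy builtin. */
    return __sync_val_compare_and_swap(ptr, expected, desired);
#endif
}

/* Hypothetical example of building another operation on top of the
 * single CAS primitive: atomic fetch-and-add via a retry loop. */
static inline int my_fetch_add_int(volatile int *ptr, int delta)
{
    int old;
    do {
        old = *ptr;
    } while (my_cas_int(ptr, old, old + delta) != old);
    return old;
}
```

With only the one compare-and-swap written per architecture, everything else (increment, exchange, and so on) can be a small inline retry loop like my_fetch_add_int, which keeps the architecture-specific part to a minimum.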