Every modern OS today provides some atomic operations, for example the Interlocked* API on Windows.
Darn. I was going to suggest the GCC primitives, then you said they were off limits. :-)
In that case, I would do an #ifdef for each architecture/compiler combination you care about and code up the inline asm. And maybe check for __GNUC__ or some similar macro and use the GCC primitives if they are available, because it feels so much more right to use those. :-)
You are going to have a lot of duplication and it might be difficult to verify correctness, but this seems to be the way a lot of projects do this, and I've had good results with it.
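To make that concrete, here is a rough sketch of the per-compiler dispatch I mean. The name `my_atomic_add_fetch` is made up for illustration. I'm assuming the legacy GCC `__sync` builtins and the MSVC `_InterlockedExchangeAdd` intrinsic, and I've left out the inline-asm branches you would add for each compiler without builtins.

```c
/* Hypothetical wrapper: atomically adds v to *p and returns the new value. */
#if defined(__GNUC__)
/* GCC and compatible compilers: use the builtin when it is available. */
static inline int my_atomic_add_fetch(volatile int *p, int v)
{
    return __sync_add_and_fetch(p, v);
}
#elif defined(_MSC_VER)
#include <intrin.h>
/* MSVC: the intrinsic returns the old value, so add v to get the new one.
   Assumes int and long are both 32 bits, as they are on Windows. */
static inline int my_atomic_add_fetch(volatile int *p, int v)
{
    return (int)_InterlockedExchangeAdd((volatile long *)p, (long)v) + v;
}
#else
/* Each remaining architecture/compiler pair would get its own inline asm here. */
#error "my_atomic_add_fetch: no implementation for this compiler"
#endif
```

The `#error` in the last branch is deliberate: a build failure on an unsupported platform is easier to notice than a silent fallback to a non-atomic add.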
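Some gotchas that have bitten me in the past: when using GCC inline asm, don't forget "asm volatile" and the clobbers for "memory" and "cc", etc. Without the "memory" clobber, the compiler is free to cache values in registers or reorder memory accesses across the asm statement.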
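As an illustration of those gotchas, here is a minimal sketch of an x86 fetch-and-add in GCC inline asm. The function name `xadd_int` is hypothetical, and the non-x86 branch just falls back to a GCC builtin so the snippet still compiles elsewhere; in real code that branch would be that architecture's own asm.

```c
/* Hypothetical helper: atomically adds v to *p and returns the previous value. */
static inline int xadd_int(volatile int *p, int v)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__(
        "lock; xaddl %0, %1"   /* *p += v, and v receives the old *p */
        : "+r"(v), "+m"(*p)    /* both operands are read and written */
        :
        : "memory", "cc");     /* compiler barrier; flags are modified */
    return v;
#elif defined(__GNUC__)
    /* Placeholder for other architectures, not inline asm. */
    return __sync_fetch_and_add(p, v);
#else
#error "xadd_int: no implementation for this compiler"
#endif
}
```

On x86, GCC already treats the flags as clobbered by every asm statement, so "cc" is redundant there, but listing it is harmless and keeps the habit for other targets.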