I need a very fast (in the sense "low cost for reader", not "low latency") change notification mechanism between threads in order to update a read cache:
You do have to use a memory fence here. Without the fence, there is no guarantee that updates will ever be seen on the other thread. In C++03 you have two options: use platform-specific ASM code (`mfence` on x86, `dmb` on ARMv7 and later) or use OS-provided atomic set/get functions.
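If a C++11 compiler is available, `std::atomic` gives you the fence portably, so here is a rough sketch of a version-counter notification built on it. Note that this swaps in C++11 atomics instead of the C++03 options above, and the names `ChangeNotifier`, `CacheReader`, `notify` and `changed` are made up for illustration. It assumes a single writer that updates the shared data first and then calls `notify()`.

```cpp
#include <atomic>

// Hypothetical shared state (C++11 or later): the writer bumps `version`
// after each update to the data the readers cache.
struct ChangeNotifier {
    std::atomic<unsigned> version{0};

    // Writer side: call AFTER the shared data has been updated.
    // The release ordering makes those earlier writes visible to any
    // reader that observes the new version with an acquire load.
    void notify() { version.fetch_add(1, std::memory_order_release); }
};

// Reader side: remembers the last version it has seen.
class CacheReader {
public:
    explicit CacheReader(const ChangeNotifier& n)
        : notifier_(n), seen_(n.version.load(std::memory_order_acquire)) {}

    // Returns true if the writer has published a change since the last call.
    // The common "nothing changed" path is one acquire load and one compare.
    bool changed() {
        unsigned current = notifier_.version.load(std::memory_order_acquire);
        if (current == seen_) return false;
        seen_ = current;
        return true;
    }

private:
    const ChangeNotifier& notifier_;
    unsigned seen_;
};
```

The reader polls `changed()` before using its cache and refreshes only when it returns true. On x86 an acquire load typically compiles to an ordinary load, which keeps the reader's cost low. This only signals that something changed: if the writer can modify the data again while a reader is refreshing, the data itself still needs its own protection.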