I hear frequently that accessing a shared memory segment between processes has no performance penalty compared to accessing process memory between threads. In other words, a
The cost of shared memory is proportional to the number of "meta" operations on it, that is, operations that change the mapping: allocation, deallocation, process exit, ...
The number of memory accesses does not play a role. An access to a shared segment is as fast as an access anywhere else.
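The CPU translates virtual addresses to physical ones through the page tables. At the hardware level it has no notion that a mapping is shared: a shared page is just another page-table entry pointing at a physical frame.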
If you follow best practice (which is to change the mapping rarely), you get essentially the same performance as with process-private memory.
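
To make the "map rarely, access freely" pattern concrete, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module. The function name `demo` and the 4096-byte size are just illustrative choices. For brevity the second mapping is attached from the same process; a different process would attach by name in the same way. It is not a benchmark, it only shows where the "meta" operations sit relative to the plain accesses.

```python
from multiprocessing import shared_memory


def demo(size: int = 4096) -> int:
    """Create a segment, attach a second mapping, and share data through it."""
    # "Meta" operation 1: create and map the segment (involves the kernel).
    owner = shared_memory.SharedMemory(create=True, size=size)
    try:
        # "Meta" operation 2: attach a second mapping by name,
        # the way another process would.
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            # From here on these are ordinary loads and stores on mapped
            # memory; no system call is made per access.
            for i in range(256):
                owner.buf[i] = i
            # Read the same bytes back through the other mapping.
            total = sum(peer.buf[i] for i in range(256))
        finally:
            # "Meta" operation 3: drop the second mapping.
            peer.close()
    finally:
        # "Meta" operations 4 and 5: drop the first mapping, remove the segment.
        owner.close()
        owner.unlink()
    return total


if __name__ == "__main__":
    print(demo())
```

In a real program you would do the create/attach step once at startup and the close/unlink step once at shutdown, so that everything in between is plain memory access.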