I hear frequently that accessing a shared memory segment between processes carries no performance penalty compared to threads accessing ordinary process memory. In other words, a thread touching a shared mapping is supposed to be just as fast as a thread touching its own heap.
If one considers what is happening at the microelectronics level when two threads or processes access the same memory, there are some interesting consequences.
The point of interest is how the CPU's architecture lets multiple cores (and thus threads and processes) access the same memory. This is done through the L1 caches, then L2, L3, and finally DRAM, with a cache coherency protocol (MESI or a variant) keeping them all consistent. An awful lot of coordination has to go on between the controllers of all of that.
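To see what that coordination costs, here is a minimal sketch (my illustration, not part of the original claim) of a classic false-sharing benchmark: two threads increment counters that sit on the same cache line, then counters padded onto separate lines. The 64-byte line size and the iteration count are assumptions about typical hardware.

    /* cc -O2 -pthread false_sharing.c */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000UL

    /* Two counters 8 bytes apart: almost certainly on one cache line. */
    struct { volatile uint64_t a, b; } same_line;
    /* Two counters 72 bytes apart: guaranteed to be on different lines. */
    struct { volatile uint64_t a; char pad[64]; volatile uint64_t b; } padded;

    static void *bump(void *p) {
        volatile uint64_t *c = p;
        for (uint64_t i = 0; i < ITERS; i++) (*c)++;
        return NULL;
    }

    static double run(volatile uint64_t *x, volatile uint64_t *y) {
        pthread_t t1, t2;
        struct timespec s, e;
        clock_gettime(CLOCK_MONOTONIC, &s);
        pthread_create(&t1, NULL, bump, (void *)x);
        pthread_create(&t2, NULL, bump, (void *)y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &e);
        return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
    }

    int main(void) {
        /* Neither thread reads the other's counter, yet the same-line run
         * is typically several times slower: the cores spend their time
         * passing ownership of the cache line back and forth. */
        printf("same cache line: %.2fs\n", run(&same_line.a, &same_line.b));
        printf("separate lines:  %.2fs\n", run(&padded.a, &padded.b));
        return 0;
    }

Neither thread ever touches the other's data; the slowdown in the first run is pure coherence traffic.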
For a machine with two or more CPUs (sockets), that coordination takes place over a serial interconnect between the sockets (QPI/UPI on Intel, HyperTransport on AMD). If one compares the interconnect traffic generated when two cores access the same memory against the traffic generated when the data is simply copied to another piece of memory, it is about the same amount.
So depending on where in the machine the two threads are running, there can be little speed penalty in copying the data versus sharing it.
Copying might be 1) a memcpy, 2) a pipe write, or 3) an internal DMA transfer (recent Intel chips can do this with their on-die I/OAT/QuickData DMA engine).
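To make option 2 concrete, here is a rough sketch of a pipe-based copy between a parent and a child process (the 4 KB payload size is arbitrary, and error handling is pared down for brevity):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fds[2];
        char buf[4096] = "some payload";

        if (pipe(fds) == -1) { perror("pipe"); exit(1); }

        if (fork() == 0) {                 /* child: the receiver */
            char in[4096];
            close(fds[1]);
            ssize_t n = read(fds[0], in, sizeof in); /* kernel copies in */
            printf("child got %zd bytes: %s\n", n, in);
            _exit(0);
        }

        close(fds[0]);
        write(fds[1], buf, sizeof buf);    /* kernel copies out */
        close(fds[1]);
        wait(NULL);
        return 0;
    }

The bytes are copied twice (into the kernel's pipe buffer and back out), but neither process ever maps the other's memory, so none of the coherence-sensitive sharing above takes place.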
An internal DMA is interesting because it consumes essentially zero CPU time beyond setting up the transfer (a naive memcpy is just a loop, and that loop genuinely takes CPU time). So if one can copy data instead of sharing it, and the copy is done by an internal DMA, it can be just as fast as sharing the data.
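For reference, this is all a naive memcpy is; the point is that the loop runs on the core itself, one load and one store at a time:

    #include <stddef.h>

    /* The core performs every load and store, so the copy consumes CPU
     * time proportional to the amount of data moved. This per-byte work
     * is exactly what a DMA engine takes off the CPU. */
    void naive_memcpy(char *dst, const char *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }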
The penalty is more RAM, but the payback is that things like Actor model programming come into play. That is a way to remove from your program all the complexity of guarding shared memory with semaphores.
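Here is a hedged sketch of that idea (my illustration, not any particular actor library): the child process is an actor that owns its counter outright, and the parent can only influence it by copying messages down a pipe, so there is nothing for a semaphore to guard.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    struct msg { int add; };            /* illustrative message type */

    int main(void) {
        int fds[2];
        if (pipe(fds) == -1) { perror("pipe"); exit(1); }

        if (fork() == 0) {              /* the actor */
            long counter = 0;           /* private state, never shared */
            struct msg m;
            close(fds[1]);
            while (read(fds[0], &m, sizeof m) == sizeof m)
                counter += m.add;       /* mutated only on message receipt */
            printf("actor's final count: %ld\n", counter);
            _exit(0);
        }

        close(fds[0]);
        for (int i = 1; i <= 5; i++) {  /* each message is a copy, not a share */
            struct msg m = { .add = i };
            write(fds[1], &m, sizeof m);
        }
        close(fds[1]);                  /* EOF tells the actor to stop */
        wait(NULL);
        return 0;
    }

The actor prints 15. No locks appear anywhere, because no two contexts ever hold the same mutable memory.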