I'm trying to optimize an algorithm (Lattice Boltzmann) for parallel computing using C++ AMP, and I'm looking for suggestions on how to optimize the memory layout. Just found out
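In general, make sure that data used by different CPUs is not shared (the easy part) and does not sit on the same cache line (false sharing; see, for example, "False Sharing is No Fun"). When two CPUs write to different variables on the same cache line, the line keeps bouncing between their caches even though no data is logically shared. Data used by the same CPU, on the other hand, should be close together in memory so that it benefits from the cache.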
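To make the false-sharing point concrete, here is a minimal sketch in plain C++17 with `std::thread` (not C++ AMP). The 64-byte cache-line size is an assumption (common on x86-64, but check your hardware), and `PaddedCounter` and `count_in_parallel` are hypothetical names made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical on x86-64, not universal).
constexpr std::size_t kCacheLine = 64;

// alignas rounds both the alignment and the size of the struct up to one
// cache line, so adjacent counters in an array never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

// Hypothetical helper: every thread increments only its own counter.
// Requires C++17 so that std::vector honors the over-aligned type.
std::uint64_t count_in_parallel(unsigned num_threads, std::uint64_t iterations) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                counters[t].value += 1;  // touched by thread t only
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    // Sum the per-thread results after all workers have finished.
    std::uint64_t total = 0;
    for (const auto& c : counters) {
        total += c.value;
    }
    return total;
}
```

Without the `alignas`, eight of these counters would fit on one 64-byte line and every increment could invalidate that line in the other cores' caches. The padding trades a bit of memory for independent lines. For the data a single thread works on, the opposite applies: keep it contiguous so that each fetched line is fully used.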