I'm trying to optimize an algorithm (Lattice Boltzmann) for parallel computing using C++ AMP, and I'm looking for suggestions to optimize the memory layout. So far I have only found
some small generic tips:
Any data structure that is shared across multiple processors should be read only.
Any data structure that requires modification should be unique to one processor and should not share memory locality (for example, a cache line) with data that another processor needs.
Make sure your memory is arranged so that your code scans serially through it (not taking huge steps or jumping around).
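
To make the three tips concrete, here is a minimal sketch in plain C++ (on the CPU, not C++ AMP itself). Everything in it is an assumption for illustration: the D2Q9 model with 9 values per cell, the names Lattice, compute_density and WorkerSum, and the 64-byte cache line size. It stores the distribution values direction by direction (structure of arrays), so the inner loop walks through memory serially, reads the shared lattice only through a const reference, and pads per-worker data to its own cache line.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: D2Q9 model, i.e. 9 distribution values per lattice cell.
constexpr std::size_t kQ = 9;

// Structure-of-arrays storage: all cells for direction 0, then direction 1, ...
struct Lattice {
    std::size_t nx, ny;
    std::vector<double> f;  // kQ * nx * ny values

    Lattice(std::size_t nx_, std::size_t ny_, double init)
        : nx(nx_), ny(ny_), f(kQ * nx_ * ny_, init) {}

    // For a fixed direction q, neighbouring x positions are adjacent in memory.
    std::size_t index(std::size_t q, std::size_t x, std::size_t y) const {
        assert(q < kQ && x < nx && y < ny);
        return (q * ny + y) * nx + x;
    }
};

// The lattice is shared and read-only (const); results go to a separate buffer.
// The innermost loop runs over x, so memory is scanned with stride 1.
void compute_density(const Lattice& lat, std::vector<double>& rho) {
    assert(rho.size() == lat.nx * lat.ny);
    std::fill(rho.begin(), rho.end(), 0.0);
    for (std::size_t q = 0; q < kQ; ++q)
        for (std::size_t y = 0; y < lat.ny; ++y)
            for (std::size_t x = 0; x < lat.nx; ++x)
                rho[y * lat.nx + x] += lat.f[lat.index(q, x, y)];
}

// Per-worker writable data, padded so two workers never share a cache line.
// Assumption: 64-byte cache lines; adjust for the target hardware.
struct alignas(64) WorkerSum {
    double value = 0.0;
};
```

The same indexing idea should carry over to a GPU, where neighbouring threads reading neighbouring addresses matters even more, but I have not measured this with C++ AMP.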