I am doing image processing in C that requires copying large chunks of data around memory - the source and destination never overlap.
What is the absolute fastest wa
If you're on Windows, use the DirectX APIs, which have GPU-optimized routines for graphics handling. How fast could it be? Your CPU isn't loaded at all, so it can do something else while the GPU munches the data.
If you want to be OS agnostic, try OpenGL.
Do not fiddle with assembler, because it is all too likely that you'll fail miserably to outperform library-writing software engineers with 10+ years of experience.
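
If the data has to stay in CPU memory, the same "trust the library" advice points at plain `memcpy`, which is valid here because the source and destination never overlap. Below is a minimal sketch, not a benchmarked solution: `copy_image` and its stride parameters are hypothetical names made up for illustration, and it assumes 8-bit pixel data with row strides given in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from any library): copy `rows` rows of
 * `row_bytes` bytes each from src to dst. Strides are in bytes and may
 * include padding. `restrict` documents the no-overlap guarantee that
 * memcpy already requires.
 */
void copy_image(uint8_t *restrict dst, size_t dst_stride,
                const uint8_t *restrict src, size_t src_stride,
                size_t row_bytes, size_t rows)
{
    assert(dst != NULL && src != NULL);
    assert(row_bytes <= dst_stride && row_bytes <= src_stride);

    /* Tightly packed on both sides: hand the whole block to memcpy at once. */
    if (dst_stride == row_bytes && src_stride == row_bytes) {
        memcpy(dst, src, row_bytes * rows);
        return;
    }

    /* Padded rows: copy only the pixel bytes of each row, skip the padding. */
    for (size_t y = 0; y < rows; ++y) {
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
    }
}
```

Measure this against your real image sizes before reaching for anything more exotic; the library routine is usually the baseline to beat.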