I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur).
It is very possible, that bottleneck is not CPU but memory bandwidth, so multi-threading WON'T help a lot. Try to minimize memory access and work on limited memory blocks, so that more data can be cached. I had a similar problem a while ago and I decided to optimize my code to use SSE instructions. Speed increase was almost 4x per single thread!