I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur).
One thread per pixel row is insane. It is better to have around n-1 to 2n threads (for n CPUs) and make each one loop, fetching one job unit at a time (a job unit may be one row, or some other kind of partition).
On Unix-like systems, use pthreads; it's simple and lightweight.
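As a rough sketch of that loop with pthreads: a fixed number of workers share a row counter behind a mutex, and each worker keeps grabbing the next unprocessed row until none are left. The `Image` and `Job` structs, `blur_row`, `blur_image` and `MAX_THREADS` are all made up for illustration (a one-byte-per-pixel greyscale image and a simple 3-tap horizontal blur); only the `pthread_*` calls are real API. Since the input and output buffers are separate, the workers never write to anything another worker reads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 64  /* arbitrary cap, chosen for this sketch */

/* Hypothetical greyscale image: one byte per pixel, rows stored contiguously. */
typedef struct {
    int width, height;
    uint8_t *pixels;
} Image;

/* State shared by all workers; next_row plays the role of the job queue. */
typedef struct {
    const Image *in;
    Image *out;
    int next_row;
    pthread_mutex_t lock;
} Job;

/* Hypothetical job unit: 3-tap horizontal box blur of one row, edges clamped. */
static void blur_row(const Image *in, Image *out, int y)
{
    const uint8_t *src = in->pixels + (size_t)y * in->width;
    uint8_t *dst = out->pixels + (size_t)y * out->width;
    for (int x = 0; x < in->width; x++) {
        int l = x > 0 ? x - 1 : x;
        int r = x < in->width - 1 ? x + 1 : x;
        dst[x] = (uint8_t)((src[l] + src[x] + src[r]) / 3);
    }
}

/* Each worker loops: take the next row index under the lock, then process it. */
static void *worker(void *arg)
{
    Job *job = arg;
    for (;;) {
        pthread_mutex_lock(&job->lock);
        int y = job->next_row++;
        pthread_mutex_unlock(&job->lock);
        if (y >= job->in->height)
            return NULL;  /* no rows left */
        blur_row(job->in, job->out, y);
    }
}

/* Returns 0 on success, -1 if no worker thread could be started. */
int blur_image(const Image *in, Image *out, int nthreads)
{
    assert(in->width == out->width && in->height == out->height);
    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    Job job = { .in = in, .out = out, .next_row = 0 };
    pthread_mutex_init(&job.lock, NULL);

    pthread_t tid[MAX_THREADS];
    int started = 0;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[started], NULL, worker, &job) == 0)
            started++;
    }
    /* Whichever workers did start will drain all rows between them. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&job.lock);
    return started > 0 ? 0 : -1;
}
```

Compile with `-pthread`. For `nthreads`, something like `sysconf(_SC_NPROCESSORS_ONLN)` from `<unistd.h>` gives the CPU count on Linux and the BSDs, though POSIX does not require it. If one row per fetch turns out to be too fine-grained, the same loop works with `next_row` advanced by a block of rows instead of one.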