I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the
A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. Ita easy to implement and lends itself to the the GPUs strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ awhile back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)