I'm using OpenCV 2.4.2 with VS2010 on a notebook.
I tried a simple test of the GPU module in OpenCV, but it showed the GPU code running 100 times slower than the CPU code.
In this
cvtColor isn't doing very much work: to make grey, all you have to do is average three numbers.
The cvtColor code on the CPU uses SSE2 instructions to process up to 8 pixels at once, and if you have TBB it uses all the cores/hyperthreads. The CPU is running at 10x the clock speed of the GPU, and finally you don't have to copy data onto the GPU and back.
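To make the "not very much work" point concrete, here is a minimal sketch of the per-pixel arithmetic in plain C++. It is not OpenCV's implementation: `bgrToGrey` is a made-up name, and the integer weights (77, 150, 29 out of 256) are my assumption, standing in for the usual 0.299 R + 0.587 G + 0.114 B weighted average.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an OpenCV function): converts interleaved
// 8-bit BGR pixels to 8-bit grey using a fixed-point weighted average.
// Assumed weights: 77/256 R + 150/256 G + 29/256 B.
std::vector<std::uint8_t> bgrToGrey(const std::vector<std::uint8_t>& bgr) {
    std::vector<std::uint8_t> grey(bgr.size() / 3);
    for (std::size_t i = 0; i < grey.size(); ++i) {
        const unsigned b = bgr[3 * i];
        const unsigned g = bgr[3 * i + 1];
        const unsigned r = bgr[3 * i + 2];
        // Three multiplies, three adds and a shift per pixel; +128 rounds.
        grey[i] = static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b + 128) >> 8);
    }
    return grey;
}
```

That is a handful of integer operations per pixel, so for a single image the time is likely dominated by moving the data around rather than by the arithmetic.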
cvtColor is a small operation, and any performance boost you get from doing it on the GPU is vastly outweighed by memory transfer times between host (CPU) and device (GPU). Minimizing the latency of this memory transfer is a primary challenge of any GPU computing.
Try running it more than once.
-----------excerpt from http://opencv.willowgarage.com/wiki/OpenCV%20GPU%20FAQ (Performance)
Why first function call is slow?
That is because of initialization overheads. On first GPU function call Cuda Runtime API is initialized implicitly. Also some GPU code is compiled (Just In Time compilation) for your video card on the first usage. So for performance measure, it is necessary to do dummy function call and only then perform time tests.
If it is critical for an application to run GPU code only once, it is possible to use a compilation cache which is persistent over multiple runs. Please read nvcc documentation for details (CUDA_DEVCODE_CACHE environment variable).
Most of the answers above are actually wrong. The reason it is slower by a factor of 20,000 is of course not that 'CPU clockspeed is faster' and 'it has to copy it to the GPU' (accepted answers). These are factors, but stopping there omits the fact that you have vastly more computing power for a problem that is embarrassingly parallel. Attributing a 20,000x performance difference to those factors is just plain ridiculous. The author knew something was wrong, and the cause is not straightforward. Solution:
Your problem is that CUDA needs to initialize! It always initializes on the first image, and that generally takes between 1 and 10 seconds, depending on the alignment of Jupiter and Mars. Now try this: do the computation twice and time both runs. You will probably see that the speeds are within the same order of magnitude, not 20,000x apart, which is ridiculous. Can you do something about this initialization? Nope, not that I know of. It's a snag.
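A rough sketch of the "run it once, then time it" idea, in case it helps. `timeAfterWarmupMs` is a made-up helper, not part of OpenCV or CUDA, and it assumes a compiler with `<chrono>` (VS2010 does not ship it, so there you would swap in something like `cv::getTickCount()` / `cv::getTickFrequency()`).

```cpp
#include <chrono>

// Hypothetical helper: calls `work` once without timing it, so that any
// one-time initialization cost is paid up front, then returns the mean
// wall-clock time in milliseconds over `runs` further calls.
template <typename F>
double timeAfterWarmupMs(F work, int runs = 10) {
    if (runs < 1) runs = 1;  // avoid dividing by zero below

    work();  // warm-up call, deliberately excluded from the measurement

    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count() / runs;
}
```

You would pass in a lambda or functor wrapping the call you care about (for example the GPU cvtColor call, with or without the upload and download, depending on what you want to measure) and do the same for the CPU version, then compare the two numbers.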
Edit: I just re-read the post. You say you're running on a notebook. Those often have shabby GPUs, and CPUs with a fair turbo.
What GPU do you have?
Check the compute capability; maybe that's the reason.
https://developer.nvidia.com/cuda-gpus
This means that for devices with CC 1.3 and 2.0 binary images are ready to run. For all newer platforms, the PTX code for 1.3 is JIT’ed to a binary image. For devices with CC 1.1 and 1.2, the PTX for 1.1 is JIT’ed. For devices with CC 1.0, no code is available and the functions throw Exception. For platforms where JIT compilation is performed first, the run is slow.
http://docs.opencv.org/modules/gpu/doc/introduction.html
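The rules in that excerpt can be written down as a small lookup. This is only a sketch of the quoted text, assuming the build configuration the excerpt describes; `codePathFor` and its return strings are made up for illustration and are not an OpenCV API. If I recall correctly, the real major/minor numbers can be read from `cv::gpu::DeviceInfo` (`majorVersion()`, `minorVersion()`).

```cpp
#include <string>

// Hypothetical helper mirroring the quoted rules; not part of OpenCV.
// Input: compute capability as major.minor (e.g. 2, 1 for CC 2.1).
std::string codePathFor(int major, int minor) {
    const int cc = major * 10 + minor;
    if (cc < 10)              return "unknown";      // not a valid CC
    if (cc == 10)             return "unsupported";  // no code; functions throw
    if (cc == 11 || cc == 12) return "jit-ptx-1.1";  // PTX 1.1 is JIT-compiled
    if (cc == 13 || cc == 20) return "binary";       // prebuilt binary, no JIT
    return "jit-ptx-1.3";                            // newer: PTX 1.3 is JIT-compiled
}
```

Anything that lands on one of the "jit" paths pays the compilation cost on first use, which is the slow first run described in the excerpt.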