GPU with OpenCL is slower than CPU. Why?

こ雲淡風輕ζ submitted on 2021-01-29 09:08:18

Question


Environment:

  • Intel i7-9750H
  • Intel UHD Graphics 630
  • Nvidia GTX 1050 (Laptop)
  • Visual Studio 2019 / C++
  • OpenCV 4.4
  • OpenCL 3.0 (Intel) / 1.2 (Nvidia)

I'm trying to use OpenCL to speed up my code, but my measurements show that the CPU is as fast as or faster than the GPU. How can I speed up my code?

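#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>

void GetHoughLines(cv::Mat dst) {
    // Enable OpenCV's OpenCL (T-API) path for UMat operations.
    cv::ocl::setUseOpenCL(true);

    int img_w = dst.size().width; // 5000
    int img_h = dst.size().height; // 4000

    // Wrap the input Mat as a UMat and create a zero-filled 8-bit accumulator.
    cv::UMat tmp_dst = dst.getUMat(cv::ACCESS_READ);
    cv::UMat tmp_mat = cv::UMat(dst.size(), CV_8UC1, cv::Scalar(0));

    // Element-wise multiply, repeated 1000 times; mul() returns a new UMat each time.
    for (size_t i = 0; i < 1000; i++)
    {
        tmp_mat = tmp_mat.mul(tmp_dst);
    }
}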

It took about 3000 ms when I used only the CPU. With the Intel UHD Graphics 630 it took 3500 ms, and with the GTX 1050 it took about 3000 ms.

Please give me some ideas to speed it up. I need to get it down to 1000 ms or less. Should I use AMP or OpenMP? As far as I know, they can only handle simple operations and are not suitable for OpenCV functions.


Answer 1:


Basically, your code is slow because the way OpenCV uses OpenCL is inefficient. It has nothing to do with the underlying hardware.

For OpenCL code (or any GPU-related code, for that matter) to be efficient, it is crucial that the host-side code uses the GPU properly. To name a few principles:

  • Saturate the GPU by asynchronously enqueuing many computations (kernels).
  • Avoid unnecessary synchronizations.
  • Avoid unnecessary memory copies between host CPU and GPU device.

Even if you write the most optimized GPU kernels, you are very unlikely to gain any performance boost if you fail to adhere to these basics.
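As a rough illustration of the first two points, here is a minimal host-side sketch that enqueues a batch of work and synchronizes only once at the end, which is also a fair way to time asynchronous GPU work. `average_ms`, `work` and `sync` are hypothetical names introduced for this example and are not part of OpenCV or OpenCL. With OpenCV's OpenCL path, `sync` could plausibly be a call to `cv::ocl::finish()`, but treat that as an assumption to verify for your setup.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not an OpenCV or OpenCL API): returns the average
// wall-clock time per iteration in milliseconds.
//   work   - enqueues one unit of work (for example, one cv::multiply call)
//   sync   - blocks until all previously enqueued work has finished
//   iters  - number of timed iterations
//   warmup - untimed iterations, so one-time costs (such as kernel
//            compilation) are not counted
double average_ms(const std::function<void()>& work,
                  const std::function<void()>& sync,
                  int iters, int warmup = 1)
{
    if (iters <= 0) return 0.0;

    for (int i = 0; i < warmup; ++i) work();
    sync();  // make sure warm-up work is done before the clock starts

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) work();  // enqueue everything, no sync in between
    sync();                                  // a single synchronization at the end
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iters;
}
```

Calling `sync` inside the loop instead would force the host to wait after every iteration, which is exactly the kind of unnecessary synchronization the list above warns about.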

The OpenCV codebase is a great example of how not to adhere to these principles.

As for your example, if you rewrite your code to avoid memory copies and allocate device memory explicitly, you might see reasonable performance:

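// Frame size and type taken from the question: 5000 x 4000, 8-bit single channel.
const cv::Size size(5000, 4000);
const int format = CV_8UC1;

// Allocate all three buffers in device memory up front.
auto frame1 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
auto frame2 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
auto frame3 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);

for (size_t i = 0; i < 10; i++)
{
    // The output buffer is reused instead of allocating a new UMat per iteration.
    cv::multiply(frame1, frame2, frame3);
}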
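To get a feel for the scale involved, the data volume per multiply can be estimated from the frame size given in the question. This is a back-of-the-envelope sketch only: it assumes each call reads two frames and writes one, with no caching and no extra copies, and all names in it are hypothetical.

```cpp
#include <cstdint>

// Frame geometry from the question: 5000 x 4000 pixels, CV_8UC1 (1 byte per pixel).
constexpr std::int64_t kWidth = 5000;
constexpr std::int64_t kHeight = 4000;
constexpr std::int64_t kBytesPerPixel = 1;

// Assumption: one multiply reads two frames and writes one frame.
constexpr std::int64_t kBuffersPerCall = 3;

// Minimum number of bytes moved by a single multiply under the assumption above.
constexpr std::int64_t bytes_per_call()
{
    return kWidth * kHeight * kBytesPerPixel * kBuffersPerCall;
}

// Throughput in GB/s (1 GB = 1e9 bytes) needed to finish `calls` multiplies
// within `ms` milliseconds.
constexpr double required_gb_per_s(std::int64_t calls, double ms)
{
    return static_cast<double>(bytes_per_call() * calls) / 1e9 / (ms / 1000.0);
}
```

Under these assumptions, one call moves 60 MB, so the 1000 iterations in the question move about 60 GB. The reported 3000 ms then corresponds to roughly 20 GB/s (`required_gb_per_s(1000, 3000.0)`), and the 1000 ms target to roughly 60 GB/s (`required_gb_per_s(1000, 1000.0)`).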

But in any case, I recommend learning to use the OpenCL API directly, without OpenCV.



Source: https://stackoverflow.com/questions/64907132/gpu-with-opencl-is-slower-than-cpu-why
