Testing parallel_for_ performance in OpenCV

忘掉有多难 2021-01-06 00:59

I tested parallel_for_ in OpenCV by comparing it with a plain serial loop for simple array summation and multiplication.

I have an array of 100 integers

1 Answer
  • 2021-01-06 01:35

    Let me offer a few considerations:

    Accuracy

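    The clock() function is not accurate at all. Its tick is roughly 1 / CLOCKS_PER_SEC seconds, but how often the value is updated, and whether the updates are uniform, depends on the system and the implementation. Note also that the C standard defines clock() as processor time, so on typical POSIX systems it adds up the CPU time of all threads, which can make a parallel loop look slower than it really is. See this post for more details about that.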

    Better alternatives to measure time:

    • This post for Windows.
    • This article for *nix.
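    As a portable alternative to the platform-specific timers above, here is a minimal sketch based on std::chrono::steady_clock from the C++11 standard library. The helper name time_once_ns is hypothetical (it is not part of OpenCV or of the original question), and the real resolution of the clock still depends on the platform.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the original post): returns the elapsed
// wall-clock time of a single call to `work`, in nanoseconds.
template <typename F>
long long time_once_ns(F&& work) {
    // steady_clock is monotonic, so it is not affected by system clock changes.
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // guaranteed by a monotonic clock
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

    Usage would look like time_once_ns([&] { /* serial or parallel loop */ }). A single reading like this is still noisy, so it should not be trusted on its own.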

    Trials and Test Environment

    Measurements are always affected by errors. A performance measurement of your code is affected by other programs, the cache, operating system jobs, scheduling and user activity (a short list, there is much more than that). To get a better measurement you have to repeat it many times (say 1000 or more) and then calculate the average. Moreover, you should prepare your test environment to be as clean as possible.
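    The repeat-and-average advice can be sketched as follows. The names TimingStats and benchmark are hypothetical, the default of 1000 trials simply mirrors the figure suggested above, and the untimed warm-up call and the reported minimum are additions of this sketch, not something the original answer prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical result type: mean and minimum time per trial, in nanoseconds.
struct TimingStats {
    double mean_ns;
    double min_ns;
};

// Runs `work` once untimed (warm-up), then `trials` timed runs.
template <typename F>
TimingStats benchmark(F&& work, std::size_t trials = 1000) {
    assert(trials > 0);
    using clock = std::chrono::steady_clock;
    work();  // warm-up: fills caches, not included in the statistics
    double total = 0.0;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t t = 0; t < trials; ++t) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(stop - start).count();
        total += ns;
        best = std::min(best, ns);
    }
    return {total / static_cast<double>(trials), best};
}
```

    Comparing mean_ns with min_ns gives a rough idea of how much noise the environment adds: a large gap suggests the machine was busy with something else during the test.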

    More details about testing can be found in these posts:

    • How do I write a correct micro-benchmark in Java?
    • NAS Parallel Benchmarks
    • Visual C++ 11 Beta Benchmark of Parallel Loops (for code examples)
    • Great articles by Eric Lippert about benchmarking (they are about C#, but most of the advice applies directly to any benchmark): C# Performance Benchmark Mistakes (part II).

    Overhead and Scalability

    In your case the overhead of parallel execution (and of your test code structure) is much higher than the cost of the loop body itself. In this case it's not productive to make the algorithm parallel. Parallel execution must always be evaluated in a specific scenario, measured and compared. It's not a magic medicine that speeds up everything. Take a look at this article about How to Quantify Scalability.

    Just as an example, if you have to sum or multiply 100 numbers, it's better to use SIMD instructions (even better within an unrolled loop).
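    A minimal sketch of the unrolled-loop idea, in plain C++ rather than with explicit SIMD intrinsics. The function name sum_unrolled and the unroll factor of four are assumptions of this example. Whether the compiler turns this into SIMD instructions depends on the compiler and its optimization flags, so check the generated code or measure.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sums `n` integers with a 4-way unrolled loop.
// Four independent accumulators remove the dependency between consecutive
// additions, which makes the loop easier for the compiler to vectorize.
long long sum_unrolled(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    long long total = s0 + s1 + s2 + s3;
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        total += data[i];
    }
    return total;
}
```

    For an array of 100 integers this runs 25 unrolled iterations and no thread is ever started, so there is no scheduling overhead to pay back.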

    Measure It!

    Try making your loop body empty (or have it execute a single NOP or a volatile write so it won't be optimized away). You'll roughly measure the overhead. Now compare it with your results.
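    One way to set up that measurement is sketched here, with the dispatch mechanism passed in as a callable so the same harness can time a serial baseline and a parallel runner. All names (g_sink, empty_body, serial_runner, mean_dispatch_ns) are hypothetical, and the block deliberately uses only the standard library, so it contains no OpenCV calls.

```cpp
#include <cassert>
#include <chrono>

// Target of the volatile write that keeps the loop from being optimized away.
volatile int g_sink = 0;

// Near-empty loop body: one volatile store per index in [begin, end).
inline void empty_body(int begin, int end) {
    for (int i = begin; i < end; ++i) {
        g_sink = i;
    }
}

// Baseline runner: executes the whole range serially in the calling thread.
inline void serial_runner(int begin, int end, void (*body)(int, int)) {
    body(begin, end);
}

// Dispatches the near-empty body over [0, n) through `runner` `trials` times
// and returns the mean wall-clock time per dispatch, in nanoseconds.
template <typename Runner>
double mean_dispatch_ns(Runner&& runner, int n, int trials = 1000) {
    assert(n > 0 && trials > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < trials; ++t) {
        runner(0, n, empty_body);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / trials;
}
```

    To estimate the parallel_for_ overhead, one would pass a second runner that forwards the range to cv::parallel_for_ (for example by wrapping cv::Range(begin, end) and calling body(r.start, r.end) for each sub-range it receives) and subtract the serial_runner result from it. That wrapper is left out here because it is untested in this sketch.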

    Notes About This Test

    IMO this kind of test is pretty useless. You can't compare serial and parallel execution in a generic way. It's something you should always check against a specific situation (in the real world many things come into play, synchronization for example).

    Imagine: you make your loop body really "heavy" and you see a big speedup with parallel execution. Then you make your real program parallel and you see that performance is worse. Why? Because parallel execution is slowed down by locks, by cache problems, or by serialized access to a shared resource.

    The test itself is meaningless unless you're testing your specific code in your specific situation (because too many factors come into play and you just can't ignore them). What does that mean? That you can compare only what you tested: if your program performs total *= buffertoClip[i]; then your results are reliable. If your real program does something else, then you have to repeat the tests with that something else.
