Calculating the sum of a large vector in parallel

Is Boost.Asio suitable for this problem?

The main purpose of Boost.Asio is to provide an asynchronous model for network and I/O programming, and the problem you describe does not seem to have much to do with networking and I/O.

I think that the simplest solution is to use the threading primitives provided by either Boost or the C++ standard library.

A parallel algorithm

Here's an example of a parallel version of accumulate, implemented using only the standard library.

#include <algorithm>  // std::max
#include <cstddef>    // std::size_t
#include <future>     // std::async, std::future
#include <iterator>   // std::distance, std::next, std::advance
#include <numeric>    // std::accumulate
#include <thread>     // std::thread::hardware_concurrency
#include <vector>     // std::vector

/* Minimum number of elements for the multithreaded algorithm.
   Below this threshold the accumulation runs on a single thread. */
static const int MT_MIN_SIZE = 10000;

template <typename InputIt, typename T>
auto parallel_accumulate(InputIt first, InputIt last, T init) {
    // Determine total size.
    const auto size = std::distance(first, last);
    // Determine how many parts the work shall be split into.
    // Note: hardware_concurrency() may return 0, so fall back to 1 in that case.
    const auto parts = (size < MT_MIN_SIZE)
        ? 1 : std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::future<T>> futures;

    // For each part, calculate its size (splitting the range as evenly as
    // possible) and run accumulate on a separate thread.
    for (std::size_t i = 0; i != parts; ++i) {
        const auto part_size = (size * i + size) / parts - (size * i) / parts;
        futures.emplace_back(std::async(std::launch::async,
            [=] { return std::accumulate(first, std::next(first, part_size), T{}); }));
        std::advance(first, part_size);
    }

    // Wait for all threads to finish execution and accumulate the partial results.
    return std::accumulate(std::begin(futures), std::end(futures), init,
        [](T acc, std::future<T>& f) { return acc + f.get(); });
}
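
For reference, a minimal driver (a hypothetical sketch matching the timings below, which sum the integers 1 through 100,000,000) could look like this:

#include <iostream>

int main() {
    std::vector<long long> v(100'000'000);
    std::iota(v.begin(), v.end(), 1LL);  // fill with 1, 2, ..., 100'000'000
    // Prints 5000000050000000, matching the output below.
    std::cout << parallel_accumulate(v.begin(), v.end(), 0LL) << '\n';
}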

Live example (on Coliru the parallel version performs about the same as the sequential one, probably because only one core is available).

Timings

On my machine (using 8 threads), the parallel version gave, on average, a ~120% performance boost (from 46 ms down to 21 ms).

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
5000000050000000

However, the absolute gain for 100,000,000 elements is only marginal (25 ms), though the gain may be larger when accumulating an element type that is more expensive to add than int.

OpenMP

As @sehe points out in the comments, OpenMP may provide a simple solution to this problem, e.g.

#include <cstddef>  // std::size_t
#include <vector>

// Requires OpenMP support, e.g. compile with -fopenmp on GCC/Clang.
template <typename T, typename U>
auto omp_accumulate(const std::vector<T>& v, U init) {
    U sum = init;

    // Each thread accumulates into its own private copy of `sum`;
    // the copies are combined with `+` once the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < v.size(); i++) {
        sum += v[i];
    }

    return sum;
}
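
Usage is analogous; the only extra requirement is enabling OpenMP when compiling (a hypothetical driver, assuming GCC or Clang with the -fopenmp flag):

// Build with: g++ -O2 -fopenmp example.cpp  (or clang++ -O2 -fopenmp example.cpp)
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<long long> v(100'000'000);
    std::iota(v.begin(), v.end(), 1LL);                      // 1, 2, ..., 10^8
    std::cout << "Sum: " << omp_accumulate(v, 0LL) << '\n';  // Sum: 5000000050000000
}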

On my machine this method performed the same as the parallel method using standard thread primitives.

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
Sum: 5000000050000000
--------------------------------
OpenMP sum:
Time taken: 21 ms
Sum: 5000000050000000

sehe

You can use Boost.Asio as a thread pool. But there's not a lot of sense in it unless you have... asynchronous IO operations to coordinate.

In this answer to "c++ work queues with blocking" I show two thread_pool implementations:

  • Solution #1: one based on boost::asio::io_service
  • Solution #2: the other based on boost::thread primitives

Both accept any task compatible with the void() signature. This means you can wrap your function-that-returns-the-important-results in a packaged_task<...> and get the future<RetVal> from it.
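
Here's a minimal sketch of that idea, assuming the classic (pre-Boost-1.66) boost::asio::io_service API rather than the thread_pool classes from the linked answer. The packaged_task is held in a shared_ptr because packaged_task is move-only, while Asio handlers must be copyable:

#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <algorithm>
#include <future>
#include <iostream>
#include <memory>
#include <numeric>
#include <vector>

int main() {
    boost::asio::io_service io;
    // Keeps io.run() from returning while the task queue is empty.
    auto work = std::make_unique<boost::asio::io_service::work>(io);

    // A pool of threads servicing the io_service's task queue.
    boost::thread_group pool;
    const unsigned n = std::max(1u, boost::thread::hardware_concurrency());
    for (unsigned i = 0; i != n; ++i)
        pool.create_thread([&io] { io.run(); });

    std::vector<long long> v(100'000'000, 1);

    // Wrap the result-returning function in a packaged_task to get a future.
    auto task = std::make_shared<std::packaged_task<long long()>>(
        [&v] { return std::accumulate(v.begin(), v.end(), 0LL); });
    std::future<long long> result = task->get_future();

    // Post a copyable void() handler that invokes the task.
    io.post([task] { (*task)(); });

    std::cout << "Sum: " << result.get() << '\n';

    work.reset();     // let io.run() return once the queue drains
    pool.join_all();
}

Here only a single task is posted; in practice you would post one task per chunk of the vector, much like parallel_accumulate above does with std::async.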
