I was playing around with std::thread
and something weird popped up:
#include <thread>

int k = 0;

int main() {
    std::thread t1([]() { while (k < 1000000000) { k = k + 1; } });
    std::thread t2([]() { while (k < 1000000000) { k = k + 1; } });
    t1.join();
    t2.join();
    return 0;
}
When compiling the above code with no optimizations using clang++, I got the following benchmarks:
real 0m2.377s
user 0m4.688s
sys 0m0.005s
I then changed my code to the following, now using only one thread:
#include <thread>

int k = 0;

int main() {
    std::thread t1([]() { while (k < 1000000000) { k = k + 1; } });
    t1.join();
    return 0;
}
And these were the new benchmarks:
real 0m2.304s
user 0m2.298s
sys 0m0.003s
Why is the code utilizing 2 threads slower than the code utilizing 1?
You have two threads fighting over the same variable, k. So you are spending time where the processors say "Processor 1: Hey, do you know what value k has? Processor 2: Sure, here you go!", ping-ponging back and forth every few updates. Since k isn't atomic, there's also no guarantee that thread 2 doesn't write an "old" value of k, so that the next time thread 1 reads the value, it jumps back 1, 2, 10 or 100 steps and has to do it over again. In theory that could lead to neither of the loops ever finishing, but that would require quite a bit of bad luck.
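As a sketch (not part of the original answer), here is the questioner's loop with k made atomic. This removes the data race and the lost updates, but both threads still hammer the same cache line, so it will not make two threads faster than one:

#include <atomic>
#include <thread>

std::atomic<int> k(0);

int main() {
    // fetch_add is an indivisible read-modify-write, so no increment
    // can be lost; the counter may still overshoot the limit slightly
    // when both threads pass the check at the same time.
    auto body = []() {
        while (k.load() < 1000000000) {
            k.fetch_add(1);
        }
    };
    std::thread t1(body);
    std::thread t2(body);
    t1.join();
    t2.join();
    return 0;
}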
This should really be a comment in reply to Mats Petersson's answer, but I wanted to supply code examples.
The problem is contention for a specific resource, and also for a cache line.
Alternative 1:
#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&k]() { // capture k by reference so we all use the same k.
            while (k < ITERATIONS) {
                k++;
            }
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here the threads contend for a single variable, performing both reads and writes, which forces the cache line to ping-pong between cores, causing contention and making the single-threaded case the most efficient.
Alternative 2:

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    std::atomic<uint64_t> k(0); // direct-init: atomics are not copy-initializable before C++17
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&]() {
            // Imperfect division of labor, we'll fall short in some cases.
            for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
                k++;
            }
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here we divide the labor deterministically (we fall afoul of cases where numThreads is not a divisor of ITERATIONS, but it's close enough for this demonstration). Unfortunately, every increment still contends for access to the shared atomic in memory.
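A hedged aside, not in the original answer: atomic operations default to sequentially consistent ordering, which is stronger than a plain counter needs. A relaxed fetch_add is a drop-in replacement for the loop body above and can be cheaper on some architectures, though it does nothing about the cache-line ping-pong itself:

// Same loop as above, but with relaxed ordering: the increment is
// still an indivisible read-modify-write, it just imposes no ordering
// constraints on surrounding memory operations.
for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
    k.fetch_add(1, std::memory_order_relaxed);
}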
Alternative 3:

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    // One counter per thread; size the vector up front so ks[t] is valid
    // when the threads index into it.
    std::vector<uint64_t> ks(numThreads, 0);
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &ks]() {
            auto& k = ks[t];
            // Imperfect division of labor, we'll fall short in some cases.
            for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
                k++;
            }
        });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += ks[t];
    }

    return 0;
}
Again this is deterministic about the distribution of the workload, and we spend a small amount of effort at the end to collate the results. However, we did nothing to keep the counters apart in memory: adjacent elements of ks can sit on the same cache line, so the threads still suffer false sharing. For that:
Alternative 4:

#include <cstdint>
#include <thread>
#include <vector>
#include <mutex>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;
#define CACHE_LINE_SIZE 128

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;
    std::mutex kMutex;
    uint64_t k = 0;

    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &k, &kMutex]() { // a mutex cannot be captured by copy
            alignas(CACHE_LINE_SIZE) uint64_t myK = 0;

            // Imperfect division of labor, we'll fall short in some cases.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                myK++;
            }

            kMutex.lock();
            k += myK;
            kMutex.unlock();
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here we avoid contention between threads down to the cache line level, except for the single case at the end where we use a mutex to control synchronization. For this trivial workload, the mutex is going to have one hell of a relative cost. Alternatively, you could use alignas to provide each thread with its own storage at the outer scope and summarize the results after the joins, eliminating the need for the mutex. I leave that as an exercise to the reader.
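One possible shape for that exercise (a sketch of my own, not the answerer's code, assuming a C++17 compiler so that dynamically allocated storage honors the extended alignment): give each thread its own cache-line-aligned counter in an array owned by main, then sum after the joins:

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;
#define CACHE_LINE_SIZE 128

// Each counter occupies its own cache line, so threads never share one.
struct alignas(CACHE_LINE_SIZE) PaddedCounter {
    uint64_t value = 0;
};

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;
    std::vector<PaddedCounter> counters(numThreads);

    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &counters]() {
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                counters[t].value++;
            }
        });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += counters[t].value; // safe to read without a mutex after join()
    }

    return 0;
}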
Seems to me like the more important question than "why didn't this work?" is "how do I get this to work?" For the task at hand, I think std::async (despite significant shortcomings) is really a better tool than using std::thread directly.
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int k = 0;
unsigned tasks = std::thread::hardware_concurrency();
unsigned reps = 1000000000 / tasks;

int main() {
    std::vector<std::future<int>> f;

    for (unsigned i = 0; i < tasks; i++)
        f.emplace_back(std::async(std::launch::async,
                                  []() { int j; for (j = 0; j < reps; j++); return j; }));

    for (unsigned i = 0; i < tasks; i++) {
        f[i].wait();
        k += f[i].get();
    }

    std::cout << k << "\n";
    return 0;
}
I ran into this problem myself. My opinion is that for certain kinds of jobs, the cost of managing threads may be more than the benefit you get from running them in threads. Here is my code example, doing some real work in a loop over a large number of iterations, so I got very consistent numbers with the time command.
pair<int, int> result{0, 0};
#ifdef USETHREAD
thread thread_l(&Myclass::trimLeft, this, std::ref(fsq), std::ref(oriencnt), std::ref(result.first));
thread thread_r(&Myclass::trimRight, this, std::ref(fsq), std::ref(oriencnt), std::ref(result.second));
thread_l.join();
thread_r.join();
#else
// non-threaded version is faster
trimLeft(fsq, oriencnt, result.first);
trimRight(fsq, oriencnt, result.second);
#endif
return result;
The time results:

          Thread      No_thread
===============================
real      4m28s       2m49s
user      0m55s       2m49s
sys       0m6.2s      0m0.012s
I am ignoring the decimals of the seconds for the large values. My code only updates one shared variable, oriencnt; I have not let it update fsq yet. It looks like in the threaded version the system is doing more work, which results in a longer wall clock time (real time). My compiler flags are the default -g -O2; I am not sure whether this is the key problem or not. When compiled with -O3 the difference is minimal. There are also some mutex-controlled IO operations, but my experiments show that they do not contribute to the difference. I am using gcc 5.4 with C++11. One possibility is that the library is not optimized.
Here are the same benchmarks compiled with -O3:

          Thread      No_thread
===============================
real      4m24s       2m44s
user      0m54s       2m44s
sys       0m6.2s      0m0.016s
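To put a rough number on "the cost of managing threads" (a sketch of my own, not part of this answer), you can time how long spawning and joining an idle thread takes; if the per-thread workload is not much larger than this overhead, threading will not pay off:

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    const int runs = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; i++) {
        std::thread t([]() {}); // thread that does no work at all
        t.join();
    }
    auto end = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "spawn+join overhead: " << (us / runs) << " us per thread\n";
    return 0;
}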
Source: https://stackoverflow.com/questions/17137115/2-threads-slower-than-1