My code:
void build(std::vector<RKD<DivisionSpace>>& roots, ...) {
    try {
        // using a local lock_guard to lock mtx guarantees unlocking on destruction / exception:
        std::lock_guard<std::mutex> lck(mtx);
        roots.push_back(RKD<DivisionSpace>(...));
    } catch (const std::bad_alloc&) {
        std::cout << "[exception caught when constructing tree]\n";
        return;
    }
}
Now, as written, the actual work is done serially, not in parallel, because the lock is held while each RKD is constructed. The constructor of RKD can run in parallel with other constructors of RKD. However, pushing the objects back into the std::vector is a critical section, right?
The number of objects I am going to build is known. In practice it will be somewhere in the range [2, 16], though in theory it could be any positive number. Also, I am not interested in the order in which they are inserted into the container.
So I could do something like:
RKD tree = RKD(...);
mutex_lock(...);
roots.push_back(tree);
However, this would imply copying, wouldn't it? What should I do to make my code parallel?
I decided to use a lock_guard (instead of just locking and unlocking the mutex manually) because of this answer.
The suggestion that Tomasz Lewowski has brought up in his comment and I have expanded upon is pretty simple and based upon the following observation: A push_back on a std::vector potentially needs to re-allocate the backing store and copy (or, preferably, move) the elements. This constitutes a critical section that needs to be synchronized.
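Incidentally, this observation already suggests the most direct fix for the code in the question: do the expensive construction outside the critical section and hold the mutex only long enough to move the finished object in, so no copy is implied. A minimal sketch, with a hypothetical Tree standing in for RKD<DivisionSpace>:

#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical Tree stands in for RKD<DivisionSpace>; the point is only the
// locking pattern: expensive work outside the lock, a short move inside it.
struct Tree { /* imagine an expensive constructor here */ };

void build(std::vector<Tree>& roots, std::mutex& mtx) {
    Tree tree;                              // expensive work, runs in parallel
    std::lock_guard<std::mutex> lck(mtx);   // lock only around the insertion
    roots.push_back(std::move(tree));       // moves rather than copies
}

int main() {
    std::vector<Tree> roots;
    std::mutex mtx;
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(build, std::ref(roots), std::ref(mtx));
    for (auto& t : threads)
        t.join();
}

That said, the lock can be avoided entirely, as the rest of this answer shows.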
For the next examples, assume we want to have a vector filled with the first 12 primes but we don't care about their ordering. (I have just hard-coded the numbers here but assume they are obtained via some expensive computation that makes sense to do in parallel.) There is a dangerous race condition in the following scenario.
std::vector<int> numbers {}; // an empty vector
// thread A // thread B // thread C
numbers.push_back( 2); numbers.push_back(11); numbers.push_back(23);
numbers.push_back( 3); numbers.push_back(13); numbers.push_back(29);
numbers.push_back( 5); numbers.push_back(17); numbers.push_back(31);
numbers.push_back( 7); numbers.push_back(19); numbers.push_back(37);
There is also another problem with push_back. If two threads call it simultaneously, they will both attempt to construct an object at the same index with potentially disastrous consequences. So the problem is not solved by a reserve(n) before forking the threads.
However, since you know the number of elements in advance, you can simply assign them to a specific location inside a std::vector without changing its size. If you don't change the size, there is no critical section. Therefore, there is no race in the following scenario.
std::vector<int> numbers(12); // 12 elements initialized with 0
// thread A // thread B // thread C
numbers[ 0] = 2; numbers[ 1] = 3; numbers[ 2] = 5;
numbers[ 3] = 7; numbers[ 4] = 11; numbers[ 5] = 13;
numbers[ 6] = 17; numbers[ 7] = 19; numbers[ 8] = 23;
numbers[ 9] = 29; numbers[10] = 31; numbers[11] = 37;
Of course, if two threads attempt to write to the same index, the race will be there again. Fortunately, protecting against this is not difficult in practice. If your vector has n elements and you have p threads, thread i writes only to elements [i n / p, (i + 1) n / p). Note that this is preferable to having thread i write to the element at index j only if j mod p = i, because contiguous per-thread ranges lead to fewer cache invalidations (less false sharing). So the access pattern in the above example is sub-optimal and would be better written like this:
std::vector<int> numbers(12); // 12 elements initialized with 0
// thread A // thread B // thread C
numbers[ 0] = 2; numbers[ 4] = 11; numbers[ 8] = 23;
numbers[ 1] = 3; numbers[ 5] = 13; numbers[ 9] = 29;
numbers[ 2] = 5; numbers[ 6] = 17; numbers[10] = 31;
numbers[ 3] = 7; numbers[ 7] = 19; numbers[11] = 37;
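A runnable version of this pattern could look roughly like the following sketch; the nth_prime helper and the names n and p are mine, standing in for whatever expensive computation actually produces the values:

#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the expensive per-element computation; here it simply
// computes the (j + 1)-th prime by trial division.
int nth_prime(std::size_t j) {
    int count = 0;
    for (int candidate = 2; ; ++candidate) {
        bool prime = true;
        for (int d = 2; d * d <= candidate; ++d)
            if (candidate % d == 0) { prime = false; break; }
        if (prime && ++count == static_cast<int>(j) + 1)
            return candidate;
    }
}

int main() {
    const std::size_t n = 12;        // number of elements, known in advance
    const std::size_t p = 3;         // number of threads
    std::vector<int> numbers(n);     // size is fixed before any thread starts

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < p; ++i)
        threads.emplace_back([&numbers, i, n, p] {
            // thread i writes only to the contiguous range [i*n/p, (i+1)*n/p)
            for (std::size_t j = i * n / p; j < (i + 1) * n / p; ++j)
                numbers[j] = nth_prime(j);
        });
    for (auto& t : threads)
        t.join();                    // afterwards, numbers holds the first 12 primes
}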
So far so good. But what if you don't have a std::vector<int> but a std::vector<Foo>? If Foo does not have a default constructor, then
std::vector<Foo> numbers(10);
will be invalid. And even if it has one, it would be outrageous to create many expensive default-constructed objects just to re-assign them soon without ever retrieving the value.
Of course, most well-designed classes should have a very cheap default constructor. For example, a std::string is default constructed to an empty string that requires no memory allocation. A good implementation will reduce the cost of default-constructing a string to just
std::memset(this, 0, sizeof(std::string));
And if the compiler is smart enough to figure out that we are allocating and initializing an entire std::vector<std::string>(n), it might be able to optimize this further to a single call to
std::calloc(n, sizeof(std::string));
So if there is any chance you can make Foo cheaply default-constructible and assignable, you are done. However, if this turns out to be difficult, you can avoid the problem by moving it to the heap. A smart pointer is cheaply default-constructible, so
std::vector<std::unique_ptr<Foo>> foos(n);
will eventually reduce to a
std::calloc(n, sizeof(std::unique_ptr<Foo>));
without you doing anything to Foo. Of course, this convenience comes at the price of a dynamic memory allocation for each element.
std::vector<std::unique_ptr<Foo>> foos(n);
// thread A // thread B // thread C
foos[0].reset(new Foo {...}); foos[n / 3 + 0].reset(new Foo {...}); foos[2 * n / 3 + 0].reset(new Foo {...});
foos[1].reset(new Foo {...}); foos[n / 3 + 1].reset(new Foo {...}); foos[2 * n / 3 + 1].reset(new Foo {...});
foos[2].reset(new Foo {...}); foos[n / 3 + 2].reset(new Foo {...}); foos[2 * n / 3 + 2].reset(new Foo {...});
... ... ...
This might not be as bad as you might think: while dynamic memory allocations are not free, the sizeof a std::unique_ptr is very small, so if sizeof(Foo) is large, you get the bonus of a more compact vector that is faster to iterate. It all depends, of course, on how you intend to use your data.
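To make this concrete, a runnable sketch of the smart-pointer variant could look roughly like this; the Foo with a non-default constructor is a made-up stand-in:

#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

struct Foo {                       // stand-in for a type without a cheap default constructor
    explicit Foo(int v) : value(v) {}
    int value;
};

int main() {
    const std::size_t n = 12;
    const std::size_t p = 3;
    std::vector<std::unique_ptr<Foo>> foos(n);   // n null pointers: cheap to create

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < p; ++i)
        threads.emplace_back([&foos, i, n, p] {
            // each thread resets only the pointers in its own index range
            for (std::size_t j = i * n / p; j < (i + 1) * n / p; ++j)
                foos[j].reset(new Foo(static_cast<int>(j)));
        });
    for (auto& t : threads)
        t.join();
}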
If you don't know the exact number of elements in advance or are afraid you'll mess up the indexing, there is yet another way to do it: Have each thread fill its own vector and merge them at the end. Continuing the primes example, we get this.
std::vector<int> numbersA {}; // private store for thread A
std::vector<int> numbersB {}; // private store for thread B
std::vector<int> numbersC {}; // private store for thread C
// thread A // thread B // thread C
numbersA.push_back( 2); numbersB.push_back(11); numbersC.push_back(23);
numbersA.push_back( 3); numbersB.push_back(13); numbersC.push_back(29);
numbersA.push_back( 5); numbersB.push_back(17); numbersC.push_back(31);
numbersA.push_back( 7); numbersB.push_back(19); numbersC.push_back(37);
// Back on the main thread after A, B and C are joined:
std::vector<int> numbers(
numbersA.size() + numbersB.size() + numbersC.size());
auto pos = numbers.begin();
pos = std::move(numbersA.begin(), numbersA.end(), pos);
pos = std::move(numbersB.begin(), numbersB.end(), pos);
pos = std::move(numbersC.begin(), numbersC.end(), pos);
assert(pos == numbers.end());
// Now dispose of numbersA, numbersB and numbersC as soon as possible
// in order to release their no longer needed memory.
(The std::move used in the above code is the one from the algorithms library.)
This approach has the most desirable memory access pattern of all because numbersA, numbersB and numbersC write to completely independently allocated memory. Of course, we have to do the additional sequential work of joining the intermediate results. Note that the efficiency relies heavily on the fact that the cost of move-assigning an element is negligible compared to the cost of finding / creating it. At least as written above, the code also assumes that your type has a cheap default constructor. Of course, if this is not the case for your type, you can again use smart pointers.
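Assembled into a runnable sketch (with the primes hard-coded inside the threads, again standing in for the expensive computation), the whole merge approach could look roughly like this:

#include <algorithm>          // for the three-argument std::move
#include <cassert>
#include <initializer_list>
#include <thread>
#include <vector>

int main() {
    std::vector<int> numbersA, numbersB, numbersC;   // one private store per thread

    std::thread a([&] { for (int x : { 2,  3,  5,  7}) numbersA.push_back(x); });
    std::thread b([&] { for (int x : {11, 13, 17, 19}) numbersB.push_back(x); });
    std::thread c([&] { for (int x : {23, 29, 31, 37}) numbersC.push_back(x); });
    a.join(); b.join(); c.join();

    // Back on the main thread: splice the three private vectors together.
    std::vector<int> numbers(numbersA.size() + numbersB.size() + numbersC.size());
    auto pos = numbers.begin();
    pos = std::move(numbersA.begin(), numbersA.end(), pos);
    pos = std::move(numbersB.begin(), numbersB.end(), pos);
    pos = std::move(numbersC.begin(), numbersC.end(), pos);
    assert(pos == numbers.end());
}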
I hope this provided you with enough ideas to optimize your problem.
If you have never used smart pointers before, have a look at “RAII and smart pointers in C++” and check out the standard library's dynamic memory management facilities. The techniques shown above would of course also work with a std::vector<Foo *>, but we don't use resource-owning raw pointers like this in modern C++ any more.
The problem appears to be that your constructor is doing a lot of work, and this breaks all kinds of library conventions around construction and container insertion.
Just fix it by decoupling the insertion from the creation.
The code below is very similar to the code suggested by @5gon12eder, except that it doesn't "force" you to change object locality.
In my little demo we use a raw region of memory that's truly uninitialized (this is not possible with vector, where insertion implies initialization). So instead of the "canonical"
std::array<RKD, 500> rkd_buffer;
// OR std::vector<RKD> rkd_buffer(500);
// OR even std::unique_ptr<RKD[]> rkd_buffer(new RKD[500]);
we're going to use a custom combination:
std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer(
    static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)),
    ::free);
We then create a few threads (5 in the sample) to construct all the elements. The items are constructed in place, and their destructors are invoked explicitly at the end of main. It is therefore crucial that all items have been fully initialized before rkd_buffer goes out of scope (the join ensures this here). The threads could synchronize by different means: constructions could e.g. be dispatched via a work queue to a thread pool, where condition variables, promises, thread barriers (from Boost) or even just atomic shared counters could be used for the coordination.
All these choices are in essence unrelated to the task of getting the construction to run in parallel, so I'll leave that to your imagination (or other SO answers); one such alternative is sketched after the demo.
#include <chrono>
#include <cstdlib>
#include <memory>
#include <new>
#include <thread>
#include <vector>
using namespace std;

struct RKD {
    RKD() { this_thread::sleep_for(chrono::milliseconds(rand() % 100)); } // expensive
};

int main() {
    static const int N         = 500;
    static const int ChunkSize = 100;
    std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer(
        static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)),
        ::free);

    vector<thread> group;
    for (int chunk = 0; chunk < N / ChunkSize; ++chunk)    // 5 chunks of 100 constructions
        group.emplace_back([&rkd_buffer, chunk] {          // capture chunk by value, not by reference
            for (int i = chunk * ChunkSize; i < (chunk + 1) * ChunkSize; ++i)
                new (&(*rkd_buffer)[i]) RKD;               // construct element i in place
        });
    for (auto& t : group)
        if (t.joinable()) t.join();

    // we are responsible for destructing, since we also took responsibility for construction
    for (RKD& v : *rkd_buffer)
        v.~RKD();
}
You can see that there are 5 threads dividing 500 constructions between them. Each construction takes ~50 ms on average, so each thread needs about 100 × 50 ms ≈ 5 s, and since the threads run concurrently that is also the expected total wall-clock time. This is in fact precisely what happens:
real 0m5.193s
user 0m0.004s
sys 0m0.000s
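For illustration, here is a sketch of one of the coordination alternatives mentioned earlier: a shared atomic counter that hands out indices instead of fixed chunks. This is a hypothetical variation, not code from the answer; it assumes the RKD struct and raw buffer from the demo above.

#include <atomic>
#include <new>       // placement new
#include <thread>
#include <vector>

// Hypothetical variation: workers pull the next index to construct from a
// shared atomic counter instead of owning a fixed chunk of the buffer.
void construct_all(RKD* buffer, int n, int num_threads) {
    std::atomic<int> next{0};
    std::vector<std::thread> group;
    for (int t = 0; t < num_threads; ++t)
        group.emplace_back([&] {
            for (int i = next.fetch_add(1); i < n; i = next.fetch_add(1))
                new (buffer + i) RKD;   // construct element i in place
        });
    for (auto& th : group)
        th.join();
}

In the demo this could be invoked as construct_all(&(*rkd_buffer)[0], N, 5) in place of the fixed-chunk loop; the counter balances the load automatically when construction times vary, at the cost of some contention on next.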
Source: https://stackoverflow.com/questions/27887654/synchronise-push-back-and-stdthread